Welcome to the Node Level PCA!

Welcome to the Node Level PCA!
Variance explained ratio summation of components. x-axis shows the number of components that used in analysis, while y-axis shows percentage of variance explained. It is clear that after 15 components, PCA can explain most of the data.

In my last update, I mentioned that I applied PCA for all datapoints and the results. But I have hidden an important detail for this post 😉 When we added the explained variance percentages as a result of PCA, we could only approach 40%. This means that we were only able to obtain 40% of the information about all datapoints.

Less Data, More Components

Then, in order to get more successful results, we decided to shrink the dataset we applied PCA and increase the number of components until the explained variance percentage increased. Thus, I started working on a dataset containing nearly 200 thousand datapoints containing information about 10 nodes. We chose these nodes according to their CPU usage rates. We tried to select the nodes with the most job submissions within the selected 6 days.

Afterwards, I increased the number of components one by one and made PCA analysis. As a result, we reached a 98% variance explained using 15 components.

Outlier Detection by PCA

Error rates for one of the nodes. The nodes has 16 thousand datapoints which are on x-axis, while y-axis shows the error rate.

After that, we used 15 components in all PCA analyzes we did. We decided to clean up the dataset a bit with the help of PCA analysis. For this, we used the reverse transforms of the principal components, which we obtained as a result of the PCA analysis. Then, using the distance between these values and we removed some of the outliers from the dataset. While doing this, we set the maximum error rate as 5, in this way, we kept 95% of the data in the dataset and eliminated 5%.

Count of error rates in certain ranges. Mostly datapoints have error between 0-1 or 1-2. The x-axis shows the error ranges, while y-axis shows count of datapoints that have error rate between that interval.

Didn’t We Already Find the Outliers?

But of course, the question here is why did we clean this dataset? Wasn’t our goal to find outliers anyway? Yes, our goal is to find an outlier. But our plan for doing this is as follows: Let’s train a machine learning method on timeseries data and that model can predict the features in the next step. We will use this clean dataset to train this model. Next, we will find the anomaly using the model’s predictions.

Next Step

We’ve come to the coolest part of the project! Selecting a machine learning model and training it with this dataset that we cleaned. If you’re waiting for GPUs to mess with this, see you in the next post!

Tagged with: , , , ,

Leave a Reply

Your email address will not be published. Required fields are marked *


This site uses Akismet to reduce spam. Learn how your comment data is processed.