Curse of Dimensionality! Let’s go deep into the data

Hello everyone! It’s been a while since my first post, and I’m looking forward to telling you more about the data I’m working on.
How is the data structured?
Several data sources can be used in the system analytics project: EAR, XALT, SLURM and Prometheus. I am working with the Prometheus data. Prometheus is an open-source systems monitoring and alerting toolkit, and the data it collects contains a lot of node-level information about HPC systems. Prometheus records a new datapoint for each node every 30 seconds, so we get information about the nodes in the Lisa cluster every half minute.
To obtain a uniform dataset, I worked on 6 days of data from more than 200 CPU nodes, which amounts to more than 4 million datapoints. Each datapoint contains 70 different features, from file system usage to core temperatures and many more. Of course, this data was waiting for me in compressed form in a database. I first needed to decompress it and then make it suitable for PCA (principal component analysis), which would let us extract useful features for our project. Below is a small section of the first datapoint.
id : 1
timestamp : 2022-07-08 05:16:17
node : r11n1
node_arp_entries : {'admin0': '15'}
node_boot_time_seconds : 1657200000.0
node_context_switches_total : 31588000.0
node_disk_io_now : {'sda': '0'}
node_disk_read_bytes_total : {'sda': '918531584'}
node_disk_writes_completed_total : {'sda': '78263'}
node_disk_written_bytes_total : {'sda': '6146845696'}
node_intr_total : 39498500.0
node_load1 : 0.0
node_load15 : 0.0
node_load5 : 0.0
node_memory_Active_bytes : 920846000.0
node_memory_Dirty_bytes : 0.0
node_memory_MemFree_bytes : 98187700000.0
node_memory_Percpu_bytes : 7012350.0
node_netstat_Icmp_InErrors : 0.0
node_netstat_Icmp_InMsgs : 57.0
node_netstat_Icmp_OutMsgs : 57.0
node_netstat_Tcp_InErrs : 0.0
node_netstat_Tcp_InSegs : 443546.0
node_netstat_Tcp_OutSegs : 435529.0
node_netstat_Tcp_RetransSegs : 2.0
node_netstat_Udp_InDatagrams : 1839.0
node_netstat_Udp_InErrors : 0.0
node_netstat_Udp_OutDatagrams : 3166.0
node_network_receive_bytes_total : {'admin0': '2631816125', 'eno2': '0', 'lo': '3771950'}
node_network_receive_drop_total : {'admin0': '0', 'eno2': '0', 'lo': '0'}
node_network_receive_multicast_total : {'admin0': '2', 'eno2': '0', 'lo': '0'}
node_network_receive_packets_total : {'admin0': '702717', 'eno2': '0', 'lo': '32833'}
node_network_transmit_bytes_total : {'admin0': '145355117', 'eno2': '0', 'lo': '3771950'}
node_network_transmit_packets_total : {'admin0': '408320', 'eno2': '0', 'lo': '32833'}
node_procs_blocked : 0.0
node_procs_running : 7.0
node_rapl_package_joules_total : {'0': '52516.334246'}
node_thermal_zone_temp : {'x86_pkg_temp_0': '26'}
node_time_seconds : 1657250000.0
node_udp_queues : {'v4_rx': '0', 'v4_tx': '0', 'v6_rx': '0', 'v6_tx': '0'}
nvidia_gpu_duty_cycle : None
nvidia_gpu_fanspeed_percent : None
nvidia_gpu_memory_used_bytes : None
nvidia_gpu_power_usage_milliwatts : None
nvidia_gpu_temperature_celsius : None
surfsara_ambient_temp : 24.0
surfsara_power_usage : 36.0
up : 1.0
Yes, this is just a small portion of a single datapoint!
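As a rough illustration of how a datapoint like the one above could be turned into purely numerical features, here is a minimal Python sketch. It is not the code used in the project: the helper name flatten_datapoint, the column naming scheme and the choice to fill missing values with 0.0 are all assumptions, and non-numeric fields such as timestamp and node are assumed to have been set aside beforehand.

```python
def flatten_datapoint(datapoint):
    """Turn one raw record into a flat dict of floats (hypothetical helper)."""
    flat = {}
    for name, value in datapoint.items():
        if isinstance(value, dict):
            # One column per sub-key, e.g. node_network_receive_bytes_total_lo
            for sub_key, sub_value in value.items():
                flat[f"{name}_{sub_key}"] = float(sub_value)
        elif value is None:
            # Missing metric (e.g. GPU metrics on a CPU node): assumed to be 0.0 here
            flat[name] = 0.0
        else:
            flat[name] = float(value)
    return flat


# A few fields copied from the sample datapoint
sample = {
    "node_load1": 0.0,
    "node_disk_io_now": {"sda": "0"},
    "node_network_receive_bytes_total": {
        "admin0": "2631816125", "eno2": "0", "lo": "3771950",
    },
    "nvidia_gpu_duty_cycle": None,
}
flat_sample = flatten_datapoint(sample)
print(flat_sample)
```

Expanding dictionary-valued metrics into one column per device or interface is one reason the number of features can change during pre-processing.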
A bit of parallelization
To use the data in PCA, all features had to be numerical and normalized. I first wrote serial code for this, which took approximately 20 hours to run on the database of more than 4 million datapoints. But of course, our time is precious! So I parallelized the code and made it run 4x faster, even on my local computer. (I’m going to be a little proud of myself here, because this is my first time doing parallel programming :))
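The post does not show the parallel code, so the following is only a sketch of one common way to parallelize this kind of per-datapoint work in Python: split the datapoints into chunks and hand each chunk to a worker process. The function names (to_numeric, split_into_chunks, process_chunk, process_in_parallel) and the default of 4 workers are assumptions made for illustration.

```python
from concurrent.futures import ProcessPoolExecutor


def to_numeric(datapoint):
    """Hypothetical per-datapoint step: flatten nested dicts, cast values to float."""
    flat = {}
    for name, value in datapoint.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{name}_{sub_key}"] = float(sub_value)
        else:
            flat[name] = 0.0 if value is None else float(value)
    return flat


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, nearly equal pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk):
    """Work done by one worker process: convert every datapoint in its chunk."""
    return [to_numeric(dp) for dp in chunk]


def process_in_parallel(datapoints, workers=4):
    """Convert all datapoints with a pool of worker processes, preserving order."""
    chunks = split_into_chunks(datapoints, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process_chunk, chunks)  # one task per chunk
        return [row for chunk_result in results for row in chunk_result]
```

When run as a script, process_in_parallel(rows) should be called from under an `if __name__ == "__main__":` guard, since worker processes may re-import the main module. The conversion step parallelizes easily because each datapoint is independent; normalization needs statistics over the whole dataset, so it would typically be a separate pass afterwards.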
PCA Results
After pre-processing, I had 4787342 datapoints, each containing 74 features. I then normalized this dataset and applied PCA. For ease of visualization, I primarily used 3 components. We expected PCA to reveal some clusters, but the result has a linear structure instead. Could PCA somehow have picked up the passage of time, even though we did not include the timestamps in the analysis?
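For readers who want to see the mechanics, here is a minimal NumPy sketch of normalizing a dataset and projecting it onto 3 principal components. It runs on random synthetic data with 74 columns, not on the Prometheus data, and it standardizes each feature to zero mean and unit variance, which is only one common choice of normalization (the post does not say which one was used). The function name pca is an assumption.

```python
import numpy as np


def pca(data, n_components=3):
    """PCA via SVD on standardized data.

    Returns (projected data, principal directions, explained variance ratios).
    """
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std == 0] = 1.0  # constant columns carry no variance; avoid dividing by zero
    z = (data - mean) / std
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, singular_values, vt = np.linalg.svd(z, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    components = vt[:n_components]
    return z @ components.T, components, explained[:n_components]


rng = np.random.default_rng(0)
toy = rng.normal(size=(1000, 74))  # synthetic stand-in, NOT the Prometheus data
projected, components, explained = pca(toy, n_components=3)
print(projected.shape, components.shape)
```

The three columns of projected are what one would plot in a 3D scatter plot, and explained gives the fraction of the total variance captured by each component.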
We also obtained other interesting results. Looking at the PCA components, we noticed that almost nothing about the file system is useful for explaining the data. Apart from that, features reported in similar units, such as node_load1, node_load5 and node_load15, are almost equally important and rank high when we order the features within the components.
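One way to read feature importance off the PCA components is to rank features by the absolute size of their loadings. The sketch below does this on made-up loadings, assuming the load averages node_load1, node_load5 and node_load15 from the sample datapoint are the features meant above. The numbers, the file system feature name and the function name rank_features are purely illustrative and are not results from the project.

```python
import numpy as np


def rank_features(components, feature_names, top=5):
    """Rank features by their largest absolute loading across the given components."""
    importance = np.abs(components).max(axis=0)
    order = np.argsort(importance)[::-1][:top]
    return [(feature_names[i], float(importance[i])) for i in order]


# Made-up loadings for 2 components over 4 features (illustrative only)
names = ["node_load1", "node_load5", "node_load15", "node_filesystem_free_bytes"]
toy_components = np.array([
    [0.57, 0.58, 0.59, 0.01],
    [0.10, -0.05, -0.04, 0.02],
])
ranked = rank_features(toy_components, names, top=4)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

In this toy example the three load averages have nearly identical scores and the file system feature ends up last, which mirrors the kind of pattern described above.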
What is next?
Based on this study, we decided to focus on a single node, use PCA to get rid of some outliers, and then continue with a new dataset. I will share the details with you in my next post!