Swimming in the data lake
A data lake is a store of large quantities of data that has little to no structure. It combines data from different sources that were not originally intended to be part of a larger monitoring infrastructure.
In the case of supercomputers, the data lake is a combination of different monitoring systems and services, each originally intended to provide diagnostic information about a single computer component (such as CPU temperature and frequency, storage health, network status …). To provide meaningful insights (with machine learning, for instance), the data first has to be structured and, as far as possible, unified. Because data collection and storage were not originally set up with data mining in mind, there are some difficulties that have to be taken into consideration:
- Different sampling frequencies and unaligned timestamps. Different monitoring services (or even different sensors) sample at different frequencies, start sampling at different times, or both.
- Different services available on different subsections. At CINECA there are three major supercomputer systems: MARCONI, GALILEO and DAVIDE. They all have different hardware architectures and consequently different monitoring solutions. Even a single supercomputer system can have multiple partitions with different hardware architectures (MARCONI, for instance, has Skylake and Knights Landing partitions, whose sensors have different characteristics).
- Missing data. Even on a single partition of a supercomputer system (or a single node), not all reporting services are available all the time.
There are many possible approaches when handling loosely structured data. The main approaches and necessary considerations include:
- Different sampling frequencies are usually mitigated by time aggregation. The tradeoff in time aggregation is between data granularity and missing values. Aggregates over longer time periods usually have fewer missing features (attributes) but also lose information about the temporal development of the system. In any case, the minimal period over which data is aggregated is governed by the attribute(s) with the lowest sampling frequency: an aggregation window shorter than that attribute's sampling period will leave it empty in most windows.
- With different monitoring services the tradeoff is between building a single model (for the whole system) and building a separate model for each node or partition. When preparing the data for a more general model, there can be a lot of missing values, because not every service exists on every partition. Imputing these values introduces additional noise into the dataset, which harms the effectiveness of the model trained on such data. On the other hand, a single node (or partition) on its own might not have enough data to construct a usable model.
- With missing data the tradeoff is between removing a feature with a lot of missing values, removing a timestamp (or sequence) with a lot of missing features, and imputing the missing values. Removing timestamps or features reduces the size of the dataset, while imputing values introduces noise. A reasonable balance between removal and imputation is thus to impute values if only a few are missing and to remove the features (or sequences) otherwise.
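To make the first and the last tradeoff more concrete, here is a minimal sketch in plain Python (standard library only). Everything in it is an assumption made for illustration: the readings are invented, the function names `aggregate`, `missing_fraction` and `clean` are hypothetical, and the 50% threshold and mean imputation are just one possible choice, not the procedure used on the actual CINECA data.

```python
from collections import defaultdict
from statistics import mean


def aggregate(samples, window_s):
    """Average (timestamp_s, feature, value) samples over fixed windows.

    Returns {window_start: {feature: mean}}. A feature that did not report
    inside a window is absent from that window's dict (a missing value).
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for ts, feature, value in samples:
        buckets[ts // window_s * window_s][feature].append(value)
    return {
        start: {f: mean(vals) for f, vals in feats.items()}
        for start, feats in sorted(buckets.items())
    }


def missing_fraction(rows, feature):
    """Share of windows in which `feature` has no value."""
    return sum(feature not in row for row in rows.values()) / len(rows)


def clean(rows, features, max_missing=0.5):
    """Drop sparse features, drop sparse windows, mean-impute the rest."""
    if not 0 <= max_missing < 1:
        raise ValueError("max_missing must be in [0, 1)")
    if not rows:
        return {}
    # 1. Remove features that are missing in too many windows.
    kept = [f for f in features if missing_fraction(rows, f) <= max_missing]
    if not kept:
        return {}
    # 2. Remove windows that are missing too many of the remaining features.
    dense = {
        start: row for start, row in rows.items()
        if sum(f not in row for f in kept) / len(kept) <= max_missing
    }
    # 3. Impute the few remaining gaps with the mean of the observed values.
    fill = {f: mean(r[f] for r in rows.values() if f in r) for f in kept}
    return {
        start: {f: row.get(f, fill[f]) for f in kept}
        for start, row in dense.items()
    }


# Invented readings: cpu_temp every 5 s from t=0, fan_rpm every 20 s
# from t=3, with the fan_rpm reading at t=23 lost.
samples = [(t, "cpu_temp", 50.0 + t // 5) for t in range(0, 60, 5)]
samples += [(3, "fan_rpm", 3000.0), (43, "fan_rpm", 3200.0)]

fine = aggregate(samples, window_s=5)     # 12 windows
coarse = aggregate(samples, window_s=20)  # 3 windows
print(missing_fraction(fine, "fan_rpm"))    # 10 of 12 windows lack fan_rpm
print(missing_fraction(coarse, "fan_rpm"))  # 1 of 3 windows lacks fan_rpm
print(clean(coarse, ["cpu_temp", "fan_rpm"]))
```

With 5-second windows the slowly sampled `fan_rpm` is missing almost everywhere and `clean` drops it entirely. With 20-second windows only one value is missing, so it is imputed instead, at the price of averaging away the changes in `cpu_temp` inside each window.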
The goal of all data preparation steps is to create a training set that will:
- Be as big as possible (with as few data points removed as possible)
- Contain as much of the original information as possible
- Contain as little noise as possible
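These three goals pull against each other, so it can help to put rough numbers on them when comparing preparation strategies. The sketch below is one hypothetical way to do that: `training_set_report` is an invented helper, and the share of surviving features and the share of imputed cells are only crude proxies for "original information" and "noise".

```python
def training_set_report(raw_rows, clean_rows):
    """Compare a cleaned table against the raw one it was derived from.

    Both arguments map a timestamp to a {feature: value} dict. A missing
    value in raw_rows is a feature absent from the row. The timestamps in
    clean_rows are assumed to be a subset of those in raw_rows.
    """
    raw_features = {f for row in raw_rows.values() for f in row}
    kept_features = {f for row in clean_rows.values() for f in row}
    cells = sum(len(row) for row in clean_rows.values())
    # A cell counts as imputed if the raw row had no value for that feature.
    imputed = sum(
        f not in raw_rows[ts] for ts, row in clean_rows.items() for f in row
    )
    return {
        "rows_kept": len(clean_rows) / len(raw_rows),              # size
        "features_kept": len(kept_features) / len(raw_features),   # information
        "imputed_cells": imputed / cells if cells else 0.0,        # noise
    }


# Invented example: one fan_rpm value was missing and has been imputed.
raw = {
    0: {"cpu_temp": 51.5, "fan_rpm": 3000.0},
    20: {"cpu_temp": 55.5},
    40: {"cpu_temp": 59.5, "fan_rpm": 3200.0},
}
cleaned = {ts: dict(row) for ts, row in raw.items()}
cleaned[20]["fan_rpm"] = 3100.0

report = training_set_report(raw, cleaned)
print(report)
```

Here all rows and features survive, and one cell out of six is imputed. A stricter strategy that dropped the window at 20 seconds would lower `rows_kept` but bring `imputed_cells` down to zero.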
In the next blog post I will delve deeper into the particular features of the data set described here (the data set from the supercomputer monitoring services).