Learning on unbalanced classes
The goal of my project is to construct a classifier (learner) that can recognize, and possibly predict, abnormal behavior of an HPC system. Naturally, abnormal behavior and faults make up a relatively small part of the overall data collected from the HPC system, so the dataset is unbalanced: one label (for instance, normal operation of the HPC system) is far more frequent than all other labels (for instance, the different possible failures of the system). Traditional supervised learning approaches tend to perform poorly on such datasets. To address this, two general approaches (strategies) are possible:
- Supervised learning on the unbalanced dataset (with strategies to address the imbalance).
- Semi-supervised/unsupervised learning.
On an unbalanced dataset, supervised learning algorithms focus too much on the prevalent class. There are two possible approaches to remedy this:
- Training set manipulation.
- Tuning of the classification algorithm. Some classification algorithms (such as SVMs) allow us to place more importance on less frequent classes by assigning them larger weights.
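As a rough illustration of class weighting, the sketch below computes one weight per class that is inversely proportional to the class frequency. The label names and counts are made up for the example, and the helper `balanced_class_weights` is hypothetical. The formula is the common "balanced" heuristic, n_samples / (n_classes * class_count); a dictionary of this shape is what, for instance, scikit-learn's `SVC` accepts through its `class_weight` parameter.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    Uses the heuristic n_samples / (n_classes * class_count), so rare
    classes receive large weights and the prevalent class a small one.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical labels: 95 normal samples, 4 disk faults, 1 memory fault.
labels = ["normal"] * 95 + ["disk_fault"] * 4 + ["memory_fault"] * 1
weights = balanced_class_weights(labels)

for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.2f}")
```

With these weights, misclassifying one of the rare fault samples costs the learner far more than misclassifying a normal sample, which counteracts the pull towards the prevalent class.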
The idea of training set manipulation is to balance the number of positive and negative examples. Two approaches are relevant here:
- Oversampling: either taking (randomly) the same samples multiple times or producing new samples from existing ones. Oversampling is performed on the smaller class and is useful when the training dataset is small.
- Undersampling: omitting (randomly) examples from the training set or constructing a new (smaller) set of training instances from the existing set (for example, taking centroids as new instances). Undersampling is performed on the bigger class and is useful on larger datasets. An additional benefit of undersampling is that it reduces the size of the training set and thus speeds up computation.
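The two random variants described above can be sketched in a few lines. This is a minimal example under stated assumptions: the function `resample`, the `target_size` parameter, and the toy data are hypothetical, and only the simplest strategy (random duplication and random omission) is shown, not the generation of new samples or the centroid-based variant.

```python
import random
from collections import defaultdict


def resample(samples, labels, target_size, seed=0):
    """Randomly resample every class to exactly target_size examples.

    Classes with fewer examples are oversampled: all originals are kept
    and extra copies are drawn with replacement. Classes with more
    examples are undersampled: a random subset is drawn without
    replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    new_samples, new_labels = [], []
    for label, group in by_class.items():
        if len(group) >= target_size:
            # Undersampling: omit random examples of the bigger class.
            chosen = rng.sample(group, target_size)
        else:
            # Oversampling: repeat random examples of the smaller class.
            extra = rng.choices(group, k=target_size - len(group))
            chosen = group + extra
        new_samples.extend(chosen)
        new_labels.extend([label] * target_size)
    return new_samples, new_labels


# Hypothetical data: 90 normal and 10 faulty measurements (ids 0..99).
samples = list(range(100))
labels = ["normal"] * 90 + ["fault"] * 10
xs, ys = resample(samples, labels, target_size=30)

print(ys.count("normal"), ys.count("fault"))
```

Resampling should be applied to the training set only, so that the test set keeps the class distribution the classifier will meet in practice.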
In the case of an extremely unbalanced dataset (where, for instance, one label represents 98% of the dataset), a semi-supervised/unsupervised approach is more useful. The idea is to learn (for instance, with autoencoders) what is characteristic of baseline behavior and then flag every behavior that deviates from the baseline. The drawback is that, in its basic implementation, this approach only produces a binary classifier: it says whether something is an anomaly or potential fault, but not what kind of fault it is.
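The principle of learning a baseline and flagging deviations can be shown without a neural network. The sketch below is not an autoencoder: as a deliberately simple stand-in, it models the baseline with per-feature means and standard deviations and scores a sample by its largest z-score. The class `BaselineDetector`, the threshold of 4 standard deviations, and the sensor values are all assumptions made for the example.

```python
import statistics


class BaselineDetector:
    """Flags samples that deviate strongly from the fitted baseline."""

    def __init__(self, threshold=4.0):
        # Assumed cut-off, measured in standard deviations.
        self.threshold = threshold

    def fit(self, baseline_rows):
        """Learn per-feature mean and spread from normal-operation data."""
        columns = list(zip(*baseline_rows))
        self.means = [statistics.fmean(col) for col in columns]
        # Fall back to 1.0 for constant features to avoid division by zero.
        self.stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
        return self

    def score(self, row):
        """Return the largest absolute z-score over all features."""
        return max(
            abs(x - m) / s
            for x, m, s in zip(row, self.means, self.stdevs)
        )

    def is_anomaly(self, row):
        return self.score(row) > self.threshold


# Hypothetical baseline: (CPU temperature in degrees C, fan speed in RPM)
# recorded during normal operation.
baseline = [(60 + i % 5, 3000 + 10 * (i % 7)) for i in range(200)]
detector = BaselineDetector().fit(baseline)

print(detector.is_anomaly((62, 3030)))  # a typical reading
print(detector.is_anomaly((95, 3030)))  # an overheating node
```

Like the autoencoder variant, this detector only answers whether a sample is anomalous; it carries no information about which kind of fault caused the deviation.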