Preprocessing pipeline and train/test set separation
In my last blog post I discussed over/under-sampling as a method for addressing imbalanced datasets. Because it is a data transformation, we have to be careful when evaluating the performance of classifiers trained with over/under-sampling. In this post I will briefly discuss how to construct a data transformation pipeline so that it does not bias the evaluation of the classifier.
It is a well-known principle that, when evaluating a classifier (or any supervised ML model), the model must be trained only on the training set and then evaluated on previously unseen test data. This lets us estimate the model's generalization power rather than its ability to over-fit the training set.
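A minimal sketch of this evaluation setup with scikit-learn (the toy data and logistic regression model are illustrative assumptions, not part of the original post):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 2 features, a linearly separable label (illustrative only).
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out a test set that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
# Accuracy on unseen data estimates generalization, not training fit.
print(model.score(X_test, y_test))
```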
The same principle should also apply to every data transformation that takes into account any information from the dataset. These include:
- Normalization of numeric data
- One-hot encoding
- Over/under-sampling
Separating the train and test sets during preprocessing means, just as in model training, that the transformation pipeline must be fitted only on the train set and then applied to the test set. For the transformations listed above this means:
- Mean, standard deviation, minimum and maximum are computed on the train set only. For instance, although the train set is normalized to values between 0 and 1, the transformed test set can contain values outside this interval.
- Only the classes present in the train set are taken into account in one-hot encoding. This means that we (usually) group all previously unseen values into the same category.
- Over/under-sampling is performed on the train set only. The model is then trained on the resampled train set and evaluated on the (unaltered) test set.
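The first two points can be sketched with scikit-learn's `MinMaxScaler` and `OneHotEncoder` (the toy values are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Numeric feature: fit the scaler on the training data only.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[0.5], [4.0]])  # contains values outside the train range

scaler = MinMaxScaler()
scaler.fit(X_train)  # min/max are estimated from the train set only
# Transformed test values fall outside [0, 1]: -0.25 and 1.5.
print(scaler.transform(X_test))

# Categorical feature: only train-set classes get a column; unseen test
# values are all grouped into the same all-zeros encoding.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["red"], ["blue"]])
print(enc.transform([["green"]]).toarray())  # [[0. 0.]]
```

`handle_unknown="ignore"` is what implements the "group all previously unseen values into the same group" behaviour described above.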
In the case of my project I have to take all three of these methods into account:
- Normalization of numeric data
- One-hot encoding of categorical data
- Over/under-sampling of the train set
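The three steps above can be wired together roughly as follows. This is a sketch, not the project's actual code: the column names are made up, and simple random over-sampling via `sklearn.utils.resample` stands in for whatever over/under-sampling method the project uses.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.utils import resample

# Toy data; the "amount"/"color"/"label" column names are illustrative.
train = pd.DataFrame({"amount": [1.0, 2.0, 3.0, 2.5],
                      "color": ["red", "blue", "red", "red"],
                      "label": [0, 0, 0, 1]})
test = pd.DataFrame({"amount": [4.0], "color": ["green"], "label": [1]})

# Step 3: over-sample the minority class -- on the train set only.
majority = train[train["label"] == 0]
minority = train[train["label"] == 1]
upsampled = resample(minority, replace=True,
                     n_samples=len(majority), random_state=0)
train_balanced = pd.concat([majority, upsampled])

# Steps 1 + 2: fit normalization and one-hot encoding on the train set only.
pre = ColumnTransformer([
    ("num", MinMaxScaler(), ["amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
])
X_train = pre.fit_transform(train_balanced[["amount", "color"]])
# The test set is only transformed, never resampled or used for fitting.
X_test = pre.transform(test[["amount", "color"]])
```

The key point is the asymmetry: `fit_transform` touches only the (resampled) train set, while the test set passes through `transform` unaltered, so its scaled values can land outside [0, 1] and its unseen category encodes to all zeros.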