I've been trying to apply some basic supervised learning techniques to a dataset of mine, and I have several questions about the overall procedure (i.e. data preprocessing, model evaluation, etc.).
Before I start posing the questions, let me give you an idea of what my dataset looks like. The dataset comes from the OpenML repository; it consists of 22 different types of articles (the targets, or classes) and 1079 different words (the features). The aim is to classify new blurbs of these articles based on these 1079 words. Below you can see the first 5 rows of my dataset.
As you can see from the above snippet, the first 22 columns contain my targets (i.e. the type of article) and the remaining columns are my word predictors. The values of my features are binary: «1» if the word appears in the blurb and «0» otherwise. First, I do a little preprocessing: I separate the targets from the features, convert the boolean values that correspond to the targets to 0 and 1, and give labels to my articles (i.e. 0: «Entertainment», 1: «Interviews», etc.). In the following snippet you can see the distribution of my samples for each type of article.
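Roughly, this preprocessing step looks like the following (just a sketch; I assume here that the data is loaded into a dataframe df whose first 22 columns are the boolean targets, and that samples belonging to more than one class have already been dropped, as I mention at the end of the post):

import pandas as pd

# df is the dataframe shown above: the first 22 columns are the boolean targets,
# the remaining 1079 columns are the binary word features
target_cols = df.columns[:22]
features = df.drop(columns=target_cols).astype(int)

# Convert the boolean targets to 0/1 and collapse them into a single integer
# label per article (0: «Entertainment», 1: «Interviews», ...)
targets = df[target_cols].astype(int)
labels = targets.values.argmax(axis=1)

# Distribution of samples per type of article
print(pd.Series(labels).value_counts())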
As you can see from the above image, my dataset is imbalanced. My aim is to try the following classification algorithms and choose the best one at the end: 1) GaussianNB (GNB), 2) KNearestNeighbors (KNN), 3) LogisticRegression (LR), 4) Multi-Layer Perceptron (MLP) and 5) Support Vector Machines (SVM). Before preprocessing my dataset in more detail, I split it into 70% train and 30% test and run an out-of-the-box test of these algorithms. In the following table you can see my results.
Classifier | f1score (train) | acc (train) | f1score (test) | acc (test) | Time for f1score (train), sec |
---|---|---|---|---|---|
GaussianNB | 23% | 41% | 24% | 40% | 0.54 |
Dummy | 1% | 16% | 2% | 19% | 0.01 |
1NN | 4% | 18% | 8% | 24% | 0.22 |
Logistic | 34% | 56% | 40% | 62% | 1.04 |
MLP | 33% | 54% | 39% | 59% | 10.5 |
SVM | 19% | 46% | 32% | 57% | 3.7 |
The first two columns refer to the outcome of each classifier using 2-fold cross-validation on my training set according to the accuracy and F1 (average = 'macro') metrics, and the last two columns are the results of the classifiers on the test set according to the same metrics. From the above table you can observe that some of these classifiers have low performance. As my next task, I am trying to preprocess my data in more detail using techniques like StandardScaler, VarianceThreshold, PCA or SelectKBest, and RandomOverSampler, and to optimize the parameters of my classifiers with GridSearchCV. At this point I can pose my first question.
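For reference, such an out-of-the-box comparison can be produced roughly as follows (a sketch with default hyperparameters, except for max_iter, which is raised for the iterative models so they converge):

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score, accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0)  # 70% / 30% split

classifiers = {
    'GaussianNB': GaussianNB(),
    'Dummy': DummyClassifier(strategy='stratified'),
    '1NN': KNeighborsClassifier(n_neighbors=1),
    'Logistic': LogisticRegression(max_iter=1000),
    'MLP': MLPClassifier(max_iter=500),
    'SVM': SVC(),
}

for name, clf in classifiers.items():
    # 2-fold cross-validation on the training set (macro F1 and accuracy)
    cv = cross_validate(clf, X_train, y_train, cv=2,
                        scoring=['f1_macro', 'accuracy'])
    # Fit on the whole training set and evaluate on the held-out test set
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"{name}: "
          f"CV f1 = {cv['test_f1_macro'].mean():.2f}, "
          f"CV acc = {cv['test_accuracy'].mean():.2f}, "
          f"test f1 = {f1_score(y_test, y_pred, average='macro'):.2f}, "
          f"test acc = {accuracy_score(y_test, y_pred):.2f}")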
Q1 Are the above techniques (except GridSearchCV, which I completely understand) guaranteed to improve the performance of the classifiers? I mean, is there any justification that these techniques generally improve results, or is it just a trial-and-observation procedure?
In the next block of code I create a pipeline that, before training the model (in this case only GaussianNB), sequentially applies oversampling, removal of low-variance features, and scaling. Observe that I first split my dataset into train and test and only then apply the oversampling, so it affects the training set only.
# Case 1: splitting first, then oversampling on the training set only
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

selector = VarianceThreshold()   # drop zero-variance features
scaler = StandardScaler()
over = RandomOverSampler()
clf = GaussianNB()

# The imblearn Pipeline applies the resampler only when fitting,
# so the test set passed to score() is never oversampled.
pipe = Pipeline(steps=[('over', over), ('selector', selector),
                       ('scaler', scaler), ('GNB', clf)])

train_new, test_new, train_new_labels, test_labels_new = train_test_split(
    features, labels, test_size=0.3)

pipe.fit(train_new, train_new_labels)
pipe.score(test_new, test_labels_new)
The accuracy on the test set of the above pipeline is 38%, which is 2% less than the score of GaussianNB without this preprocessing procedure, so my second question is the following.
Q2 Why did these modifications deteriorate the performance of the classifier? Is there any sign in the values of my dataset, or in its structure, that could have predicted this outcome?
Now, if I swap the order of the splitting and the oversampling, I get completely different results. For example, I run the following block of code.
# Case 2: oversampling before splitting
from sklearn.feature_selection import SelectKBest

clf = GaussianNB()

# First do the oversampling on the whole dataset, then do the splitting
features2, labels2 = over.fit_resample(features, labels)
train_new, test_new, train_new_labels, test_labels_new = train_test_split(
    features2, labels2, test_size=0.3)

# Preprocessing: fit each transformer on the training set, apply it to both sets
train_new = selector.fit_transform(train_new)
test_new = selector.transform(test_new)
train_new = scaler.fit_transform(train_new)
test_new = scaler.transform(test_new)

selector2 = SelectKBest(k=100)   # univariate feature selection (k chosen here just as an example)
train_new = selector2.fit_transform(train_new, train_new_labels)
test_new = selector2.transform(test_new)

clf.fit(train_new, train_new_labels)    # Fit on the (oversampled) training set
clf.score(test_new, test_labels_new)    # Evaluate on the (oversampled) test set
I get 74% accuracy on the test set, which is much better than before. So my question is:
Q3 Why did swapping the order of the splitting and the oversampling change the result that much? In general, must I do the splitting first and then apply the preprocessing only to my training set? For example, I understand that if I first apply some preprocessing like scaling or PCA and then split my set, my results would be biased since I have preprocessed the test set as well, but I don't understand why this happens with oversampling too (if this is the case).
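One way to make this concrete (just a sketch, reusing the features and labels from above) is to count how many test rows in Case 2 are exact duplicates of training rows, which mostly arise because RandomOverSampler copies samples before the split:

import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# Oversample first, then split, exactly as in Case 2
features2, labels2 = RandomOverSampler().fit_resample(features, labels)
train_new, test_new, _, _ = train_test_split(features2, labels2, test_size=0.3)

# Count test rows that are identical to some training row
train_rows = {tuple(row) for row in np.asarray(train_new)}
n_leaked = sum(tuple(row) in train_rows for row in np.asarray(test_new))
print(f"{n_leaked} of {len(test_new)} test rows also appear in the training set")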
To give you another view of the above result, below I show the learning curve of GaussianNB under 10-fold cross-validation in the second case, where I first do the oversampling and then the splitting.
As you can see from the above snippet, the validation score and the training score converge to the same value, which is a good indication that the model can achieve good generalization performance.
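For completeness, such a learning curve can be produced with scikit-learn's learning_curve (a sketch, using the oversampled features2 and labels2 from Case 2):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

# 10-fold CV learning curve of GaussianNB on the oversampled data (Case 2)
train_sizes, train_scores, val_scores = learning_curve(
    GaussianNB(), features2, labels2, cv=10, scoring='accuracy',
    train_sizes=np.linspace(0.1, 1.0, 10))

plt.plot(train_sizes, train_scores.mean(axis=1), label='training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='validation score')
plt.xlabel('training set size')
plt.ylabel('accuracy')
plt.legend()
plt.show()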
Q4 Which of the above two cases is more reliable for giving good results on future unseen samples? Furthermore, what kind of preprocessing would you suggest for the above dataset? For example, in these two runs I deleted all samples from my dataset that belonged to 2 or more classes, a modification that left me with 84% of my initial dataset. Would it have been better to create duplicates of these samples instead, to counteract the imbalance of the dataset as much as possible?
PS: Excuse the long post. I don't expect answers to all of the above questions; I would be very pleased if you shared your opinions on any of them where you have some insight. Thanks in advance!