I've been trying to apply some basic supervised learning techniques to a dataset of mine, and I have several questions about the overall procedure (i.e. data preprocessing, model evaluation, etc.).
Before I start posing the questions, let me give you an idea of what my dataset looks like. The dataset comes from the OpenML repository; it consists of 22 different types of articles (the targets, or classes) and 1079 different words (the features). The aim is to classify new blurbs of these articles based on these 1079 words. Below you can see the first 5 rows of my dataset.
As you can see from the above snippet, the first 22 columns contain my targets (i.e. the type of article) and the remaining columns are my word predictors. The values of my features are binary: «1» if the word appears in the blurb and «0» otherwise. First, I do a little preprocessing: I separate the targets from the features, convert the boolean values that correspond to the targets to 0 and 1, and give labels to my articles (i.e. 0: «Entertainment», 1: «Interviews», etc.). In the following snippet you can see the distribution of my samples for each type of article.
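Roughly, this preprocessing step looks like the following (just a sketch; I assume here that the data is loaded into a dataframe df whose first 22 columns are the boolean targets, and that samples belonging to more than one class have already been dropped, as I mention at the end of the post):

import pandas as pd

# df is the dataframe shown above: the first 22 columns are the boolean targets,
# the remaining 1079 columns are the binary word features
target_cols = df.columns[:22]
features = df.drop(columns=target_cols).astype(int)

# Convert the boolean targets to 0/1 and collapse them into a single integer
# label per article (0: «Entertainment», 1: «Interviews», ...)
targets = df[target_cols].astype(int)
labels = targets.values.argmax(axis=1)

# Distribution of samples per type of article
print(pd.Series(labels).value_counts())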
As you can see from the above image, my dataset is imbalanced. My aim is to try the following classification algorithms and choose the best one at the end: 1) GaussianNB (GNB), 2) KNearestNeighbors (KNN), 3) LogisticRegression (LR), 4) Multi-Layer Perceptron (MLP) and 5) Support Vector Machines (SVM). Before preprocessing my dataset in more detail, I split it into 70% train and 30% test and run an out-of-the-box test of these algorithms. In the following table you can see my results.
Classifier | f1score (train) | acc (train) | f1score (test) | acc (test) | Time for f1score (train), sec |
---|---|---|---|---|---|
GaussianNB | 23% | 41% | 24% | 40% | 0.54 |
Dummy | 1% | 16% | 2% | 19% | 0.01 |
1NN | 4% | 18% | 8% | 24% | 0.22 |
Logistic | 34% | 56% | 40% | 62% | 1.04 |
MLP | 33% | 54% | 39% | 59% | 10.5 |
SVM | 19% | 46% | 32% | 57% | 3.7 |
The first two columns refer to the outcome of each classifier using 2-fold cross-validation on my training set according to the accuracy and F1 (average = 'macro') metrics, and the last two columns are the results of the classifiers on the test set according to the same metrics. From the above table you can observe that some of these classifiers have low performance. As my next task, I am trying to preprocess my data in more detail using techniques like StandardScaler, VarianceThreshold, PCA or SelectKBest, and RandomOverSampler, and to optimize the parameters of my classifiers with GridSearchCV. At this point I can pose my first question.
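For reference, such an out-of-the-box comparison can be produced roughly as follows (a sketch with default hyperparameters, except for max_iter, which is raised for the iterative models so they converge):

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score, accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0)  # 70% / 30% split

classifiers = {
    'GaussianNB': GaussianNB(),
    'Dummy': DummyClassifier(strategy='stratified'),
    '1NN': KNeighborsClassifier(n_neighbors=1),
    'Logistic': LogisticRegression(max_iter=1000),
    'MLP': MLPClassifier(max_iter=500),
    'SVM': SVC(),
}

for name, clf in classifiers.items():
    # 2-fold cross-validation on the training set (macro F1 and accuracy)
    cv = cross_validate(clf, X_train, y_train, cv=2,
                        scoring=['f1_macro', 'accuracy'])
    # Fit on the whole training set and evaluate on the held-out test set
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"{name}: "
          f"CV f1 = {cv['test_f1_macro'].mean():.2f}, "
          f"CV acc = {cv['test_accuracy'].mean():.2f}, "
          f"test f1 = {f1_score(y_test, y_pred, average='macro'):.2f}, "
          f"test acc = {accuracy_score(y_test, y_pred):.2f}")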
Q1 Are the above techniques (except GridSearchCV, which I completely understand) guaranteed to improve the performance of the classifiers? I mean, is there any justification that these techniques generally improve results, or is it just a trial-and-observation procedure?
In the next block of code I create a pipeline that, before training the model (in this case only GaussianNB), sequentially applies oversampling, removal of low-variance features, and scaling. Observe that I first split my dataset into train and test and only then apply the oversampling, so it affects the training set only.
# Case 1: splitting first, then oversampling on the training set only
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

selector = VarianceThreshold()   # drop zero-variance features
scaler = StandardScaler()
over = RandomOverSampler()
clf = GaussianNB()

# The imblearn Pipeline applies the resampler only when fitting,
# so the test set passed to score() is never oversampled.
pipe = Pipeline(steps=[('over', over), ('selector', selector),
                       ('scaler', scaler), ('GNB', clf)])

train_new, test_new, train_new_labels, test_labels_new = train_test_split(
    features, labels, test_size=0.3)

pipe.fit(train_new, train_new_labels)
pipe.score(test_new, test_labels_new)
The accuracy on the test set of the above pipeline is 38%, which is 2% less than the score of GaussianNB without this preprocessing procedure, so my second question is the following.
Q2 Why did these modifications deteriorate the performance of the classifier? Is there any sign in the values of my dataset, or in its structure, that could have predicted this outcome?
Now, if I swap the order of the splitting and the oversampling, I get completely different results. For example, I run the following block of code.
# Case 2: oversampling before splitting
from sklearn.feature_selection import SelectKBest

clf = GaussianNB()

# First do the oversampling on the whole dataset, then do the splitting
features2, labels2 = over.fit_resample(features, labels)
train_new, test_new, train_new_labels, test_labels_new = train_test_split(
    features2, labels2, test_size=0.3)

# Preprocessing: fit each transformer on the training set, apply it to both sets
train_new = selector.fit_transform(train_new)
test_new = selector.transform(test_new)
train_new = scaler.fit_transform(train_new)
test_new = scaler.transform(test_new)

selector2 = SelectKBest(k=100)   # univariate feature selection (k chosen here just as an example)
train_new = selector2.fit_transform(train_new, train_new_labels)
test_new = selector2.transform(test_new)

clf.fit(train_new, train_new_labels)    # Fit on the (oversampled) training set
clf.score(test_new, test_labels_new)    # Evaluate on the (oversampled) test set
I get 74% accuracy on the test set, which is much better than before. So my question is:
Q3 Why did swapping the order of the splitting and the oversampling change the result that much? In general, must I do the splitting first and then apply the preprocessing only to my training set? For example, I understand that if I first apply some preprocessing like scaling or PCA and then split my set, my results would be biased since I have preprocessed the test set as well, but I don't understand why this happens with oversampling too (if this is the case).
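One way to make this concrete (just a sketch, reusing the features and labels from above) is to count how many test rows in Case 2 are exact duplicates of training rows, which mostly arise because RandomOverSampler copies samples before the split:

import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# Oversample first, then split, exactly as in Case 2
features2, labels2 = RandomOverSampler().fit_resample(features, labels)
train_new, test_new, _, _ = train_test_split(features2, labels2, test_size=0.3)

# Count test rows that are identical to some training row
train_rows = {tuple(row) for row in np.asarray(train_new)}
n_leaked = sum(tuple(row) in train_rows for row in np.asarray(test_new))
print(f"{n_leaked} of {len(test_new)} test rows also appear in the training set")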
To give you another view of the above result, below I show the learning curve of GaussianNB under 10-fold cross-validation in the second case, where I first do the oversampling and then the splitting.
As you can see from the above snippet, the validation score and the training score converge to the same value, which is a good indication that the model can achieve good generalization performance.
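For completeness, such a learning curve can be produced with scikit-learn's learning_curve (a sketch, using the oversampled features2 and labels2 from Case 2):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

# 10-fold CV learning curve of GaussianNB on the oversampled data (Case 2)
train_sizes, train_scores, val_scores = learning_curve(
    GaussianNB(), features2, labels2, cv=10, scoring='accuracy',
    train_sizes=np.linspace(0.1, 1.0, 10))

plt.plot(train_sizes, train_scores.mean(axis=1), label='training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='validation score')
plt.xlabel('training set size')
plt.ylabel('accuracy')
plt.legend()
plt.show()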
Q4 Which of the above two cases is more reliable for giving good results on future unseen samples? Furthermore, what kind of preprocessing would you suggest for the above dataset? For example, in these two runs I deleted all samples from my dataset that belonged to 2 or more classes, a modification that left me with 84% of my initial dataset. Would it have been better to create duplicates of these samples instead, to counteract the imbalance of the dataset as much as possible?
PS: Excuse the long post. I don't expect answers to all of the above questions; I would be very pleased if you shared your opinions on any of them where you have some insight. Thanks in advance!