
I have a problem: I get different training scores when using a pipeline versus doing the steps manually.

MANUAL:

#imports
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from imblearn import over_sampling
from imblearn.pipeline import Pipeline

#standardize data
sc = StandardScaler()
X_train[['age','balance','duration']] = sc.fit_transform(X_train[['age','balance','duration']])
X_test[['age','balance','duration']] = sc.transform(X_test[['age','balance','duration']])

#applying SMOTE
X_oversampling, y_oversampling = over_sampling.SMOTE(random_state=42).fit_resample(X_train, y_train)

#modelling
model_lr = LogisticRegression()
model_lr.fit(X_oversampling, y_oversampling)

#evaluation
y_pred = model_lr.predict(X_test)
y_pred_train = model_lr.predict(X_oversampling)
print(f'Train Accuracy Score : {round(accuracy_score(y_oversampling, y_pred_train), 4)}')
print(f'Test Accuracy Score : {round(accuracy_score(y_test, y_pred), 4)}')

#result
Train Accuracy Score : 0.835
Test Accuracy Score : 0.82

WITH PIPELINE:

#modelling (imblearn's Pipeline, so SMOTE is applied only during fit)
pipeline_logreg = Pipeline([('sampling', over_sampling.SMOTE(random_state=42)),
                            ('logreg', LogisticRegression())])
pipeline_logreg.fit(X_train, y_train)

**The reason I don't include the StandardScaler in my pipeline is that I've already applied it manually in the code above (at the #standardize data step).
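
For reference, a minimal sketch of how the scaler could also live inside the pipeline via a ColumnTransformer (the step names here are illustrative, not something I've run):

from sklearn.compose import ColumnTransformer

#sketch: scale only the numeric columns, pass the rest through unchanged
preprocess = ColumnTransformer(
    [('scale', StandardScaler(), ['age', 'balance', 'duration'])],
    remainder='passthrough')
pipeline_full = Pipeline([('prep', preprocess),
                          ('sampling', over_sampling.SMOTE(random_state=42)),
                          ('logreg', LogisticRegression())])
pipeline_full.fit(X_train, y_train)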

#evaluation
y_pred = pipeline_logreg.predict(X_test)
y_pred_train = pipeline_logreg.predict(X_train)
print(f'Train Accuracy Score : {round(accuracy_score(y_train, y_pred_train), 4)}')
print(f'Test Accuracy Score : {round(accuracy_score(y_test, y_pred), 4)}')

#result
Train Accuracy Score : 0.8261
Test Accuracy Score : 0.82

So why is the result different on training accuracy? The test accuracy score is the same.
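
One detail I noticed while comparing the two numbers: the manual train score is computed against y_oversampling, while the pipeline train score is computed against the original y_train, so the two scores are not measured on the same data. A minimal check, reusing the fitted objects above, would be to score both models on the identical set:

#sketch: score both fitted models on the same (already scaled) training set;
#any remaining difference would then come from the models themselves
acc_manual = accuracy_score(y_train, model_lr.predict(X_train))
acc_pipeline = accuracy_score(y_train, pipeline_logreg.predict(X_train))
print(f'manual: {round(acc_manual, 4)}  pipeline: {round(acc_pipeline, 4)}')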

  • Welcome to DataScienceSE. There can be several reasons: in the best case, the pipeline simply prevents some overfitting. But I would be more concerned about other issues: first, is it binary classification? How many classes, are they balanced? Because accuracy could mask other problems. Also how many instances? Did you check that oversampling brings any advantage? – Erwan Oct 14 '22 at 09:39
  • @Erwan yes, it's binary classification, so I'm using the logistic regression algorithm. The data is not balanced: 0 : 1 = 88.7% : 11.3%. That's why I'm doing oversampling, and the oversampling definitely benefits my model, because my goal is to maximize recall more than precision. If I don't oversample, my recall score is very low while precision is high. I'm new to the data science world, so I'm learning to use other methods (for example, pipelines) to build my model. But it turns out I got a different result (only on the train score; the test result is the same). – Jovian Aditya Oct 14 '22 at 10:12
  • @Erwan I know that when handling imbalanced data you should use the ROC AUC score to evaluate; it's not wise to use accuracy. I also tried the ROC AUC score, and it too gives a different training score when using the pipeline (the test score is again the same). By the way, I have another question from your comment: if I've already done oversampling on my imbalanced data, should I still prefer the ROC AUC score over accuracy for evaluation? And my data is binary classification with two classes. – Jovian Aditya Oct 14 '22 at 10:16
  • @Erwan I'm also using both methods for hyperparameter tuning: in the first I fit my RandomizedSearchCV on (X_oversampling, y_oversampling); in the second I use a pipeline (which also contains SMOTE) and fit on (X_train, y_train) (see the tuning sketch after these comments). It also gives me different training scores between the two methods. The results show it's better to use the pipeline, because the gap between train and test ROC AUC is smaller than with the first method. Which one would you choose? Sorry for the many questions; I'm new to the data science world and don't have anyone to teach me :( – Jovian Aditya Oct 14 '22 at 10:26
  • Ok so the first issue I see is that accuracy is only 82%, and if the majority class is 88% you should obtain at least 88% (basically a naive model can obtain 88% accuracy just by always predicting the majority class). Can you show the confusion matrix, or a classification report with precision/recall? That would help to understand what happens. – Erwan Oct 14 '22 at 10:53
  • @Erwan I think my accuracy score is only 82% because in that model I've already done oversampling with SMOTE, so the counts of 0 and 1 are the same. This is my confusion matrix: [[6575 1377] [ 251 840]] (see the first sketch after these comments). I've also tried modelling without oversampling: the train accuracy score is 91% and the test accuracy is around 89%, but it has a very low recall score. Since I want to maximize recall, I do SMOTE oversampling, and it turns out it effectively increases recall but drops accuracy and precision. I think that makes sense. – Jovian Aditya Oct 14 '22 at 11:00
  • Ok so first: in the code above, the oversampling is correctly done only on the training set. It would be a mistake to apply resampling on the test set (or to do it before splitting). Ok, I agree that if you really want to maximize recall it might make sense to apply oversampling like this. But note that this means that the model is forced to label many instances as class 1, thus causing a lot of false positive errors (precision is only 38%, i.e. instances predicted as 1 are more likely to be 0 than 1 really). – Erwan Oct 14 '22 at 15:46
  • It's also worth noting that you could keep increasing recall even more by oversampling even more class 1 (or undersampling class 0), but of course it would cost even more in precision. The extreme case would be to force the model to label everything as 1: recall would be 100% but precision would be only 12%... and the model would be really pointless ;) – Erwan Oct 14 '22 at 15:48
  • @Erwan yeah, I know that it makes the precision score really bad. The final result I got from hyperparameter tuning is 78% recall and 38% precision. This dataset is about a bank's term-deposit marketing campaign. Assume the bank has many people who can call customers to offer a term deposit over the phone, but the data shows that of 45211 customers only 11.3% want to buy a term deposit. So I think, if the bank has plenty of staff who can offer term deposits and the goal is to increase the number of customers who buy one, it's not a big deal for the precision score to be low. – Jovian Aditya Oct 15 '22 at 01:23
  • @Erwan and I got an auc-proba test score of 87.7% and an auc-proba train score of 89.7%. I think this shows that my logistic regression model is not overfit (I'm assuming the 2% gap is not a big deal). Any opinion on this? – Jovian Aditya Oct 15 '22 at 01:26
  • I agree that there's no or little overfitting. These AUC scores look very high though; I'm surprised they would be this high with such low precision. – Erwan Oct 15 '22 at 09:13
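
To make the precision/recall figures in these comments concrete, here is a minimal sketch that recovers them from the confusion matrix quoted above (nothing assumed beyond those four counts):

import numpy as np

#confusion matrix from the comments: rows = actual class, cols = predicted class
cm = np.array([[6575, 1377],
               [ 251,  840]])
tn, fp, fn, tp = cm.ravel()
print(f'recall    : {tp / (tp + fn):.3f}')   #840 / 1091, about 0.77
print(f'precision : {tp / (tp + fp):.3f}')   #840 / 2217, about 0.38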
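
And a sketch of the pipeline-based tuning setup described above; the search space is hypothetical, but the key point is that SMOTE is re-applied only to the training folds inside each CV split, which would explain the smaller train/test gap:

from sklearn.model_selection import RandomizedSearchCV

#hypothetical search space; pipeline parameters are addressed as <step>__<param>
param_dist = {'logreg__C': [0.01, 0.1, 1, 10]}
search = RandomizedSearchCV(pipeline_logreg, param_dist, n_iter=4,
                            scoring='roc_auc', cv=5, random_state=42)
search.fit(X_train, y_train)   #SMOTE runs only on the training folds of each split
print(search.best_params_, round(search.best_score_, 4))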
