1
X = all features from the dataset

y = all targets from the dataset

X_train = training features obtained using the train_test_split approach

y_train = training targets obtained using the train_test_split approach

So my question is: which one should I choose if I want to do hyperparameter tuning? I have imbalanced data, and in this case I would like to build a pipeline that contains SMOTE and the algorithm. I have read that you should do the oversampling inside each fold of cross-validation. Since RandomizedSearchCV also cross-validates, I decided to run SMOTE inside the pipeline. But I am unsure which data I should fit after I run the code:

fit(X,y) or fit(X_train, y_train)
Ethan

2 Answers

2

The recommended approach is to use cross-validation on the training dataset (X_train, y_train) for hyperparameter tuning, applying the oversampling within each fold of the cross-validation.

The code would look something like this:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold

# train_test_split returns the splits in this order: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y)

# With SMOTE inside the pipeline, resampling is applied only to the training portion of each fold
pipeline = Pipeline([("smote", SMOTE()), ("rf", RandomForestClassifier())])

kf = StratifiedKFold()

# RandomizedSearchCV needs a parameter space; these values are only an illustrative example
param_distributions = {"rf__n_estimators": [100, 200, 300, 500],
                       "rf__max_depth": [None, 5, 10, 20]}

rscv = RandomizedSearchCV(estimator=pipeline, param_distributions=param_distributions, cv=kf)
rscv.fit(X_train, y_train)
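After fitting, you would typically inspect the selected hyperparameters and evaluate the refitted best pipeline on the held-out test set. A small follow-up sketch (not part of the original snippet):

print(rscv.best_params_)           # hyperparameters chosen by the randomized search
print(rscv.score(X_test, y_test))  # final evaluation on the untouched test set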

Brian Spiering
  • Thank you for your answer, that helped me understand. So when I apply SMOTE in the pipeline, it applies the oversampling in each fold (because RandomizedSearchCV uses cross-validation). But can I ask you one more question about oversampling? https://datascience.stackexchange.com/questions/115218/different-training-score-but-same-test-score-when-using-pipeline?noredirect=1#comment116376_115218 In that case I don't want to tune hyperparameters; I just want to evaluate my model, so I think it doesn't matter whether I apply SMOTE before or inside the pipeline. – Jovian Aditya Oct 15 '22 at 03:21
0

Neither one:

  • Certainly not the whole dataset (X,y) because this would cause data leakage and invalidate the evaluation.
  • The training set (X_train, y_train) should be used only for training.

The solution is to split the training set into a training set and a validation set. Equivalently, you can use cross-validation on the full training set, since the CV process will take care of splitting the data.
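As a minimal sketch of the cross-validation option (the classifier and candidate parameter values below are only placeholders, not a recommendation):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Tuning loop: each candidate value is scored by cross-validation on the training set only,
# so the test set stays untouched until the final evaluation
for n_estimators in [100, 200, 500]:
    model = RandomForestClassifier(n_estimators=n_estimators)
    scores = cross_val_score(model, X_train, y_train, cv=StratifiedKFold())
    print(n_estimators, scores.mean())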

If interested, this is another explanation about using a validation set for parameter tuning.

If you resample (I don't recommend it, at least not without a good reason), you should never do it on the validation set or the test set.

Erwan