SMOTE train test split with validation data

Question

I would like to ask how to use SMOTE correctly. My dataset is imbalanced and the problem is multiclass. As I have read in many posts, SMOTE should be applied only to the training data (X_train and y_train), not to the test data (X_test and y_test). I also use validation data. How do you handle SMOTE with validation data?

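import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

df = pd.read_excel...
X = df.drop('column1', axis=1)
y = df.column1
# Training part: hold out the test set, then oversample the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
# Validation part: split the already oversampled training data
X_train_smote, X_val, y_train_smote, y_val = train_test_split(X_train_smote, y_train_smote, test_size=0.5, random_state=42)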

Is this correct?

And is it right that the validation data (X_val and y_val) contain SMOTE samples? Or should I create them from the normal train/test split: X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.5, random_state=42)? I'm confused.

score 1 · Accepted Answer · answered Oct 31 '20 at 13:09

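The problem with applying SMOTE to the data and then splitting (whether into test or validation sets does not matter) is that you can suffer from data leakage: information from the training data spills over into the data you later evaluate on and falsely produces good predictions. SMOTE creates synthetic samples by interpolating between existing minority-class neighbors, so a split made afterwards puts samples derived from the same original points on both sides. I would advise keeping the SMOTE data generation process separate for all 3 data sets. So do the splits first, then do the data generation.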

answered Oct 31 '20 at 13:09

Noah Weber

How do you mean that? Should I do this: "smote = SMOTE(random_state=42) X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)" at the end, after the train/test split and the validation split? – martin Oct 31 '20 at 13:35
In general, do you mean that SMOTE should be applied after both splits, that is, after splitting off the test data and after splitting off the validation data? – martin Oct 31 '20 at 13:44
Yes, exactly. Do it after the splits to be on the safe side. – Noah Weber Oct 31 '20 at 14:13
Thank you very much. I changed it; now only X_train and y_train are resampled with SMOTE. X_val and y_val are without SMOTE. The F1 value decreased. – martin Oct 31 '20 at 14:36
Well, of course, you are no longer leaking. That's to be expected. Think about the model in production, which will be more stable. – Noah Weber Oct 31 '20 at 14:43
