I would like to ask how to use SMOTE correctly. My dataset is imbalanced and the problem is multiclass. As I have read in many posts, SMOTE should be applied only to the training data (X_train and y_train), not to the test data (X_test and y_test). In addition to those, I also use validation data. How do you handle SMOTE with validation data?
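import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

df = pd.read_excel...
X = df.drop('column1', axis=1)  # features
y = df.column1  # target
# Training part: hold out 20% as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)
# Oversample the training data only
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
# Validation part: split the already oversampled training data in half
X_train_smote, X_val, y_train_smote, y_val = train_test_split(X_train_smote, y_train_smote, test_size=0.5, random_state=42)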
Is this correct?
And is it right that the validation sets (X_val and y_val) both contain SMOTE-generated samples? Or should I create them from the normal train/test split instead: X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.5, random_state=42)? I'm confused.