Standardization on training and split data

Question

I am confused about which of the following approaches should be used for standardization:

Method 1: fit_transform on the training data, transform only on the test data

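from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# Learn the mean and standard deviation from the training data, then scale it
X_train = sc.fit_transform(X_train)
# Scale the test data with the statistics learned from the training data
X_test = sc.transform(X_test)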

Method 2: fit_transform on both the training and the test data

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
# scaler_train = sc.fit(X_train)
# X_train_sd = scaler_train.transform(X_train)
# Refit the same scaler on the test data, then scale it
X_test = sc.fit_transform(X_test)
# scaler_test = sc.fit(X_test)
# X_test_sd = scaler_train.transform(X_test)

This is a follow-up question to: StandardScaler before and after splitting data

score 2 · Answer 1 · answered Sep 18 '20 at 05:40

You should fit your scaler only on the training data. The scaler is part of your model, and fitting it to any data counts as learning from that data.
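One way to make "the scaler is part of your model" concrete is to put both in a scikit-learn Pipeline, so that fitting the model also fits the scaler, and only on whatever data is passed to fit. This is a minimal sketch, not the asker's setup: the synthetic data, the LogisticRegression classifier, and the split parameters are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (not from the question)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] + X[:, 1] > 10.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler and the classifier form one model
model = make_pipeline(StandardScaler(), LogisticRegression())

# fit() learns the scaling statistics from X_train only
model.fit(X_train, y_train)

# score() scales X_test with the training statistics before predicting
test_accuracy = model.score(X_test, y_test)

# The fitted scaler holds the training means, not the test means
fitted_scaler = model.named_steps["standardscaler"]
```

With this arrangement there is no separate scaler object that could be refit on the test data by accident.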

Test data is used to evaluate your model on previously unseen data. If you fit your scaler to the test data, that data is no longer "unseen", so method 1 is the one to use.
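To see what goes wrong with method 2, here is a small sketch in plain Python that mimics the arithmetic behind standardization (subtract the mean, divide by the population standard deviation). The helpers fit and transform are hypothetical stand-ins written for this example, not scikit-learn functions, and the numbers are made up.

```python
from statistics import mean, pstdev


def fit(values):
    """Return the mean and population standard deviation of the values."""
    return mean(values), pstdev(values)


def transform(values, mu, sigma):
    """Standardize values with the given statistics."""
    return [(v - mu) / sigma for v in values]


# Made-up one-feature data: the test values are far above the training values
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [11.0, 12.0, 13.0]

# Method 1: statistics come from the training data only
mu_train, sigma_train = fit(train)
test_method_1 = transform(test, mu_train, sigma_train)

# Method 2: statistics are recomputed on the test data
mu_test, sigma_test = fit(test)
test_method_2 = transform(test, mu_test, sigma_test)
```

Under method 1 the scaled test values stay far from zero, so the model can see that the test data differs from what it was trained on. Under method 2 the test values are recentred around zero, which hides that shift and applies a different transformation from the one the model was trained with.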
