I am confused about which of the following should be used for standardization:

  • Method 1: fit and transform the training data, then only transform the test data

    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
    
  • Method 2: fit and transform both the training data and the test data

    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    # scaler_train=sc.fit(X_train)
    #X_train_sd=scaler_train.transform(X_train)
    X_test = sc.fit_transform(X_test)
    #scaler_test=sc.fit(X_test)
    #X_test_sd=scaler_train.transform(X_test)
    

This is a follow-up question to: StandardScaler before and after splitting data

Zephyr

1 Answer


You should fit your scaler only on the training data. The scaler is part of your model, and fitting it to some data amounts to learning from that data.
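
To make the "scaler is part of your model" point concrete, here is a minimal sketch. The toy arrays are made up purely for illustration and are not from the question. It shows that `fit` stores statistics learned from the training data (`mean_` and `scale_`), and that `transform` simply reuses them on new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration only (an assumption, not the asker's data).
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0], [20.0]])

sc = StandardScaler()
sc.fit(X_train)  # learns mean_ and scale_ from the training data only

train_mean = sc.mean_[0]    # mean of the training column
train_scale = sc.scale_[0]  # population standard deviation of the training column

# transform() reuses the training statistics; it does not refit the scaler.
X_test_scaled = sc.transform(X_test)

# The same result computed by hand from the training statistics.
manual = (X_test - train_mean) / train_scale
```

Since `X_test_scaled` matches the manual computation, the test data is scaled with parameters that came entirely from the training set, which is what method 1 does.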

The test set exists to evaluate your model on previously unseen data. If you fit your scaler to the test data, that data is no longer "unseen", so use method 1.
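
One way to make it hard to fit the scaler on test data by accident is to bundle it with the estimator in a scikit-learn pipeline. The sketch below is only an illustration: the synthetic dataset, the `LogisticRegression` estimator, and the split sizes are assumptions chosen for the example, not something from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and classifier are assumptions for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline fits the scaler and the classifier on the training data.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# score() only calls transform() on X_test before predicting,
# so the test data never influences the scaler's statistics.
test_accuracy = model.score(X_test, y_test)
```

With this setup the scaler inside the pipeline has only ever seen the 150 training rows, which matches method 1.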

Adam Oudad