I'm working through Aurélien Géron's book, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow.

The book's approach is to split the data into a training set and a test set as early as possible, to avoid data snooping bias. Changes are then made to the training data.

My question: since changes were made to the training set, I assume the same changes (dropping columns, filling in missing values, converting categorical features to numerical ones, etc.) should also be made to the test set before evaluating the model? If so, what is the correct way to do this? Writing everything as a function and running it on both sets seems a bit counterintuitive when working in notebooks. Is there a built-in function that I'm not aware of?

1 Answer

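Most transformations are available as built-in classes in scikit-learn. You can assemble them into a scikit-learn Pipeline, ending with your estimator, and call the pipeline's fit method on the training data: each transformer is fitted on the training data and the data passes through every step before reaching the model. When you are ready to evaluate on the test data, you simply call the pipeline's predict method (or transform, if the pipeline contains only preprocessing steps). This ensures that the test data passes through exactly the same transformations as the training data, using the parameters learned from the training set.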
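As a rough illustration of that workflow, here is a minimal sketch. The toy DataFrame, its column names (rooms, region, unused_id, price), and the choice of LinearRegression are all made up for the example and are not taken from the book; the point is only that the column dropping, imputation, and encoding live inside one object that is fitted on the training set and reused on the test set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column with a missing value,
# one categorical column, one column we want to drop, and a target.
df = pd.DataFrame({
    "rooms": [3.0, 4.0, np.nan, 5.0, 2.0, 6.0, 3.0, 4.0],
    "region": ["north", "south", "south", "north",
               "east", "east", "north", "south"],
    "unused_id": list(range(8)),
    "price": [200.0, 250.0, 240.0, 310.0, 150.0, 330.0, 210.0, 260.0],
})

X = df.drop(columns="price")
y = df["price"]

# Split first, before any statistics are computed from the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Numeric columns: fill missing values with the median, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Columns not listed here (unused_id) are dropped by remainder="drop".
preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_steps, ["rooms"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ],
    remainder="drop",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

# fit learns the median, scaling, and categories from the training set only.
model.fit(X_train, y_train)

# predict applies those same fitted transformations to the test set.
predictions = model.predict(X_test)
```

In a notebook this keeps the preprocessing in one place: you do not have to remember to rerun each cleaning cell on the test set, because calling predict (or score) on the pipeline does it for you.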

Jayaram Iyer