Machine Learning Steps

Question

Which of the following sequences of steps is the correct one when creating a predictive model?

Option 1:

First eliminate the most obviously bad predictors, and preprocess the remaining if needed, then train various models with cross-validation, pick the few best ones, identify the top predictors each one has used, then retrain those models with those predictors only and evaluate accuracy again with cross-validation, then pick the best one and train it on the full training set using its key predictors and then use it to predict the test set.
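For readers who prefer code, Option 1 might look roughly like the sketch below. It is only an illustration under assumptions that are not in the question: synthetic data from `make_classification`, two tree-based candidate models, impurity-based `feature_importances_` as the notion of "top predictors", and a hypothetical cutoff `TOP_K`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for an already cleaned data set (assumption).
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

# Candidate models; both expose feature_importances_ after fitting.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
}

TOP_K = 5  # hypothetical number of predictors to keep per model
results = {}
for name, model in models.items():
    # Cross-validated accuracy with all predictors.
    full_cv = cross_val_score(model, X_train, y_train, cv=5).mean()
    # Rank predictors by the importances of a fit on the training set.
    importances = model.fit(X_train, y_train).feature_importances_
    top = np.argsort(importances)[::-1][:TOP_K]
    # Cross-validated accuracy again, using only that model's top predictors.
    reduced_cv = cross_val_score(model, X_train[:, top], y_train, cv=5).mean()
    results[name] = {"full_cv": full_cv, "top": top, "reduced_cv": reduced_cv}

# Pick the best reduced model, refit on the full training set, score the test set.
best_name = max(results, key=lambda n: results[n]["reduced_cv"])
best_top = results[best_name]["top"]
final_model = models[best_name].fit(X_train[:, best_top], y_train)
test_accuracy = final_model.score(X_test[:, best_top], y_test)
```

Because the top predictors are chosen using the same training rows that are later cross-validated, the second round of cross-validation scores is likely to be somewhat optimistic.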

Option 2:

First eliminate the most obviously bad predictors, then preprocess the remaining ones if needed. Next, use a feature selection technique such as recursive feature elimination (e.g. RFE with a random forest) with cross-validation to identify the ideal number of key predictors and which predictors they are. Then train different model types with cross-validation and see which one gives the best accuracy with the top predictors identified earlier. Finally, train the best of those models again with those predictors on the full training set and use it to predict the test set.
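A rough scikit-learn sketch of Option 2 follows. The data set, the candidate models and all parameter values are assumptions made for illustration; `RFECV` is used here as scikit-learn's cross-validated form of recursive feature elimination.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for an already cleaned data set (assumption).
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Recursive feature elimination with a random forest; cross-validation
# chooses how many predictors to keep.
selector = RFECV(RandomForestClassifier(n_estimators=50, random_state=0),
                 step=1, cv=5, scoring="accuracy")
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Compare several model types on the selected predictors only.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
cv_scores = {name: cross_val_score(model, X_train_sel, y_train, cv=5).mean()
             for name, model in candidates.items()}

# Refit the best model on the full training set, then score the test set once.
best_name = max(cv_scores, key=cv_scores.get)
best_model = candidates[best_name].fit(X_train_sel, y_train)
test_accuracy = best_model.score(X_test_sel, y_test)
```

`selector.n_features_` holds the number of predictors kept and `selector.support_` marks which ones. The predictors are selected once, outside the second cross-validation loop, so the model comparison scores can be slightly optimistic.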

What do you mean by then preprocess the remaining if needed? Is it data cleaning? — Dawny33, Feb 04 '16 at 10:09
I meant to preprocess the remaining features that you think are useful. By preprocessing I mean, do scaling, or transformations like log, or others if and as needed. — A K, Feb 04 '16 at 11:20
Ahh, as I expected :) Anyways, I have written the answer with the workflow which me and my team generally follow! — Dawny33, Feb 04 '16 at 11:22

Dawny33 · Accepted Answer · 2016-03-13T14:09:37.667

17

I found both of your options slightly faulty. So, this is generally (very broadly) what a predictive modelling workflow looks like:

Data Cleaning: Takes the most time, but every second spent here is worth it. The cleaner your data is after this step, the less total time you will spend later.
Splitting the data set: The data set is split into training and testing sets, which are used for modelling and prediction respectively. An additional split for a cross-validation set is also needed.
Transformation and Reduction: Involves processes like transformations, mean and median scaling, etc.
Feature Selection: This can be done in a lot of ways like threshold selection, subset selection, etc.
Designing predictive model: Design the predictive model on the training data depending on the features you have at hand.
Cross Validation:
Final Prediction, Validation
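The steps above could be sketched in scikit-learn roughly as follows. This is one possible reading, not necessarily the workflow the answer has in mind: the bundled breast cancer data set stands in for already cleaned data, `StandardScaler` for the transformation step, `SelectKBest` for threshold-style feature selection, and logistic regression for the model. Wrapping the steps in a `Pipeline` means scaling and selection are refit on the training part of every cross-validation fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Data cleaning is assumed to be done already; this data set ships with sklearn.
X, y = load_breast_cancer(return_X_y=True)

# Splitting: hold out a test set that is only touched at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Transformation, feature selection and model as one pipeline.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Cross-validation on the training set picks the number of features to keep.
search = GridSearchCV(pipeline, param_grid={"select__k": [5, 10, 20]},
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Final prediction and validation on the held-out test set.
test_accuracy = search.score(X_test, y_test)
```

`search.best_params_` reports the chosen `k`, and `search.best_estimator_` is the pipeline refit on the whole training set.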

edited Mar 13 '16 at 14:09

answered Feb 04 '16 at 10:27

Dawny33


I think your steps match my option 2. My understanding is that as part of the Feature Selection step, we can run a recursive feature elimination function (RFE) using random forests for example with cross-validation to determine the best number of predictors and what they are and then use those predictors to train several algorithms with cross-validation and compare accuracy to get the best model that uses those best predictors. What do you think? – A K Feb 04 '16 at 11:24
@AndrewKostandy Yeah, the subset selection algorithm for feature selection almost works the same way :) – Dawny33 Feb 04 '16 at 11:30
You're welcome. I'm currently learning for an exam where one of the standard questions of the professor is "what is the first thing you do after obtaining and cleaning the data?" :-) – Martin Thoma Feb 04 '16 at 15:12
@Dawny33 Wouldn't you want to perform transformations, scaling etc before splitting your dataset into training and testing? – Minu Jun 10 '16 at 20:12
@Minu Sometimes before and sometimes after :) – Dawny33 Jun 11 '16 at 04:43
@Dawny33 Could you give me a scenario where you would want to perform transformation and scaling after splitting? I'm assuming when you say after splitting, you apply those transformations only to the training data set. – Minu Jun 11 '16 at 12:56
@Minu I'm assuming when you say after splitting, you apply those transformations only to the training data set <-- Yes – Dawny33 Jun 11 '16 at 12:59
Any reason why you'd perform variable transformations and scaling only to the training data? How would you then adjust the test data to match? Just curious. – Minu Jun 11 '16 at 13:10
@Minu Scaling is done on the complete dataset. Sorry that I confused you somewhere :) – Dawny33 Sep 21 '17 at 09:36
NO, Pre-processing IS NOT done Before splitting the data. Please Investigate the idea of Data Leakage. – mccurcio Apr 30 '20 at 19:56
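To make the leakage point in the comments above concrete, here is a minimal sketch that assumes `StandardScaler` as the scaling step and random numbers as data. The scaler learns its mean and standard deviation from the training rows only, and the same fitted scaler is then applied to the test rows, which is how the test data gets adjusted to match.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in data: 200 rows, 3 numeric features (assumption).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# Split first, so the test rows cannot influence any fitted statistic.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Fit the scaler on the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Reuse the training mean and standard deviation for the test rows.
X_test_scaled = scaler.transform(X_test)

# The pattern to avoid: statistics computed from all rows, test rows included.
leaky_scaler = StandardScaler().fit(X)
```

After this, the scaled training columns have mean zero, while the scaled test columns generally do not, since they were shifted by the training statistics.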

score 3 · Answer 2 · answered Feb 10 '16 at 12:40

Where feature selection fits in your pipeline depends on the problem. If you know your data well, you can select features manually based on that knowledge. If you don't, experimenting with the models using cross-validation may be best. Reducing the number of features a priori with an additional technique like chi2 or PCA may actually reduce model accuracy.

In my experience with text classification using an SGD classifier, for example, leaving all hundred thousand words encoded as binary features gave better results than reducing them to a few thousand or a few hundred. Training is actually faster with all features, since feature selection is rather slow with my toolset (sklearn) because it is not stochastic like SGD.
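A comparison of this kind could be set up as sketched below. The twelve-sentence corpus, the labels and `k=5` are made up for illustration and are far too small to reproduce the result described; the point is only the shape of the experiment, with and without a chi2 selection step in front of the same classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus (assumption): label 0 = sports, label 1 = technology.
docs = [
    "the team won the match with a late goal",
    "the striker scored a goal in the final match",
    "the coach praised the team after the game",
    "fans cheered as the team won the cup",
    "the player missed a penalty in the game",
    "the match ended after the keeper saved a goal",
    "the new phone has a faster processor",
    "the laptop ships with more memory and a new processor",
    "developers released a software update for the phone",
    "the software update fixed a memory bug",
    "the chip maker announced a faster laptop processor",
    "the app crashed after the software update",
]
labels = [0] * 6 + [1] * 6

# Every word kept as a binary present/absent feature.
all_features = make_pipeline(
    CountVectorizer(binary=True),
    SGDClassifier(random_state=0),
)

# Same model, but only the k words with the highest chi2 score are kept.
reduced_features = make_pipeline(
    CountVectorizer(binary=True),
    SelectKBest(chi2, k=5),
    SGDClassifier(random_state=0),
)

# Cross-validated accuracy for both variants.
scores_all = cross_val_score(all_features, docs, labels, cv=3)
scores_reduced = cross_val_score(reduced_features, docs, labels, cv=3)
```

Comparing `scores_all.mean()` with `scores_reduced.mean()` on a real corpus, along with the fit times, would show whether the selection step helps or hurts for that problem.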

Multicollinearity is something to watch out for, but the feature interpretability might equally be important.

People also report getting the best results with ensembles of models, each model capturing a particular part of the information space better than the others. That would also preclude you from selecting features before fitting all the models you'd include in your ensemble.
