
By entire data I mean train + test + validation

Once I have fixed my hyperparameters using the validation data and chosen the model using the test data, wouldn't it be better to train the model on the entire data, so that its parameters are better estimated, rather than using a model trained only on the training data?

Dawny33
Apoorva Abhishekh
  • You should NEVER fix your hyperparameters using your test data. You just spoiled your entire experiment by removing your blind control group (test set). – JahKnows Apr 03 '17 at 13:06
  • @JahKnows After I am done tuning the hyperparameters for a model, I don't understand the harm, except that I will not know how well it generalizes over a different dataset. How did I spoil my experiment? Am I missing something? – Apoorva Abhishekh Apr 03 '17 at 14:32
  • See also https://datascience.stackexchange.com/q/33008/55122 – Ben Reiniger Jul 28 '21 at 18:28

4 Answers


The question rests on a wrong assumption: many people do exactly what you say they "cannot" do.

In fact, the grid search implementation in the widely deployed sklearn package does just that: unless refit=False, it will retrain the final model on the entire dataset passed to it.
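As a minimal sketch of that refit behaviour, assuming scikit-learn (the dataset, estimator and parameter grid below are illustrative placeholders, not part of the question): after the cross-validated search picks the best hyperparameters, the best estimator is refit on all of the data passed to fit(), while a separately held-back test set stays untouched.

```python
# Sketch only: illustrative data, estimator and grid, not part of the question.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"min_samples_leaf": [1, 5, 10]},
    cv=5,
    refit=True,  # the default: refit the best estimator on all data given to fit()
)
search.fit(X_train, y_train)   # the refit uses X_train/y_train only, never X_test
print(search.best_params_)
print(search.best_estimator_.score(X_test, y_test))  # final held-out evaluation
```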

I think for some hyperparameters this might not be very desirable, because their best values depend on the volume of data. For instance, consider the min_samples_leaf pre-pruning setting for a decision tree: if you have more data, the pre-pruning may not behave as you want.

But again, most people do in fact retrain using the entire data after cross-validation, so that they end up with the best model possible.

Addendum: @NeilSlater says below that some people perform a hold-out on top of CV. In other words, they have a train-test split and then perform model selection on the training set. According to him, they re-train using the original training split, but not the test set. The test set is then used to perform a final model estimation. Personally, I see three flaws in this: (a) it does not solve the problem I mentioned with some hyperparameters being dependent on the volume of training data, since you are re-training anyway; (b) when testing many models, I prefer more sophisticated methods such as nested cross-validation, so that no data goes to waste; and (c) hold-out is an awful method to infer how a model will generalize when you have little data.
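A hedged sketch of the nested cross-validation mentioned in (b), again assuming scikit-learn with a placeholder dataset, estimator and grid: an inner grid search handles model selection, while an outer cross-validation loop estimates generalization, so no fixed hold-out split is sacrificed.

```python
# Sketch only: nested cross-validation with a placeholder model and grid.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)  # model selection
outer_scores = cross_val_score(inner, X, y, cv=5)                  # generalization estimate
print(outer_scores.mean(), outer_scores.std())
```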

Ricardo Cruz
  • The feature is about re-using the cross-validation data; it is still not advisable to re-use the test data, because then you only have assumptions from cross-validation and no measure of performance. Any bug or problematic parameter (such as the example you give) could otherwise make the model undetectably worse. – Neil Slater Apr 11 '17 at 07:45
  • @NeilSlater I don't understand what you said here: "The feature is about re-using cross-validation" – Ricardo Cruz Apr 11 '17 at 16:14
  • "feature" -> the refit option of the GridSearchCV function. It doesn't re-fit to include held-out test data (it doesn't even get to see that data). – Neil Slater Apr 11 '17 at 19:09
  • @NeilSlater, you can easily check the code for yourself if you do not believe me (here). If refit=True, then "fit the best estimator using the entire dataset". – Ricardo Cruz Apr 11 '17 at 21:07
  • Yes, that is the entire training dataset that you fed into the GridSearchCV command, that it does k-fold cv and search against. In normal use, you would still hold back a separate test data set for a final stage evaluation, the function does not somehow fetch and include that. – Neil Slater Apr 11 '17 at 21:27
  • @NeilSlater, that is not my experience, but I have added your experience to my comment so that others can benefit from it. Thank you. – Ricardo Cruz Apr 12 '17 at 08:46
  • @RicardoCruz Do you have a source for this: "(c) hold-out is an awful method to infer how a model will generalize." I'd love to learn more. – rwking Jan 05 '18 at 18:44
  • @rwking, If you have lots of data, then 70-30 holdout is fine. But cross-validation is essential for much of machine learning. Check this article to see how important cross-validation is. – Ricardo Cruz Jan 05 '18 at 23:42
  • Is there research on when retraining on the full dataset improves upon a model (e.g. evaluating against a 4th holdout)? – Max Ghenis Aug 29 '18 at 20:09
  • @rwking He asserts that because holdout validation throws away data, and additionally you are validating on only a subset of the training set. K-fold is much more attractive because, across the folds, you train on the entire training set and the validation predictions cover the entire training set.

    However, I disagree with the OP that "hold-out is an awful method to infer how a model will generalize when you have little data". In fact, for time-series models, holding out future data for validation is much more informative and reliable than using K-fold.

    – Corey Levinson Nov 06 '19 at 05:39

Yes, you can.

Since the test data is supposed to come from a distribution similar to the train data, you won't break your model. If you have trained the model properly, you will notice no significant change (apart from a better accuracy metric on your previous test/validation data).

But it is rarely true that the test data comes from precisely the same distribution as the train data, so in a real application scenario you may well get better generalizability from your model.
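A hedged sketch of that retraining step, assuming scikit-learn; the estimator, the particular splits and the choice of C are placeholders standing in for whatever model and hyperparameters you have already tuned:

```python
# Sketch only: refit an already-tuned configuration on train + validation + test.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

best_model = LogisticRegression(C=1.0, max_iter=1000)  # pretend C was tuned on X_val
best_model.fit(X_train, y_train)

final_model = clone(best_model)  # same hyperparameters, fresh parameters
final_model.fit(np.concatenate([X_train, X_val, X_test]),
                np.concatenate([y_train, y_val, y_test]))
# Caveat from the comments below: no held-out data now remains to measure generalization.
```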

Il'ya Zhenin
  • The problem with including the test data as per this suggestion is that now you have no measurement of how well the model generalises. Yes, you might expect it to generalise better. However, you don't know, because you have removed your ability to measure it. I suggest adding that caveat, and explaining why sometimes this is still OK (e.g. when receiving new data, you might be able to treat that as a new test set and build up a new measurement over time, whilst hopefully taking advantage of the better model - it's a risk though). – Neil Slater Apr 03 '17 at 13:33
  • @NeilSlater: I understand that I have removed my ability to measure how well it will generalize on a different dataset. But if I have both a test set and a holdout set, then even after hyperparameter tuning I can train my model again on train + test and still be left with the holdout to check how my model generalizes. I know this is different from what I asked, but I just want to know your view. – Apoorva Abhishekh Apr 03 '17 at 14:35
  • @ApoorvaAbhishekh: If you had yet another holdout set of data, then yes you can use that as the new test set against the new model trained on new_train = {old train, old cv, old test}. Then you would get a measure of generalisation. Although you have to be careful not to over-use it - if it turns out there is a problem with the new combined set (e.g. early stopping needs to change due to more data) then you cannot also use it as the new cv set . . . unless you have yet another holdout set in reserve . . . – Neil Slater Apr 03 '17 at 15:00
  • @NeilSlater In theory you need a new dataset to know the performance. In practice you might be confident that your model performs well, because you have worked with it for a very long time and know what to expect. But usually you also have other data with which to check a model's performance, for example unlabeled data in computer vision. It is not rigorous, but it can work too. Sure, it is an extreme case, but I want to say that it might work. Myself, I always keep a test set that I do not mix into training. – Il'ya Zhenin Apr 04 '17 at 10:01

The answer to this question depends on the training algorithm (technology) you use. For example, I have seen approaches in ensemble classification where the training and validation sets (but not the test set) are combined at the end.

It is very important to know that even though the validation set is used mainly to choose the hyperparameters, some of these hyperparameters are a function of the data used for training. For example, in a DNN the validation set is used to know when to stop: overfitting can happen as a result of continuing to tune the parameters (weights) of the network, so we need a way to know when to stop. Without the validation set you will be walking blindly through the training process. On the other hand, if you simply reuse exactly the same number of iterations as determined before, there is a high probability that you will not gain anything from the additional samples.

The test set should not be touched at all. As indicated above, without the test set you will have no method to evaluate your model. That is gambling: you CANNOT deliver any model or solution without an estimate of its accuracy on the true data distribution (which is represented by the test data).
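As a hedged illustration of the early-stopping point, using scikit-learn's MLPClassifier as a small stand-in for a DNN (the data and network size are placeholders): a fraction of the training data is set aside purely to decide when to stop, which is exactly the role the validation set plays here.

```python
# Sketch only: validation data used solely to decide when training should stop.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

clf = MLPClassifier(
    hidden_layer_sizes=(64,),
    early_stopping=True,      # hold out part of the training data internally
    validation_fraction=0.1,  # used only to monitor the stopping criterion
    n_iter_no_change=10,
    max_iter=500,
    random_state=0,
)
clf.fit(X, y)
print("stopped after", clf.n_iter_, "iterations")  # this count depends on the data seen
```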

Bashar Haddad
  • I meant: after training on the train data, tuning the hyperparameters on the validation data, and choosing the model based on the test data, can I train my model on the entire data? Or can I combine the training data and validation data after I am done with hyperparameter tuning, and estimate the accuracy using the test data? Apologies for writing it incorrectly; I have corrected it now. – Apoorva Abhishekh Apr 03 '17 at 14:44

Never do anything with the test dataset. I am surprised this question has many positive votes.

Overall, the point of keeping the test set on the sidelines is to evaluate the ML model's generalization capability on unseen observations.

Lastly, try to use stratified splitting techniques if you want proportional classes in your train, validation, and test sets.
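A hedged sketch of such stratified splitting, assuming scikit-learn; the dataset and the 60/20/20-style proportions are illustrative choices, not something the answer prescribes:

```python
# Sketch only: stratify each split so class proportions stay roughly equal.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Split off the test set first, then split the remainder into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)
```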

Full Array