- The training dataset is the data you use to train your
model (the data the model learns from).
- The validation dataset is the part of your data that is not used to
train the model, but to check its performance and tune it
further. Because the model is tuned based on its performance
on the validation dataset, the model has, though
indirectly, 'seen' the validation data: you might unknowingly
bias it towards the validation data by optimizing it to
perform well on that data specifically.
- Testing comes after all the optimization (by
that time your model has been evaluated on, but not trained on, the validation
data). Once you think your model is
optimized (you can never know for certain), you can and should use the validation data as well to train
the model, along with the training data.
- The testing data, on the other hand, is the part of your data that is
completely unknown to the model (it is used neither to train nor
to validate the model). It gives a completely unbiased performance
check of your model.
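
To make the three roles concrete, here is a minimal sketch using scikit-learn. The synthetic data, the logistic regression model, the candidate values of C and the 60/20/20 split are all assumptions chosen for illustration, not something taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real labeled dataset (assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Set the test data aside first; it is not touched until the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Split what remains into training and validation data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Tune a hyperparameter using the validation data only.
best_C, best_val_acc = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = clf.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Once tuning is done, retrain on training + validation data together,
# then run the single, final check on the untouched test data.
final_clf = LogisticRegression(C=best_C, max_iter=1000).fit(X_rest, y_rest)
test_acc = final_clf.score(X_test, y_test)
print(f"best C: {best_C}, validation acc: {best_val_acc:.3f}, test acc: {test_acc:.3f}")
```

The validation accuracy here is likely to be a bit optimistic, since C was picked to maximize it; the test accuracy is the number to trust.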
Usually, in competitions, the data you are given has labels, so it should all be used to train and optimize your model. Predictions for the competition's test set should be made only after the model has been trained on all the labeled data, which includes both your training data (X_train, y_train) and your held-out validation data (X_test, y_test).
Hence you should submit predictions from a model that has seen the whole labeled dataset, i.e. clf.fit(X, Y).
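
As a rough sketch of that competition workflow with scikit-learn: the synthetic data, the random forest and the name X_unlabeled (standing in for the competition's unlabeled test set) are assumptions for illustration; X, Y, X_train, X_test and clf follow the names used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for competition data (assumption).
X_all, y_all = make_classification(n_samples=500, n_features=8, random_state=1)
X, Y = X_all[:400], y_all[:400]   # the labeled data you are given
X_unlabeled = X_all[400:]         # hypothetical competition test set, labels hidden

# Hold part of the labeled data out to check and tune the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.25, random_state=1
)

clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X_train, y_train)
holdout_acc = clf.score(X_test, y_test)
print(f"held-out accuracy: {holdout_acc:.3f}")

# Happy with the model? Refit it on ALL the labeled data before predicting.
# Calling fit again retrains from scratch with the same settings.
clf.fit(X, Y)
predictions = clf.predict(X_unlabeled)
```

The held-out accuracy is only an estimate for the final model, which has seen more data and will usually do at least as well.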
I know this long explanation was not necessary, but one should know why you do what you do.
Hope it helps, thanks!!