I am developing a classification model using an imbalanced dataset. I am trying to use different sampling techniques to improve the model performance.
For my baseline model, I defined an AdaBoost model like so:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import GridSearchCV, KFold

# 5-fold cross-validation without shuffling
kf = KFold(n_splits=5, shuffle=False)
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
params = {
    'n_estimators': [50, 100, 200],
    'random_state': [42]
}
grid_ada = GridSearchCV(ada, param_grid=params, cv=kf, n_jobs=-1,
                        scoring='precision').fit(X_train, y_train)
# The best precision score is
grid_ada.best_score_
0.5294693068932419
# look at all the validation scores
grid_ada.cv_results_['mean_test_score']
array([0.51916435, 0.52946931, 0.48800155])
# check the test scores are in line with what we expect from the CV scores
precision_score(y_test, grid_ada.predict(X_test))
0.4423076923076923
In this case, I am not able to determine whether my validation result (~53%) is a good representation of my test result (~44%), and if it isn't, why that is the case.
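For reference, this is how I am comparing the two, using the per-fold spread of the best parameter set as a rough yardstick (a minimal sketch, assuming grid_ada has been fitted as above):

import numpy as np

# spread of the cross-validation scores for the best parameter combination
best = grid_ada.best_index_
cv_mean = grid_ada.cv_results_['mean_test_score'][best]
cv_std = grid_ada.cv_results_['std_test_score'][best]
test_precision = precision_score(y_test, grid_ada.predict(X_test))
print(f"CV precision:   {cv_mean:.3f} +/- {cv_std:.3f}")
print(f"Test precision: {test_precision:.3f}")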
I suppose my question can be split into 3 parts:
- When do we determine that a validation set is a good representative of a test set? Should the difference between the two results be within a certain range?
- What are some of the reasons for large discrepancies between the validation and test set results? I know from a previous question that data leakage from the training set into the validation set (for example, by upsampling the data before splitting it) can cause this, but are there any other obvious reasons? (The resampling sketch after this list shows how I intend to avoid that particular leak.)
- Does class imbalance influence the reliability of the validation results? Should I be using StratifiedKFold (see the fold-distribution sketch after this list), as the scikit-learn documentation states:
Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance there could be several times more negative samples than positive samples. In such cases it is recommended to use stratified sampling as implemented in StratifiedKFold and StratifiedShuffleSplit to ensure that relative class frequencies is approximately preserved in each train and validation fold.
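To avoid the leakage mentioned in the second point, my plan is to keep the resampling inside the cross-validation. This is a minimal sketch; it assumes the third-party imbalanced-learn package and uses a RandomOverSampler purely for illustration:

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

# the sampler sits inside the pipeline, so it is refit on the training folds
# only and the validation folds are scored on unresampled data
pipe = Pipeline([
    ('sampler', RandomOverSampler(random_state=42)),
    ('ada', AdaBoostClassifier(random_state=42)),
])
pipe_params = {'ada__n_estimators': [50, 100, 200]}
grid_pipe = GridSearchCV(pipe, param_grid=pipe_params, cv=kf, n_jobs=-1,
                         scoring='precision').fit(X_train, y_train)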
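On the third point, here is a quick check of how the positive-class fraction varies per validation fold under plain KFold versus StratifiedKFold (a minimal sketch, assuming the positive class is labelled 1):

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y_arr = np.asarray(y_train)
for cv in (KFold(n_splits=5, shuffle=False),
           StratifiedKFold(n_splits=5, shuffle=False)):
    # fraction of positives in each validation fold
    fractions = [(y_arr[val_idx] == 1).mean()
                 for _, val_idx in cv.split(X_train, y_arr)]
    print(type(cv).__name__, np.round(fractions, 3))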
UPDATE:
I have taken two additional steps that have made my validation set more representative of my test set:
- I am now using StratifiedKFold for the cross-validation, like so:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=False)
- Regarding the initial split of the data into train and test sets, I am now using the stratify option in train_test_split, because I expect that the future data this model will make predictions on will have a similarly imbalanced class distribution, so stratifying by class percentages makes sense:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                     shuffle=True, test_size=0.2,
                                                     random_state=42)
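As a quick sanity check on the stratified split, the class proportions of the full data, the train split and the test split can be compared directly (a minimal sketch, assuming the labels are encoded as non-negative integers such as 0/1):

import numpy as np

# class proportions before and after the stratified split
for name, labels in [('full', y), ('train', y_train), ('test', y_test)]:
    labels = np.asarray(labels)
    print(name, np.round(np.bincount(labels) / len(labels), 3))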