I am developing a classification model using an imbalanced dataset. I am trying to use different sampling techniques to improve the model performance.
For my baseline model, I defined an AdaBoost model like so:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import GridSearchCV, KFold

# 5-fold cross-validation without shuffling
kf = KFold(n_splits=5, shuffle=False)
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
params = {
    'n_estimators': [50, 100, 200],
    'random_state': [42]
}
grid_ada = GridSearchCV(ada, param_grid=params, cv=kf, n_jobs=-1,
                        scoring='precision').fit(X_train, y_train)
# The best precision score is
grid_ada.best_score_
0.5294693068932419
# look at all the validation scores
grid_ada.cv_results_['mean_test_score']
array([0.51916435, 0.52946931, 0.48800155])
# check the test scores are in line with what we expect from the CV scores
precision_score(y_test, grid_ada.predict(X_test))
0.4423076923076923
In this case, I am not able to determine whether my validation result (~53%) is a good representation of my test result (~44%), and if it isn't, why that is the case.
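For reference, this is how I am comparing the two, using the per-fold spread of the best parameter set as a rough yardstick (a minimal sketch, assuming grid_ada has been fitted as above):

import numpy as np

# spread of the cross-validation scores for the best parameter combination
best = grid_ada.best_index_
cv_mean = grid_ada.cv_results_['mean_test_score'][best]
cv_std = grid_ada.cv_results_['std_test_score'][best]
test_precision = precision_score(y_test, grid_ada.predict(X_test))
print(f"CV precision:   {cv_mean:.3f} +/- {cv_std:.3f}")
print(f"Test precision: {test_precision:.3f}")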
I suppose my question can be split into 3 parts:
- When do we determine that a validation set is a good representative of a test set? Should the difference between the two results be within a certain range?
- What are some of the reasons for large discrepancies between the validation and test set results? I know from a previous question that data leakage from the training set into the validation set (for example, by upsampling the data before splitting it) can cause this, but are there any other obvious reasons? (The resampling sketch after this list shows how I intend to avoid that particular leak.)
- Does class imbalance influence the reliability of the validation results? Should I be using StratifiedKFold (see the fold-distribution sketch after this list), as the scikit-learn documentation states:
Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance there could be several times more negative samples than positive samples. In such cases it is recommended to use stratified sampling as implemented in StratifiedKFold and StratifiedShuffleSplit to ensure that relative class frequencies is approximately preserved in each train and validation fold.
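To avoid the leakage mentioned in the second point, my plan is to keep the resampling inside the cross-validation. This is a minimal sketch; it assumes the third-party imbalanced-learn package and uses a RandomOverSampler purely for illustration:

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

# the sampler sits inside the pipeline, so it is refit on the training folds
# only and the validation folds are scored on unresampled data
pipe = Pipeline([
    ('sampler', RandomOverSampler(random_state=42)),
    ('ada', AdaBoostClassifier(random_state=42)),
])
pipe_params = {'ada__n_estimators': [50, 100, 200]}
grid_pipe = GridSearchCV(pipe, param_grid=pipe_params, cv=kf, n_jobs=-1,
                         scoring='precision').fit(X_train, y_train)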
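On the third point, here is a quick check of how the positive-class fraction varies per validation fold under plain KFold versus StratifiedKFold (a minimal sketch, assuming the positive class is labelled 1):

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y_arr = np.asarray(y_train)
for cv in (KFold(n_splits=5, shuffle=False),
           StratifiedKFold(n_splits=5, shuffle=False)):
    # fraction of positives in each validation fold
    fractions = [(y_arr[val_idx] == 1).mean()
                 for _, val_idx in cv.split(X_train, y_arr)]
    print(type(cv).__name__, np.round(fractions, 3))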
UPDATE:
I have taken two additional steps that have made my validation set more representative of my test set:
- I am now using StratifiedKFold for the cross-validation, like so:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=False)
- Regarding the initial split of the data into train and test sets, I am now using the stratify option in train_test_split, because I expect that the future data this model will make predictions on will have a similarly imbalanced class distribution, so stratifying by class percentages makes sense:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                     shuffle=True, test_size=0.2,
                                                     random_state=42)
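As a quick sanity check on the stratified split, the class proportions of the full data, the train split and the test split can be compared directly (a minimal sketch, assuming the labels are encoded as non-negative integers such as 0/1):

import numpy as np

# class proportions before and after the stratified split
for name, labels in [('full', y), ('train', y_train), ('test', y_test)]:
    labels = np.asarray(labels)
    print(name, np.round(np.bincount(labels) / len(labels), 3))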