I have a 40:1 class imbalance in my binary classification problem. To address it, I oversampled the minority class by generating synthetic samples. With the classes now almost balanced, my ROC AUC is 0.99, which is far better than I expected. I suspect my model is just memorizing synthetic samples that are too similar to one another and won't generalize. The analogy I think of: my original majority class is a flat city, while my augmented minority class is a tall tower, so it is very easy for the model to pick out. Is it common to achieve near-perfect classification after data augmentation? How do I keep the model from memorizing the data? Thanks
- Score on train or test? Did you create synthetic examples before or after splitting? What's the score when training on the original data? What kind of model? – Ben Reiniger Mar 22 '22 at 17:25
- The AUROC was 0.99 on both train and test, so no overfitting. I created the synthetic examples before splitting. On the original data, AUROC is misleading because the classes are very imbalanced (1:40); PR-AUC was 0.6. I mostly use a neural net and logistic regression on these. – Kamyar Yazdani Mar 22 '22 at 19:22
- You should not create synthetic examples before splitting. https://datascience.stackexchange.com/q/15630/55122 – Ben Reiniger Mar 22 '22 at 20:14
- How would you do cross-validation with synthetic samples added? – Kamyar Yazdani Mar 24 '22 at 12:12
- @KamyarYazdani In cross-validation you should create the synthetic samples inside the CV loop, using only the training set specific to each fold. That way the test set for every fold represents the true distribution of the data (see the sketch after this thread). – Erwan Mar 24 '22 at 19:59
- Makes sense... thanks so much. – Kamyar Yazdani Mar 25 '22 at 14:03
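
To make the fold-wise procedure from the comments concrete, here is a minimal sketch of oversampling inside a cross-validation loop. It assumes imbalanced-learn's `SMOTE` as the synthetic-sample generator and a toy 40:1 dataset built with `make_classification`; neither appears in the thread above, so treat them as stand-ins for the actual data and oversampler.

```python
# Minimal sketch: oversample inside each CV fold, evaluate on the
# untouched test fold. SMOTE and the toy dataset are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE

# Toy 40:1 imbalanced problem standing in for the real data.
X, y = make_classification(n_samples=4100, weights=[40 / 41],
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    # Oversample *inside* the fold, using only the training rows,
    # so no synthetic point is derived from a test-set neighbour.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X[train_idx],
                                                      y[train_idx])
    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)

    # Score on the still-imbalanced test fold, which reflects the
    # true class distribution; report PR-AUC alongside ROC-AUC.
    scores = model.predict_proba(X[test_idx])[:, 1]
    print(f"fold {fold}: "
          f"ROC-AUC={roc_auc_score(y[test_idx], scores):.3f}, "
          f"PR-AUC={average_precision_score(y[test_idx], scores):.3f}")
```

If you would rather let `cross_val_score` handle this pattern, imbalanced-learn's `imblearn.pipeline.Pipeline` accepts a sampler as a step and applies it only while fitting, so each fold's test data is left untouched automatically.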