I am currently working with a dataset of crop-insurance records for soybeans. My ultimate goal is to build a classification model that predicts whether the soybean insurance will be activated, based on bioclimatic variables. The dataset consists of 267 observations: 171 where the insurance was not activated (Target = 0) and 96 where it was activated (Target = 1, the class I am most interested in predicting). As illustrated in Figure 1, there is overlap between the classes.
In Figure 1, "BG" stands for Background. For simplicity, we are not currently addressing it. It appears that class 1 is almost a subset of class 0.
Despite attempting to select the best hyperparameters for the RF, XGB, and SVM_RBF models, my models performed poorly, even on the training data with 4-fold CV. I also tried ADASYN synthetic oversampling to balance the classes, but my results did not improve: Youden's J statistic was around 0.1, and the AUC did not exceed 0.6.
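For context, the resampling-plus-tuning setup looked roughly like this (a minimal sketch assuming scikit-learn and imbalanced-learn; `X` and `y` are placeholders for my feature matrix and the 0/1 target, and the grid values are only illustrative). Keeping ADASYN inside the pipeline means it is refit on each training fold only, never on the validation fold:

```python
# Sketch: ADASYN + RF tuned with 4-fold stratified CV on AUC.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline

pipe = Pipeline([
    ("adasyn", ADASYN(random_state=42)),          # oversampling of the minority class
    ("rf", RandomForestClassifier(random_state=42)),
])

param_grid = {
    "rf__n_estimators": [300, 500],
    "rf__max_depth": [3, 5, None],
    "rf__min_samples_leaf": [1, 5, 10],
}

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv, n_jobs=-1)
# search.fit(X, y)
# print(search.best_params_, search.best_score_)
```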
Figure 2 illustrates the same problem of overlap, but with only one of the variables. Black bars represent BG, red represents Target = 0, and blue represents Target = 1.
The variables used for this exercise were chosen based on correlation, VIF, and their importance for crop suitability.
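For reference, the VIF screening was roughly along these lines (a sketch assuming statsmodels and pandas; `bioclim` is a placeholder DataFrame of candidate bioclimatic variables, and the 5–10 cutoff is only the usual rule of thumb):

```python
# Sketch of a VIF screen over the candidate predictors.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df: pd.DataFrame) -> pd.Series:
    """VIF of each predictor, computed with an intercept column added."""
    X = sm.add_constant(df)
    vifs = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )
    return vifs.drop("const")

# vif_table(bioclim).sort_values(ascending=False)
# Variables with VIF above roughly 5-10 are the usual candidates for removal.
```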
After reading some papers, I found two potential approaches to address the overlap: (1) delete the overlapping data points from the training set and train the model on the remaining data, or (2) create a third class for the overlapping points (e.g., Target = 3) and proceed with the classification; a rough sketch of both options follows below. What do you think I should do? I understand that there is no single solution to this problem, and I plan to test different approaches, but I would like to hear your opinions on the matter.
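To make the two options concrete, one way I could implement them is with the neighbourhood-cleaning tools in imbalanced-learn, treating the points removed by Tomek links or Edited Nearest Neighbours as the "overlapping" ones. This is only a sketch under that assumption; `X_train` and `y_train` are placeholders, and the papers may define the overlap differently:

```python
# Option 1 (sketch): drop points in the overlap region, approximated here
# with Tomek links / Edited Nearest Neighbours from imbalanced-learn.
import numpy as np
from imblearn.under_sampling import TomekLinks, EditedNearestNeighbours

# Tomek links: remove pairs of opposite-class mutual nearest neighbours.
tl = TomekLinks(sampling_strategy="all")            # clean both classes
# X_clean, y_clean = tl.fit_resample(X_train, y_train)

# ENN: remove samples whose k nearest neighbours mostly disagree with them.
enn = EditedNearestNeighbours(n_neighbors=3, sampling_strategy="all")
# X_clean, y_clean = enn.fit_resample(X_train, y_train)

# Option 2 (sketch): relabel the removed points as a third class instead of
# dropping them, using the indices the cleaner kept (sample_indices_).
# kept = enn.sample_indices_
# overlap = np.ones(len(y_train), dtype=bool); overlap[kept] = False
# y_three_class = np.where(overlap, 3, y_train)     # e.g., Target = 3
```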
Note: Youden's J statistic was calculated at the standard 50% probability cutoff (the AUC is threshold-free, so it does not depend on the cutoff).
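In case it helps, here is a sketch (assuming scikit-learn) of how both statistics can be computed, together with the cutoff that maximizes J; `y_val` and `proba_val` are placeholders for held-out labels and the predicted probability of class 1:

```python
# Sketch: AUC plus Youden's J at the 0.5 cutoff and at the best cutoff.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def threshold_report(y_true, proba):
    """Return (AUC, J at 0.5, max J over cutoffs, cutoff achieving max J)."""
    y_true, proba = np.asarray(y_true), np.asarray(proba)
    auc = roc_auc_score(y_true, proba)
    fpr, tpr, thr = roc_curve(y_true, proba)
    j = tpr - fpr                                  # Youden's J along the ROC curve
    y_hat = (proba >= 0.5).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    j_05 = tp / (tp + fn) - fp / (fp + tn)         # sensitivity + specificity - 1
    return auc, j_05, j.max(), thr[np.argmax(j)]

# auc, j_at_05, j_best, best_cutoff = threshold_report(y_val, proba_val)
```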