
I have train and test sets of chronological data consisting of 305,000 and 70,000 instances, respectively. Each instance has 15 features and one of only 2 possible class values (NEW, OLD). The problem is that there are only 725 OLD instances in the train set and 95 in the test set.

The only algorithm that handles the imbalance at all for me is NaiveBayes in Weka (0.02 precision for the OLD class); the others (trees) classify every instance as NEW. What is the best approach to handling the imbalance, and which algorithm is appropriate in such a case?

Thank you in advance.

rokpoto.com
  • I asked a somewhat related question: http://datascience.stackexchange.com/q/810/2661 – pnp Aug 07 '14 at 12:24
  • Also, did you try the BayesNet (Bayesian Networks) algorithm in Weka and tune the MaxNrOfParents argument of the K2 search algorithm? I found it of good help in class imbalance problems. – pnp Aug 07 '14 at 12:28
  • http://cs229.stanford.edu/proj2005/AltendorfBrendeDanielLessard-FraudDetectionForOnlineRetailUsingRandomForests.pdf A good read that involves a similar 'rare-event' problem. The authors use a random forest and optimize based on the ratio of class occurrence in the training set. (I'm not affiliated, but was just reading this a few days ago for a problem I'm working on). – nfmcclure Aug 08 '14 at 16:12

4 Answers


I'm not allowed to comment, but I have a suggestion: you could try an over-sampling technique such as SMOTE: http://scholar.google.com/scholar?q=oversampling+minority+classes
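To make the idea concrete, here is a simplified, numpy-only sketch of what SMOTE does under the hood (in practice you would use a packaged implementation, e.g. the SMOTE filter available as a Weka package); the function name and parameters below are illustrative, not part of any library:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between a
    minority instance and one of its k nearest minority neighbours --
    the core idea behind SMOTE (simplified sketch)."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    # indices of the k nearest neighbours of every minority sample
    nn = np.argsort(d, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(n)              # pick a random minority sample
        neigh = X_min[rng.choice(nn[j])] # pick one of its neighbours
        gap = rng.random()               # interpolation factor in [0, 1]
        synthetic[i] = X_min[j] + gap * (neigh - X_min[j])
    return synthetic

# Sketch of usage on this question's data (X_old: the 725 OLD rows, 15 features):
# X_aug = np.vstack([X_old, smote_like(X_old, 2175)])   # 4x the OLD class
```

Note that SMOTE assumes numeric features; nominal attributes need a variant such as SMOTE-NC.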

theafh

You can apply a clustering algorithm to the instances of the majority class and train a classifier on the centroids/medoids produced by the clustering algorithm. This subsamples the majority class, the converse of oversampling the minority class.
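A minimal sketch of this idea, using a plain Lloyd's k-means (in practice a library implementation such as Weka's SimpleKMeans would be used instead); the function name and the cluster count are illustrative:

```python
import numpy as np

def kmeans_centroids(X, k, n_iter=50, rng=None):
    """Plain Lloyd's k-means; returns k centroids that summarise X."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    # initialise with k distinct points drawn from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # recompute centroids; keep the old one if a cluster went empty
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

# Sketch: replace the ~304,000 NEW instances with, say, 1,450 centroids
# (2x the minority count), then train on centroids + all 725 OLD instances.
# X_new_compressed = kmeans_centroids(X_new, k=1450)
```

The choice of k controls how aggressively the majority class is compressed; it is a tuning parameter, not something this answer prescribes.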

damienfrancois

In addition to undersampling the majority class (i.e. taking only a few NEW instances), you may consider oversampling the minority class (in essence, duplicating your OLD instances, though there are smarter ways to do that).

Note that oversampling may lead to overfitting, so pay special attention to how you test your classifiers.
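The naive duplication variant can be sketched in a few lines of numpy (the function name and signature are illustrative): minority instances are drawn with replacement until both classes are the same size.

```python
import numpy as np

def rebalance(X, y, minority_label, rng=None):
    """Naive random oversampling: duplicate minority instances (drawn
    with replacement) until both classes have equal counts."""
    rng = np.random.default_rng(rng)
    X, y = np.asarray(X), np.asarray(y)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    # extra copies needed to match the majority count
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    idx = np.concatenate([maj_idx, min_idx, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Because the duplicates are exact copies, do the resampling on the training split only, after any train/test split, so the test set stays untouched.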

Check also this answer on CV:

Alexey Grigorev

In Weka, you can assign weights to your training instances. Set the weights in inverse proportion to the class frequencies and you should be good to go. Another option is to play around with sampling.
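A sketch of the weight computation, assuming the common n_total / (n_classes * n_class) scheme (the helper name is illustrative; in Weka the resulting weights would be set per instance, e.g. via Instance.setWeight):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each instance by n_total / (n_classes * n_class), so that
    every class contributes equally in aggregate during training."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]

# With 304,275 NEW and 725 OLD training instances, each OLD instance
# gets a weight of roughly 210, versus roughly 0.5 for each NEW one.
```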