
I have train and test sets of chronological data consisting of 305,000 and 70,000 instances, respectively. Each instance has 15 features and one of only 2 possible class values (NEW, OLD). The problem is that there are only 725 OLD instances in the train set and 95 in the test set.

The only algorithm that handles the imbalance at all for me is NaiveBayes in Weka (0.02 precision for the OLD class); the others (trees) classify every instance as NEW. What is the best approach to handling the imbalance, and which algorithm is appropriate in such a case?

Thank you in advance.

rokpoto.com
  • I asked a somewhat related question: http://datascience.stackexchange.com/q/810/2661 – pnp Aug 07 '14 at 12:24
  • Also, did you try the BayesNet (Bayesian Networks) algorithm in Weka and tune the MaxNrOfParents argument of the K2 search algorithm? I found it of good help in class imbalance problems. – pnp Aug 07 '14 at 12:28
  • http://cs229.stanford.edu/proj2005/AltendorfBrendeDanielLessard-FraudDetectionForOnlineRetailUsingRandomForests.pdf A good read that involves a similar 'rare-event' problem. The authors use a random forest and optimize based on the ratio of class occurrence in the training set. (I'm not affiliated, but was just reading this a few days ago for a problem I'm working on). – nfmcclure Aug 08 '14 at 16:12

4 Answers


I'm not allowed to comment, but I have a suggestion: you could try an over-sampling technique such as SMOTE: http://scholar.google.com/scholar?q=oversampling+minority+classes
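To make the idea concrete, here is a simplified, numpy-only sketch of what SMOTE does under the hood (in practice you would use a packaged implementation, e.g. the SMOTE filter available as a Weka package); the function name and parameters below are illustrative, not part of any library:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between a
    minority instance and one of its k nearest minority neighbours --
    the core idea behind SMOTE (simplified sketch)."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    # indices of the k nearest neighbours of every minority sample
    nn = np.argsort(d, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(n)              # pick a random minority sample
        neigh = X_min[rng.choice(nn[j])] # pick one of its neighbours
        gap = rng.random()               # interpolation factor in [0, 1]
        synthetic[i] = X_min[j] + gap * (neigh - X_min[j])
    return synthetic

# Sketch of usage on this question's data (X_old: the 725 OLD rows, 15 features):
# X_aug = np.vstack([X_old, smote_like(X_old, 2175)])   # 4x the OLD class
```

Note that SMOTE assumes numeric features; nominal attributes need a variant such as SMOTE-NC.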

theafh

You can apply a clustering algorithm to the instances of the majority class and train a classifier on the centroids/medoids produced by the clustering algorithm. This subsamples the majority class, the converse of oversampling the minority class.
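A minimal sketch of this idea, using a plain Lloyd's k-means (in practice a library implementation such as Weka's SimpleKMeans would be used instead); the function name and the cluster count are illustrative:

```python
import numpy as np

def kmeans_centroids(X, k, n_iter=50, rng=None):
    """Plain Lloyd's k-means; returns k centroids that summarise X."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    # initialise with k distinct points drawn from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # recompute centroids; keep the old one if a cluster went empty
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

# Sketch: replace the ~304,000 NEW instances with, say, 1,450 centroids
# (2x the minority count), then train on centroids + all 725 OLD instances.
# X_new_compressed = kmeans_centroids(X_new, k=1450)
```

The choice of k controls how aggressively the majority class is compressed; it is a tuning parameter, not something this answer prescribes.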

damienfrancois

In addition to undersampling the majority class (i.e. taking only a few NEW instances), you may consider oversampling the minority class (in essence, duplicating your OLD instances, though there are smarter ways to do that).

Note that oversampling may lead to overfitting, so pay special attention to how you test your classifiers.
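The naive duplication variant can be sketched in a few lines of numpy (the function name and signature are illustrative): minority instances are drawn with replacement until both classes are the same size.

```python
import numpy as np

def rebalance(X, y, minority_label, rng=None):
    """Naive random oversampling: duplicate minority instances (drawn
    with replacement) until both classes have equal counts."""
    rng = np.random.default_rng(rng)
    X, y = np.asarray(X), np.asarray(y)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    # extra copies needed to match the majority count
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    idx = np.concatenate([maj_idx, min_idx, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Because the duplicates are exact copies, do the resampling on the training split only, after any train/test split, so the test set stays untouched.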

Check also this answer on CV:

Alexey Grigorev

In Weka, you can assign weights to your training instances. Set the weights in inverse proportion to the class frequencies and you should be good to go. Another option is to play around with sampling.
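A sketch of the weight computation, assuming the common n_total / (n_classes * n_class) scheme (the helper name is illustrative; in Weka the resulting weights would be set per instance, e.g. via Instance.setWeight):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each instance by n_total / (n_classes * n_class), so that
    every class contributes equally in aggregate during training."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]

# With 304,275 NEW and 725 OLD training instances, each OLD instance
# gets a weight of roughly 210, versus roughly 0.5 for each NEW one.
```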