13

I have data with 5 output classes. The training data has the following no of samples for these 5 classes: [706326, 32211, 2856, 3050, 901]

I am using the following keras (tf.keras) code:

import numpy as np
import tensorflow as tf
from tensorflow.keras import metrics
from sklearn.utils import class_weight

class_weights = class_weight.compute_class_weight('balanced',
                                                  np.unique(y_train),
                                                  y_train)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, input_shape=(dataX.shape[1],)),
    tf.keras.layers.Dropout(rate = 0.5),
    tf.keras.layers.Dense(50, activation=tf.nn.relu),
    tf.keras.layers.Dropout(rate = 0.5),
    tf.keras.layers.Dense(50, activation=tf.nn.relu),
    tf.keras.layers.Dropout(rate = 0.5),
    tf.keras.layers.Dense(50, activation=tf.nn.relu),
    tf.keras.layers.Dropout(rate = 0.5),
    tf.keras.layers.Dense(5, activation=tf.nn.softmax) ])
adam = tf.keras.optimizers.Adam(lr=0.5)

model.compile(optimizer=adam, 
              loss='sparse_categorical_crossentropy',
              metrics=[metrics.sparse_categorical_accuracy])    
model.fit(X_train, y_train, epochs=5, batch_size=32, class_weight=class_weights)

y_pred = np.argmax(model.predict(X_test), axis=1)

The line computing class_weights is taken from one of the answers to this question: How to set class weights for imbalanced classes in Keras?

I know about this answer: Multi-class neural net always predicting 1 class after optimization. The difference is that in that case class weights weren't used, whereas I am using them.

I am using sparse_categorical_crossentropy, which accepts the categories as integers (no need to convert them to one-hot encoding), but I also tried categorical_crossentropy and had the same problem.

I have of course tried different learning rates, batch sizes, numbers of epochs, optimizers, and depths/widths of the network. But the accuracy always gets stuck at ~0.94, which is essentially what I would get by predicting the first class all the time.

Not sure what is missing here. Any error? Or should I use some other specialized deep network?

dbm
  • Have you tried other loss functions that are tailored towards imbalanced classes (such as F1 loss or focal loss)? – Shamit Verma Feb 01 '19 at 08:15
  • The problem could be your model architecture: fully connected layers. Try using layers other than fully connected ones; maybe don't even use fully connected layers at all ... – Antonio Jurić Feb 01 '19 at 12:10
  • @ShamitVerma, are these loss functions available ready to use in Keras? – dbm Feb 01 '19 at 14:41
  • @AntonioJurić, which other layers did you mean? – dbm Feb 01 '19 at 14:42
  • @dbm no, you will have to write those as per your use-case and backend. – Shamit Verma Feb 01 '19 at 15:01
  • @AntonioJurić, The problem is not about worse performance than expected etc. here. The problem is that class_weight is not doing anything at all in the above code. No matter what class_weight I choose, there is no difference! In the end, the model only learns the dominant class. There is something wrong either in my code or with the class_weight implementation in keras. – dbm Feb 02 '19 at 04:12
  • @ShamitVerma, also, I think class_weight should capture at least part of the imbalance in the data in the loss function. Apparently it is not being captured with the above code. – dbm Feb 02 '19 at 04:41
  • Your learning rate is pretty high. Have you tried numbers closer to 0.001? Also try turning down your dropout. – kbrose Feb 04 '19 at 14:15
  • @kbrose, tried them. No luck. – dbm Feb 04 '19 at 23:53
  • Can't comment, hence posting as an answer: can you please paste the output of class_weights? – solver149 Feb 05 '19 at 19:20
  • I usually see NNs converge to the mean class label quickly, but they improve if I let them continue to train, even if they seem to make no progress for several epochs. – kbrose Feb 06 '19 at 14:03
  • @dbm While your accuracy is stuck at 0.94, have you checked whether your trained model only predicts the dominant class? That would be the first step to check against your hypothesis. Maybe 0.94 is actually the limit of your model given the dataset. – Louis T Feb 09 '19 at 04:56
  • @LouisT, yes, I checked. – dbm Feb 10 '19 at 22:23
  • Tell me more about the data. What kind of features do you pass to the network? Is it multi-label or multi-class classification? What metric would you accept as a good solution? – Евгений Смирнов Feb 05 '19 at 17:20

6 Answers

9

1) A five-layer neural network is one heck of a complex model for a data set with less than 1 million points. (I’m trying to find a good link for this, but the intuition is that your choice of model should be driven by the complexity of the available data, and not by what you think the real target function is like.) If this is for a real-world project, a tool like XGBoost might work better on this data set — out of the box, you’ll spend less time dealing with problems related to imbalanced classes, poorly scaled data, or outliers. Of course if this is specifically for learning about neural networks, that advice isn’t much help!

2) For a class distribution that’s as skewed as your data, you might get more mileage from re-sampling the training data rather than re-weighting the classes during training. Down-sample the majority class first (just throw away majority samples at random); if that’s not satisfactory then try something more complicated like using SMOTE to up-sample the minority classes. Try taking this to the extreme; build a (collection of) new training sets by randomly sampling only 1,000 points from each class.
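A minimal down-sampling sketch along those lines (assuming X_train and y_train are NumPy arrays; the cap of 1,000 samples per class is just the example figure from above):

import numpy as np

rng = np.random.default_rng(0)

def downsample(X, y, max_per_class=1000):
    """Randomly keep at most max_per_class samples of each class."""
    keep = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        if len(idx) > max_per_class:
            idx = rng.choice(idx, size=max_per_class, replace=False)
        keep.append(idx)
    keep = np.concatenate(keep)
    rng.shuffle(keep)  # so batches are not ordered by class
    return X[keep], y[keep]

X_bal, y_bal = downsample(X_train, y_train)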

The intuition here is that, for neural networks, as far as I know, re-weighting classes basically means re-scaling the gradient steps during training based on the class weights. If the classes are skewed 10:1, that makes sense: we take a step that’s 10 times as far for a minority sample. If the classes are skewed 1000:1, as in your case, it makes less sense — we’ll take 1,000 small steps as we optimize on the majority class, and then a single gigantic step in an essentially random direction when we happen to see a minority sample, followed by 1,000 tiny steps trying to un-do this work, etc. We don’t see enough minority samples to allow information about their class to average out.

DGrady
4

One important thing to check is whether, by any chance, you have NaNs in the input data.

I had the same problem; it turned out NaNs had crept into the input, and it works great now!
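A quick check along those lines (a small sketch, assuming the features are already in a NumPy array called X_train):

import numpy as np

# count NaNs and infinite values in the training features
print(np.isnan(X_train).sum(), np.isinf(X_train).sum())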

2

Is your input data standardised? What happens when you run it with only one hidden layer? What happens when you set the learning rate to 1e-6 or 1e-5? What result do you get when you run this through a logistic regression? What does the confusion matrix look like?

When you are using class_weight= in model.fit() you should also pass weighted_metrics= in model.compile() for your convenience.
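A minimal sketch of what that could look like, assuming the model and loss from the question (the learning rate here is just a placeholder):

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'],
              weighted_metrics=['sparse_categorical_accuracy'])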

grofte
1

Why don't you try gradient boosting or AdaBoost? They should perform well on unbalanced data since, during training, they tend to give more weight to misclassified observations, which improves performance. Let me know.
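For example, a minimal scikit-learn sketch (the train/test split variable names are assumptions, not from the question):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

# fit a gradient-boosted tree ensemble and report per-class precision/recall
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))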

3nomis
  • They don't seem to perform too well. AUCROC~0.64, weighted F1 ~0.65. – dbm Feb 02 '19 at 04:40
  • Did you do a grid search? – 3nomis Feb 02 '19 at 16:32
  • Yes. I did but no improvement. – dbm Feb 02 '19 at 16:34
  • One suggestion might be to use oversampling methods such as SMOTE. You synthetically generate observations and allow the net to train on the least crowded groups. – 3nomis Feb 02 '19 at 17:40
  • Sure, I can do that. But the question is about why the above code won't work: I am changing the class weight to 'balanced' and would expect the deep network to learn at least something beyond the dominant class. – dbm Feb 02 '19 at 22:32
  • What's the proportion of the classes? If the dominant class is extremely numerous and the minority classes aren't characterised by any pattern in the features, it will be hard to classify them. – 3nomis Feb 05 '19 at 16:27
1

I think it is probably doing something, just not enough to change the overall classification. Have you inspected the estimated probability distributions of the minority class test examples for different class_weights? I imagine that the probabilities of the true classes are somewhat higher for these examples, even though the predicted most likely class is still the majority class.

If this is the case, then you can determine a value (through cross-validation or estimated on a held-out set) to subtract from the element of the estimated probability distribution corresponding to the top class before taking an argmax. i.e. test a range of values and pick the one that provides the best $F_1$ score or whatever else you want to optimise. Something like this:

# train neural net
# ...

import numpy as np
from sklearn.metrics import f1_score  # example score function; use whatever you want to optimise

# X_val / y_val: a held-out set not used for training the network
y_proba = model.predict(X_val)
best_score = 0.
best_value = 0.

# try every value from 0. to 1., in increments of 0.01
for i in np.linspace(0., 1., 101):
    # subtract the candidate value from the majority (first) class's probability
    alt_y_proba = y_proba - np.array([i, 0., 0., 0., 0.])
    alt_y_proba = np.clip(alt_y_proba, 0., 1.)  # ensure no negative values!

    y_pred = np.argmax(alt_y_proba, axis=1)
    score = f1_score(y_val, y_pred, average='macro')

    if score > best_score:
        best_score = score
        best_value = i

print(best_value)

Yes, it's a little hacky, but it may give you good results. You can re-normalise alt_y_proba to be a proper probability distribution again if you want, but it won't change the classification. Just make sure that the dataset you're optimising this subtraction value with is not used in the training of the neural net, as this may introduce overfitting.

timleathart
  • Is it essentially a ROC curve for multiclass classification? Is there an API available in scikit-learn to do this? – dbm Feb 05 '19 at 05:50
1

I just figured out the same issue. Try passing a dictionary like:

{
   0: 1.0, 
   1: 10, 
   2: 20, 
   3: 20, 
   4: 20
}

to class_weight in model.fit() and it should solve the problem. I understand the docs say an array will also work, but for me it did not.
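If you want to keep the 'balanced' weighting from the question, here is a small sketch of how the array returned by compute_class_weight could be turned into such a dictionary (assuming the labels are the integers 0–4 and reusing the imports from the question):

classes = np.unique(y_train)
weights = class_weight.compute_class_weight('balanced', classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))
# for the class counts in the question this is roughly
# {0: 0.21, 1: 4.63, 2: 52.2, 3: 48.9, 4: 165.4}

model.fit(X_train, y_train, epochs=5, batch_size=32, class_weight=class_weights)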

Stephen Rauch
Yz Liu