
Suppose you want to model (predict) a rare disease, and you use the parameter scale_pos_weight as a hyperparameter in XGBoost. For example, I have 20 times more positive cases; can I then use scale_pos_weight = 0.05, even though the ratio in the real world would be not 1/20 but 1/2000?

Aniel Kali

1 Answer


This is a tricky question because it depends on your objective.

If your objective is to have comparable performance on the two classes (i.e. comparable sensitivity and specificity), and the imbalance in your training data is $1:20$, then yes, it makes sense to give each minority-class example a weight 20 times greater than each majority-class example.
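As a minimal sketch of that balancing case (the counts below are made up for illustration), `scale_pos_weight` is set to the ratio of negative to positive training examples:

```python
# Sketch: class-balancing weight when the only goal is comparable
# sensitivity/specificity. Counts are hypothetical.
n_pos = 1_000    # minority (disease) cases in the training data
n_neg = 20_000   # majority (healthy) cases -> 1:20 imbalance

# XGBoost's scale_pos_weight multiplies the weight of positive examples;
# n_neg / n_pos gives both classes the same total weight.
scale_pos_weight = n_neg / n_pos
print(scale_pos_weight)  # 20.0
```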

However, in most applications, this is not exactly the objective. You should use the real costs of misclassification, or an estimation thereof. Ideally, you would quantify the costs of false negatives and false positives (although this is easier said than done, especially when comparing the cost of illness to, for example, the cost of a medical professional's wasted time).

Then, if your training set had the same class priors (how often a class occurs) as the real data, you would use these costs directly as weights for, respectively, the positive and negative classes.
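For instance, with matching priors and hypothetical costs $c_{fn} = 4$ and $c_{fp} = 2$, the costs could serve directly as the class weights; since XGBoost only exposes their ratio, that reduces to:

```python
# Sketch: when training priors equal real-world priors, the
# misclassification costs themselves are the class weights.
# Cost values are hypothetical.
c_fn = 4.0  # cost of missing a sick patient (false negative)
c_fp = 2.0  # cost of a false alarm (false positive)

w_pos = c_fn  # weight for positive (disease) examples
w_neg = c_fp  # weight for negative (healthy) examples

# XGBoost only exposes the positive/negative ratio:
scale_pos_weight = w_pos / w_neg
print(scale_pos_weight)  # 2.0
```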

In your case, it's more complicated, because the priors in your training data are different from those in the real data. You cannot use the misclassification costs directly as weights; you have to adjust for the real distribution.

Using a dataset with a $1:20$ imbalance instead of $1:2000$, you've effectively already given the positive class about 100 times more weight than it has in the real distribution (equivalently, the negative class about 100 times less). It's as if a weight were already hidden in the sampling. For the misclassification costs to be used as weights, you have to cancel out that factor: multiply each class's cost by the ratio of its real-world prior to its training prior.

Suppose $c_{fp}$ is the cost of a false positive (misclassifying a negative example) and $c_{fn}$ is the cost of a false negative (misclassifying a positive example). If the rate of positive examples is $1/20 = 0.05$ in the training data and $1/2000 = 0.0005$ in the real distribution of your domain, then you would use:

$$ \frac{\frac{1}{2000}}{\frac{1}{20}}\cdot c_{fn} = \frac{c_{fn}}{100}$$ as the weight for errors on the positive class and $$\frac{\frac{1999}{2000}}{\frac{19}{20}}\cdot c_{fp} \approx 1.052 \cdot c_{fp}$$

as the weight for errors on the negative class.
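One way to compute such prior-corrected weights (importance weighting: scale each class's cost by its real prior over its training prior; the costs here are hypothetical):

```python
# Sketch of prior-corrected class weights. Costs are hypothetical.
pi_train_pos = 1 / 20    # positive rate in the training sample
pi_real_pos = 1 / 2000   # positive rate in the real (deployment) population
c_fn, c_fp = 4.0, 2.0    # assumed misclassification costs

# Importance weights: (real prior / training prior) times the cost.
w_pos = (pi_real_pos / pi_train_pos) * c_fn              # 0.04
w_neg = ((1 - pi_real_pos) / (1 - pi_train_pos)) * c_fp  # ~2.10

# XGBoost only exposes the positive/negative weight ratio:
scale_pos_weight = w_pos / w_neg
print(round(scale_pos_weight, 3))  # 0.019
```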

  • Hi Vincent, thank you for your comprehensive answer. So the xgboost parameter is scale_pos_weight = sum(negative instances) / sum(positive instances). Suppose $c_{fp} = 2$ and $c_{fn} = 4$. Is scale_pos_weight then 0.02? – Aniel Kali Sep 27 '18 at 19:51
  • Not quite:

    $$ w_{pos} = \frac{\frac{1}{2000}}{\frac{1}{20}} \cdot c_{fn} = \frac{\frac{1}{2000}}{\frac{1}{20}} \cdot 4 = 0.04 $$ $$ w_{neg} = \frac{\frac{1999}{2000}}{\frac{19}{20}} \cdot c_{fp} = \frac{\frac{1999}{2000}}{\frac{19}{20}} \cdot 2 \approx 2.10 $$

    Now, since xgboost does not seem to allow you to set a weight for the negative class, you'll have to bring it to 1 by dividing both weights by $w_{neg}$:

    $$ \text{scale_pos_weight} = \frac{w_{pos}}{w_{neg}} \approx 0.019 $$

    – Vincent B. Lortie Sep 27 '18 at 21:48