Changing reference class in imbalanced data drastically affects the error rate

Question

I am working on a binary classification problem that tries to predict customer churn. The data set is imbalanced, with 2000 non-churn observations and 600 churn observations.

Using a GLM, I see that when the majority class (non-churn) is the reference level, I get a ~40% error rate in the confusion matrix on both levels (churn and non-churn). When the minority class is set as the reference level, I get close to a 100% error rate on the minority class; in other words, almost everything gets predicted as non-churn.

After balancing the data using SMOTE, the same trend continues. How should I interpret this behaviour?

Does this mean that part of the non-churn population behaves much like the churners (hence the high error rate), while another subset of the non-churn users behaves quite differently from the churners (hence the lower error rate when the majority, non-churn class is the reference)?

Outcome on test data when the majority class is set as the reference class:
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
             0   1    Error      Rate
    0      268 419 0.609898  =419/687
    1       46 168 0.214953   =46/214
    Totals 314 587 0.516093  =465/901

Outcome on test data when the minority class is set as the reference class:
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
                   1   0    Error      Rate
            1      3 211 0.985981  =211/214
            0      1 686 0.001456    =1/687
            Totals 4 897 0.235294  =212/901

I'm a little confused by your second paragraph -- would you be able to edit the question to include the exact numbers in your confusion matrix? — timleathart, Nov 06 '17 at 10:17
In what part of the algorithm do you use 1 or 0 as "reference"? For all I know about GLM, swapping zeros and ones should not affect error rate at all. — David Dale, Nov 07 '17 at 16:00

score 1 · Answer 1 · answered Nov 07 '17 at 16:12

The decision function of a GLM does not itself depend on the choice of "reference level". What probably does depend on it is your threshold, and I guess it is chosen poorly.
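To make the first claim concrete, here is a minimal sketch in plain Python. It assumes a one-feature logistic regression fitted by gradient descent as a stand-in for the GLM in the question (the actual tool is not named there), and the data are synthetic. Fitting once with churn coded as 1 and once with non-churn coded as 1 should give mirrored coefficients, so the two predicted probabilities for any customer sum to 1.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=0.5, steps=500):
    """Fit P(y=1 | x) = sigmoid(w*x + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(r * x for r, x in zip(residuals, xs)) / n
        b -= lr * sum(residuals) / n
    return w, b


random.seed(0)
# Synthetic, imbalanced toy data (an assumption, not the asker's data):
# 200 non-churners centred at 0 and 60 churners centred at 1.
xs = [random.gauss(0.0, 1.0) for _ in range(200)]
xs += [random.gauss(1.0, 1.0) for _ in range(60)]
churn = [0] * 200 + [1] * 60          # churn is the "1" level
non_churn = [1 - y for y in churn]    # same data, levels swapped

w_churn, b_churn = fit_logistic(xs, churn)
w_non, b_non = fit_logistic(xs, non_churn)

# For every customer, P(churn) from the first fit plus P(non-churn)
# from the second fit should be 1 up to floating-point error.
max_gap = max(
    abs(sigmoid(w_churn * x + b_churn) + sigmoid(w_non * x + b_non) - 1.0)
    for x in xs
)
print("coefficients:", (w_churn, b_churn), (w_non, b_non))
print("largest deviation from 1:", max_gap)
```

Since the two fits describe the same model, any difference in the confusion matrices has to come from how the threshold is picked afterwards, not from the fit.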

For the problem of churn prediction, you probably shouldn't use the error rate or a confusion matrix at all. You can read here why ROC AUC or other metrics can be preferable to accuracy. Or, if you still use the error rate, choose your threshold more carefully than by letting an algorithm maximize the F1 metric for a single class.
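The effect of maximizing F1 for a single class can be checked directly from the two confusion matrices in the question. The sketch below assumes that the tool computes F1 for the non-reference level (so churn is "positive" in the first matrix and non-churn in the second); that is a guess about the tool, not something the question states. The helper names are made up for illustration.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    return 2 * tp / (2 * tp + fp + fn)


def f1_for_class(cm, positive):
    """F1 of one class, given counts keyed by (actual, predicted)."""
    negative = 1 - positive
    return f1(
        tp=cm[(positive, positive)],
        fp=cm[(negative, positive)],
        fn=cm[(positive, negative)],
    )


# Counts copied from the question; 1 = churn, 0 = non-churn.
majority_ref = {(0, 0): 268, (0, 1): 419, (1, 0): 46, (1, 1): 168}
minority_ref = {(1, 1): 3, (1, 0): 211, (0, 1): 1, (0, 0): 686}

# Assumed positive class: churn when non-churn is the reference level.
f1_churn_a = f1_for_class(majority_ref, positive=1)
# Assumed positive class: non-churn when churn is the reference level.
f1_nonchurn_b = f1_for_class(minority_ref, positive=0)
# F1 for churn under that second threshold.
f1_churn_b = f1_for_class(minority_ref, positive=1)
# Baseline: label all 901 test customers as non-churn.
f1_nonchurn_all = f1(tp=687, fp=214, fn=0)

print(round(f1_churn_a, 4), round(f1_nonchurn_b, 4))
print(round(f1_churn_b, 4), round(f1_nonchurn_all, 4))
```

Under that assumption, F1 for non-churn in the second matrix is about 0.87, almost identical to what you get by labelling everyone as non-churn. For a majority class, "predict positive for nearly everybody" is already close to F1-optimal, which would explain why churn is almost never predicted in that run.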

What you need in the end is to decide for each client whether to treat her as ready-to-churn (it would certainly cost you $c_1$) or leave her alone (but if she churns, you lose $c_2$). In that case, you will profit from your anti-churn measures if and only if the probability of churn is higher than $\frac{c_1}{c_2}$. This is the natural threshold for your classification problem, and you can use the corresponding cost function to measure your success. Or, if you don't know the exact losses $c_1$ and $c_2$ in advance, use ROC AUC, which averages over all the possible thresholds. And yes, ROC AUC is not affected by class balance/imbalance.
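A small sketch of that cost-based rule, with made-up numbers. It assumes, as the argument above implicitly does, that treating a client always costs $c_1$ and fully avoids the loss $c_2$; the costs, the probabilities, and the function names are hypothetical.

```python
def natural_threshold(c1, c2):
    """Treat a client when P(churn) * c2 > c1, i.e. when P(churn) > c1 / c2."""
    return c1 / c2


def total_expected_cost(probs, c1, c2, threshold):
    """Expected cost of treating every client whose churn probability exceeds threshold."""
    cost = 0.0
    for p in probs:
        if p > threshold:
            cost += c1        # pay for the anti-churn measure
        else:
            cost += p * c2    # expected loss if she is left alone
    return cost


c1, c2 = 10.0, 50.0                 # hypothetical costs
probs = [0.05, 0.15, 0.25, 0.60]    # hypothetical predicted churn probabilities

t_star = natural_threshold(c1, c2)
cost_natural = total_expected_cost(probs, c1, c2, t_star)
cost_half = total_expected_cost(probs, c1, c2, 0.5)

print("threshold:", t_star)
print("expected cost at c1/c2:", cost_natural)
print("expected cost at 0.5:", cost_half)
```

With these numbers the default 0.5 cut-off leaves the client with a 0.25 churn probability untreated, which costs more in expectation than treating her. The same cost function can be evaluated on held-out data to compare models.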
