Why does my logistic regression predict all 0's?

Question

I built a logistic regression in scikit-learn and all of my predicted values are 0, it can't be so. It must have at least some predictability power.

I am trying to predict which flights are likely to be delayed. In my exploratory data analysis, I found that some origins cause more problems than others and some airlines tend to be later than others as well. I, therefore, used the origin and reporting airline as my independent variables and lateness (Yes or No) as my dependent variable. I tried both Python and R and both yield the same results.

The data:

Since I can't include a data file I will try my best to explain the data. The data has simply three columns: Reporting_Airline (which has entries HJ, AK, CF, etc.) Origin (which also has data HYS, JSI, SHS, etc. they are airport names. The Destination is also like that having three letters YSU, HSU, JSA, etc. There is also the fourth column, which is 'late' and that is either 0 or 1. There are 159k entries. I have converted the dependent variables which are of Char type to dummy variables in Python and factors in R.

The code:

Here is my code in R:

Rdata2$Reporting_Airline <- as.factor(Rdata2$Reporting_Airline)
Rdata2$Dest <- as.factor(Rdata2$Dest)
Rdata2$late <- as.factor(Rdata2$late)
Rdata2$Origin <- as.factor(Rdata2$Origin
logistic <- glm(late ~ Origin+ Reporting_Airline, data=Rdata2, 
                family="binomial")
res2 = predict(logistic,Rdata2, type = "response")
res2

And here is the confusion matrix showing how well the model performed:

Actual Values	Predicted Values
	False	True
0	139191	8
1	30576	11

The Python Code:

log_reg = LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=4000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=10, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
import numpy as np
log_reg
data4 = data3[['Origin', 'Dest', 'Reporting_Airline','late']]
dummies = pd.get_dummies(data4)

dummies.head
X_train, X_test, y_train, y_test = train_test_split(dummies.drop('late', axis=1),dummies.late,test_size=0.1)
log_reg.fit(X_train, y_train)
log_reg.predict(y_train)

y_pred = log_reg.predict(y_train)

And it returns everything to be zero...

What kind of ROCAUC do you get? You have a considerable class imbalance, it seems. While probabilistic predictions are the right way to assess models of balanced data, too, your prior distribution shifts the posterior probability down towards $0$…as it should. Software usually uses a classification threshold of $0.5$. — Dave, Jun 18 '21 at 03:10
I would add, that you have too few samples for the "True" class, and your model prediction for this class is bad 58% correct predictions only for True Class in the confusion matrix. You should get more True samples. Moreover predicting lateness on the selected features may be very difficult, could you have some extra meaningful features ? maybee adding features may help the model: weather, crowded plane or airport, hollydays, etc ... — Malo, Jun 22 '21 at 06:33

score 1 · Answer 1 · edited Jun 21 '21 at 08:58

The predictions are always 0 due to the massive imbalance in the data.

The positive class represents only 0.01% of the data. In this context, for the model to "take the risk" of predicting some instances as positive, it would need some very strong indicators. Essentially the model needs to be 99.99% sure when it predicts an instance as positive, because if the confidence is any lower then the model would make more False Positive errors than it finds True Positive instances. Since it never reaches this level of confidence for a positive instance, it predicts every instance as negative. You could look at the probability distribution in your current experiment: chances are that the probabilities are all ridiculously low.

If you really want to see some instances predicted as positive, you can:

undersample the negative class and/or oversample the positive class. This will force the model to predict more positive instances since the risk of error is relatively lower.
Change the threshold for predicting the class: the default is to pick the class which has the highest probability, in other words, to predict a positive only if the probability of positive is higher than 0.5. If you set the threshold to a very small value $p$ instead, the instances which obtain a higher probability than $p$ are going to be predicted as positive.

Naturally, both options will cause a lot more False Positive errors on the test set, which must follow the true distribution of course.

Why does my logistic regression predict all 0's?

1 Answers1