I built a logistic regression in scikit-learn and all of my predicted values are 0. That can't be right; the model must have at least some predictive power.
I am trying to predict which flights are likely to be delayed. In my exploratory data analysis, I found that some origins cause more problems than others, and some airlines tend to be later than others. I therefore used the origin and reporting airline as my independent variables and lateness (yes or no) as my dependent variable. I tried both Python and R, and both yield the same results.
The data:
Since I can't include a data file, I will describe the data as best I can. It has four columns: Reporting_Airline (with entries such as HJ, AK, CF), Origin (three-letter airport codes such as HYS, JSI, SHS), Dest (also three-letter airport codes such as YSU, HSU, JSA), and late, which is either 0 or 1. There are 159k rows. I have converted the independent variables, which are of character type, to dummy variables in Python and factors in R.
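To illustrate the dummy-variable step described above, here is a minimal sketch using `pd.get_dummies` on a toy frame; the rows are made-up stand-ins for the real flight data, not actual values:

```python
import pandas as pd

# Toy frame with hypothetical carrier/airport codes standing in for the real data
toy = pd.DataFrame({
    "Reporting_Airline": ["HJ", "AK", "HJ"],
    "Origin": ["HYS", "JSI", "HYS"],
    "late": [0, 1, 0],
})

# One-hot encode the character columns; 'late' is already numeric and is left alone
dummies = pd.get_dummies(toy)
print(list(dummies.columns))
```

Each character column becomes one indicator column per distinct level, so the toy frame ends up with five columns in total.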
The code:
Here is my code in R:
Rdata2$Reporting_Airline <- as.factor(Rdata2$Reporting_Airline)
Rdata2$Dest <- as.factor(Rdata2$Dest)
Rdata2$late <- as.factor(Rdata2$late)
Rdata2$Origin <- as.factor(Rdata2$Origin)
logistic <- glm(late ~ Origin+ Reporting_Airline, data=Rdata2,
family="binomial")
res2 = predict(logistic,Rdata2, type = "response")
res2
And here is the confusion matrix showing how well the model performed:
| Actual Values | Predicted: False | Predicted: True |
|---|---|---|
| 0 | 139191 | 8 |
| 1 | 30576 | 11 |
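Since `predict(..., type = "response")` returns probabilities rather than classes, the confusion matrix above implies a cutoff was applied first. A minimal Python sketch of that step, using made-up probabilities and labels and an assumed 0.5 cutoff:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical predicted probabilities and true labels (not the real flight data)
probs = np.array([0.1, 0.6, 0.4, 0.8, 0.2])
y_true = np.array([0, 1, 0, 1, 1])

# Apply a 0.5 cutoff to turn probabilities into class predictions
y_pred = (probs >= 0.5).astype(int)

# Rows are actual values, columns are predicted values
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Lowering the cutoff below 0.5 would move more cases into the predicted-True column.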
The Python Code:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

log_reg = LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                             intercept_scaling=1, l1_ratio=None, max_iter=4000,
                             multi_class='auto', n_jobs=None, penalty='l2',
                             random_state=10, solver='lbfgs', tol=0.0001, verbose=0,
                             warm_start=False)

data4 = data3[['Origin', 'Dest', 'Reporting_Airline', 'late']]
dummies = pd.get_dummies(data4)
dummies.head()

X_train, X_test, y_train, y_test = train_test_split(
    dummies.drop('late', axis=1), dummies.late, test_size=0.1)

log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)  # predict on the features, not the labels
And it predicts everything as zero...
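One quick diagnostic worth running: the confusion matrix above has roughly 139k actual zeros against roughly 30k actual ones, so checking the class proportions of `late` directly shows how imbalanced the target is. A minimal sketch, using a hypothetical stand-in series rather than the real column:

```python
import pandas as pd

# Hypothetical stand-in for the real 'late' column, just to illustrate the check
late = pd.Series([0, 0, 0, 0, 1])

# Proportion of each class; a heavy skew toward 0 can pull predictions toward 0
print(late.value_counts(normalize=True))
```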