
So I'm doing a logistic regression with statsmodels and sklearn. My result confuses me a bit. I used a feature selection algorithm in the previous step, which tells me to use only feat1 for the regression.

The results are the following:

[Screenshot: Logit model summary and confusion matrix]

So the model predicts everything as 1, and my p-value is < 0.05, which tells me it's a pretty good indicator. But the accuracy score is < 0.6, which means it basically says nothing.

Can you give me a hint how to interpret this? It's my first data science project with difficult data.

My code:

X = df_n_4['feat1']
y = df_n_4['Survival']

# Use train/test split with different random_state values;
# changing random_state changes the accuracy scores.
# The scores change a lot, which is why the testing score is a
# high-variance estimate.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
print(len(y_train), " training samples")

# Check classification scores of logistic regression
logit_model = sm.Logit(y_train, X_train).fit()
y_pred = logit_model.predict(X_test)
print('Train/Test split results:')
plt.title('Accuracy Score: {}, variable: feat1'.format(round(accuracy_score(y_test, y_pred.round()), 3)))
cf_matrix = confusion_matrix(y_test, y_pred.round())
sns.heatmap(cf_matrix, annot=True)
plt.ylabel('Actual scenario')
plt.xlabel('Predicted scenario')
plt.show()
print(logit_model.summary2())

grumpyp
  • No, unfortunately not @Oxbowerce – grumpyp Jan 17 '21 at 10:26
  • Make sure you add an intercept to the model (it is not added automatically in statsmodels). If this does not help, use "shrinkage" (e.g. from sklearn) or switch to a method other than Logit. https://stats.stackexchange.com/questions/440242/statsmodels-logistic-regression-adding-intercept – Peter Jan 17 '21 at 18:24
  • https://datascience.stackexchange.com/a/74445/71442 – Peter Jan 17 '21 at 18:25
  • Read here, why a constant in linear models is usually needed: https://datascience.stackexchange.com/questions/80812/removing-constant-from-the-regression-model/80822?noredirect=1#comment92035_80822 – Peter Jan 17 '21 at 18:27
  • @Peter I use a logistic model, is that the same as what you mean by linear? – grumpyp Jan 18 '21 at 09:53
  • Logit is not exactly the same as a pure linear (OLS) model, but both usually need an intercept term. One reason for this is that the models are linear in parameters, like $y = a + bx + u$. – Peter Jan 18 '21 at 20:21
  • @grumpyp your model is predicting only class 1, which should tell you that something is wrong with the way you trained it. It is not predicting class 2 at all. What kind of feature selection have you done? Could you add that part of the code as well? – spectre Oct 24 '21 at 11:26
  • Check the probability outputs of your model, not just the classes. Remember that logistic regression does not explicitly perform classification; it gives you probability values that you can compare to a threshold (often $0.5$ is the software default) to get a category, though this may not be what you want to do (1) (2). – Dave Oct 24 '21 at 12:56

2 Answers


To summarize from the comments:

  1. statsmodels doesn't automatically add an intercept.

  2. Use the predicted probabilities, not just the hard classification (that you've obtained by rounding the predictions).

It doesn't seem to me that anything is seriously wrong with the model, though perhaps it's not a particularly great model. I would try some models with more of the features; you haven't said anything about the feature selection method, and feature selection is hard.

Ben Reiniger

Something is wrong with your feature selection tool: the p-value is NaN and the confidence interval includes $0$. The confusion matrix shows that all observations are predicted as class 1. How many explanatory variables do you have? Try using all of them instead of just one. Are you sure

logit_model = sm.Logit(y_train, X_train).fit()

is correct? Shouldn't it be the other way around, logit_model = sm.Logit(X_train, y_train).fit()?

Alex
  • I think it's correct as logit_model = sm.Logit(y_train, X_train).fit(). What do you mean by the confidence interval? In my model where I use all features it works better. But if I use sklearn and one feature it works as well. It's all so confusing! – grumpyp Jan 17 '21 at 13:02
  • Obviously, from what you wrote, your model with a single feature doesn't work at all. – Alex Jan 17 '21 at 13:02
  • Can you tell me why? @Alex – grumpyp Jan 18 '21 at 09:43
  • I don't know, but the confusion matrix shows it. – Alex Jan 18 '21 at 18:45