
I'm having a hard time interpreting the results of my logistic regression.

I have a few questions. Firstly, how can I check whether one feature is more important than the others, i.e. whether it is actually statistically significant?

I have an accuracy of 0.58, which is pretty bad. Anyway, why does the RFECV feature selector from sklearn tell me to use only feature1 to get the best result, while the model trained with the statsmodels library using all 5 variables is slightly better?
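For context, here is a minimal sketch of the kind of RFECV call I mean (it assumes the same X and y that are defined in the snippets below):

from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# recursive feature elimination with cross-validation: repeatedly
# drops the weakest feature and keeps the subset with the best
# cross-validated accuracy
selector = RFECV(LogisticRegression(max_iter=1000), cv=5, scoring='accuracy')
selector.fit(X, y)
print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 = selected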

Also, why do my two models give different results when I train them with sklearn and with statsmodels?

[screenshot of the two model outputs]

All of this really confuses me. It would be nice if someone could show me an easy way to interpret my results, and how to do it all in one library.

My code snippets:

import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt

X = df_n_4[cols]
y = df_n_4['Survival']

# use train/test split with different random_state values;
# changing random_state changes the accuracy score, and the scores
# change a lot, which is why a single test score is a high-variance estimate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

logit_model = sm.Logit(y_train, X_train).fit()
y_pred = logit_model.predict(X_test)

cf_matrix = confusion_matrix(y_test, y_pred.round())
sns.heatmap(cf_matrix, annot=True)
plt.title('Accuracy: {}'.format(accuracy_score(y_test, y_pred.round())))
plt.ylabel('Actual scenario')
plt.xlabel('Predicted scenario')
plt.show()
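Note: one possible contributor to the statsmodels/sklearn gap (an assumption about my setup, but a common pitfall) is that sm.Logit does not add an intercept automatically, while sklearn's LogisticRegression fits one by default. Adding a constant column makes the statsmodels fit comparable:

# sm.add_constant prepends a column of ones, so the Logit model
# gets an intercept like sklearn's fit_intercept=True default
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)
logit_model = sm.Logit(y_train, X_train_const).fit()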

Part 2

from sklearn.linear_model import LogisticRegression

X = df_n_4[cols]
y = df_n_4['Survival']

# use train/test split with different random_state values;
# changing random_state changes the accuracy score, and the scores
# change a lot, which is why a single test score is a high-variance estimate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
print(len(y_train), "training samples")

# check classification scores of logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print('Train/Test split results:')
print(logreg.__class__.__name__ + " accuracy is %2.3f" % accuracy_score(y_test, y_pred))

cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix, annot=True)
plt.title('Accuracy Score: {0}, Variables: feature1'.format(round(accuracy_score(y_test, y_pred), 2)), size=15)
plt.ylabel('Actual scenario')
plt.xlabel('Predicted scenario')
plt.show()
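As Oxbowerce notes in the comments below, scikit-learn applies an L2 penalty by default, which statsmodels' Logit does not. A minimal sketch of turning the penalty off so both libraries fit the same unpenalized model (penalty=None assumes scikit-learn >= 1.2; older versions accept penalty='none'):

# disable the default L2 regularization so the coefficients are
# directly comparable to an unpenalized statsmodels Logit fit
logreg_unpenalized = LogisticRegression(penalty=None, max_iter=1000)
logreg_unpenalized.fit(X_train, y_train)
print(logreg_unpenalized.coef_, logreg_unpenalized.intercept_)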

I don't have much of a statistical background, so please go easy on me! I will provide any further information you need. Thanks so much; I've been stuck for a week, as everything I read and try confuses me even more.

grumpyp
  • The reason the outputs from scikit-learn and statsmodels differ is that scikit-learn uses an L2 penalty by default; see also the documentation. – Oxbowerce Jan 16 '21 at 20:08
  • Thank you for that! That's one hint already! Can you also answer the other questions, @Oxbowerce? And which library would you use in my case? – grumpyp Jan 16 '21 at 20:14
  • With regards to which library to use, it depends on whether or not you want an L2 penalty, since this seems to be implemented only within scikit-learn. If you do not need any regularization you can use either library; however, statsmodels is more focused on statistics (see also the output you receive). – Oxbowerce Jan 16 '21 at 20:54
  • @Oxbowerce thanks. I basically used statsmodels because of its better output. I don't really know what z and P and the other values mean, though, as my statistical background is limited. Sources confuse me even more, because they all say something different and are sometimes too mathematical. What's the easiest way to interpret my results in terms of feature importance, so that I could say one feature is definitely significant for the result? Can you please answer my other questions as well? – grumpyp Jan 16 '21 at 21:03
  • If by feature importance you mean the odds ratios displayed by scikit-learn, these can simply be obtained by taking e to the power of the coefficient (i.e. a coefficient of 0.000167 leads to an odds ratio of e^0.000167 = 1.000167). Since the statsmodels library also includes the coefficients in its output, you can use numpy.exp to convert those to odds ratios (see the sketch after this thread). I'm not sure, however, whether this is a good way to measure feature importance, as it's based on the regression coefficients, which are affected by the scale of your features. – Oxbowerce Jan 16 '21 at 21:59
  • In order to interpret significant features using statsmodels, you need to look at the p-values. Features whose p-value is less than your chosen level of significance (0.05 or 0.01, etc., generally 0.05) are the features that are significant in the model you fit. In your example, none of the variables has a p-value less than 0.05, therefore none of the features is significant. – Ankita Talwar Jan 16 '21 at 22:58
  • That does not mean that none of the variables can be used as good predictors for the model; it only means that when you fit the logistic model with all features, none of them significantly affects the outcome variable. This situation might change when you experiment with different combinations of features. You can select features using forward selection, backward elimination and similar techniques (a small sketch follows after this thread); there are various feature selection methods. This applies when you are fitting the model statistically. – Ankita Talwar Jan 16 '21 at 23:01
  • When you use sklearn to fit the model with a regularization term, the features are selected automatically in the process. This article might help you get more clarity on the statistical part: https://towardsdatascience.com/binary-logistic-regression-using-python-research-oriented-modelling-and-interpretation-49b025f1b510 – Ankita Talwar Jan 16 '21 at 23:02
  • @AnkitaTalwar What is still confusing is that RFECV told me to use only feature1, but the accuracy score is then lower than when I use all variables. In statsmodels, though, I actually get a p-value below 0.05, which is what I wanted. So why is the accuracy score lower now? – grumpyp Jan 17 '21 at 07:53
  • How much is the difference in accuracy? – Ankita Talwar Jan 18 '21 at 20:12
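Following the comments above, a minimal sketch of pulling the quantities they mention out of the fitted statsmodels result (logit_model is the fitted model from the first snippet):

import numpy as np

# coefficients and p-values from the fitted statsmodels result
print(logit_model.params)    # regression coefficients
print(logit_model.pvalues)   # compare against the 0.05 cut-off

# odds ratios, as suggested in the comments: e to the power of each
# coefficient, e.g. exp(0.000167) ~= 1.000167
print(np.exp(logit_model.params))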
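And a small sketch of the backward elimination mentioned above: repeatedly drop the feature with the largest p-value until every remaining one is below 0.05 (cols, X_train and y_train are assumed to be the objects from the snippets):

# naive backward elimination on p-values
features = list(cols)
while len(features) > 1:
    result = sm.Logit(y_train, X_train[features]).fit(disp=0)
    worst = result.pvalues.idxmax()   # feature with the largest p-value
    if result.pvalues[worst] < 0.05:
        break                         # all remaining features are significant
    features.remove(worst)
print(features)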

0 Answers