60

I am working on a problem with too many features, and training my models takes way too long. I implemented a forward selection algorithm to choose features.

However, I was wondering: does scikit-learn have a forward selection/stepwise regression algorithm?

Maksud
  • I created my own class for that, but I'm very surprised that sklearn doesn't have it. – Maksud Aug 09 '14 at 14:36
  • 1
    Using hypothesis tests is a terrible method of feature selection. You'll have to do a lot of them and, of course, you'll get a lot of false positives and negatives. – Ricardo Cruz Jun 03 '16 at 08:51

8 Answers

34

No, scikit-learn does not seem to have a forward selection algorithm. However, it does provide recursive feature elimination (RFE), which is a greedy feature elimination algorithm similar to sequential backward selection. See the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
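
For illustration, a minimal RFE sketch (the estimator, synthetic data, and feature count are illustrative choices, not part of the original answer):

    from sklearn.datasets import make_regression
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LinearRegression

    # synthetic data: 3 of 10 features are informative
    X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                           noise=10, random_state=0)

    # recursively drop the weakest feature (by coefficient) until 5 remain
    rfe = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)
    print(rfe.support_)  # boolean mask of the selected features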

brentlance
  • 4
    Good suggestion, but the problem with the scikit-learn implementation is that feature importance is quantified by the model coefficients, i.e. it requires the model to have a coef_ interface. This would rule out tree-based methods, etc. However, I think what @Maksud asked for is what is described in "An Introduction to Statistical Learning" by James, in which features are recursively added/removed by their importance, which is quantified by validation-set accuracy. This allows feature selection across all types of models, not just linear parametric ones. – eggie5 Apr 18 '17 at 21:38
  • scikit-learn actually also supports (some?) tree-based methods for feature elimination via feature importance (see e.g. the RandomForestClassifier). But more general methods like SVMs are indeed not supported. – Martin Becker Apr 01 '20 at 17:11
  • This answer is outdated. There is indeed a forward selection implementation in sklearn. https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html – anonuser01 Apr 26 '21 at 00:24
  • 1
    @anonuser01 What you suggest is not correct. Forward stepwise selection does not require n_features_to_select to be set beforehand, but sklearn's SequentialFeatureSelector (the thing that you linked) does, so these two things are different. – Eiffelbear Oct 09 '21 at 08:45
19

Scikit-learn indeed does not support stepwise regression. That's because what is commonly known as 'stepwise regression' is an algorithm based on p-values of the coefficients of a linear regression, and scikit-learn deliberately avoids the inferential approach to model learning (significance testing and the like). Moreover, pure OLS is only one of numerous regression algorithms, and from the scikit-learn point of view it is neither very important nor one of the best.

There are, however, some pieces of advice for those who still need a good way to do feature selection with linear models:

  1. Use inherently sparse models like ElasticNet or Lasso.
  2. Normalize your features with StandardScaler, and then order your features just by model.coef_. For perfectly independent covariates this is equivalent to sorting by p-values. The class sklearn.feature_selection.RFE will do it for you, and RFECV will even evaluate the optimal number of features (see the RFECV sketch after the example output below).
  3. Use an implementation of forward selection by adjusted $R^2$ that works with statsmodels: https://planspace.org/20150423-forward_selection_with_statsmodels/
  4. Do brute-force forward or backward selection to maximize your favorite metric on cross-validation (it could take approximately quadratic time in the number of covariates). The scikit-learn compatible mlxtend package supports this approach for any estimator and any metric: http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/
  5. If you still want vanilla stepwise regression, it is easier to base it on statsmodels, since this package calculates p-values for you. A basic forward-backward selection could look like this:

    from sklearn.datasets import load_boston
    import pandas as pd
    import numpy as np
    import statsmodels.api as sm

    # note: load_boston was removed in scikit-learn 1.2;
    # this example assumes an older scikit-learn version
    data = load_boston()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = data.target


    def stepwise_selection(X, y,
                           initial_list=[],
                           threshold_in=0.01,
                           threshold_out=0.05,
                           verbose=True):
        """ Perform a forward-backward feature selection
        based on p-values from statsmodels.api.OLS
        Arguments:
            X - pandas.DataFrame with candidate features
            y - list-like with the target
            initial_list - list of features to start with (column names of X)
            threshold_in - include a feature if its p-value < threshold_in
            threshold_out - exclude a feature if its p-value > threshold_out
            verbose - whether to print the sequence of inclusions and exclusions
        Returns: list of selected features
        Always set threshold_in < threshold_out to avoid infinite looping.
        See https://en.wikipedia.org/wiki/Stepwise_regression for the details
        """
        included = list(initial_list)
        while True:
            changed = False
            # forward step: add the most significant candidate, if significant enough
            excluded = list(set(X.columns) - set(included))
            new_pval = pd.Series(index=excluded, dtype=np.float64)  # explicit dtype for newer pandas
            for new_column in excluded:
                model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit()
                new_pval[new_column] = model.pvalues[new_column]
            best_pval = new_pval.min()
            if best_pval < threshold_in:
                best_feature = new_pval.idxmin()
                included.append(best_feature)
                changed = True
                if verbose:
                    print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))

            # backward step: drop the least significant included feature, if insignificant
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
            # use all coefs except the intercept
            pvalues = model.pvalues.iloc[1:]
            worst_pval = pvalues.max()  # NaN if pvalues is empty
            if worst_pval > threshold_out:
                changed = True
                worst_feature = pvalues.idxmax()
                included.remove(worst_feature)
                if verbose:
                    print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
            if not changed:
                break
        return included


    result = stepwise_selection(X, y)

    print('resulting features:')
    print(result)


This example would print the following output:

Add  LSTAT                          with p-value 5.0811e-88
Add  RM                             with p-value 3.47226e-27
Add  PTRATIO                        with p-value 1.64466e-14
Add  DIS                            with p-value 1.66847e-05
Add  NOX                            with p-value 5.48815e-08
Add  CHAS                           with p-value 0.000265473
Add  B                              with p-value 0.000771946
Add  ZN                             with p-value 0.00465162
resulting features:
['LSTAT', 'RM', 'PTRATIO', 'DIS', 'NOX', 'CHAS', 'B', 'ZN']
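
For completeness, option 2 from the list above can be done in a few lines. A minimal sketch on synthetic data (the estimator and data are illustrative assumptions):

    from sklearn.datasets import make_regression
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                           noise=10, random_state=0)

    # scale first so the coefficients are comparable across features
    X_scaled = StandardScaler().fit_transform(X)

    # RFECV picks the number of features by cross-validation
    selector = RFECV(LinearRegression(), cv=5).fit(X_scaled, y)
    print('optimal number of features:', selector.n_features_)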


David Dale
  • 2
    The posted forward stepwise regression code does not function correctly. It should give identical results to backwards stepwise regression, but it does not. It is returning factors with p-values that are higher than the threshold when you rerun the regression. I also ran the same dataset with STATA and the same thresholds using backwards stepwise and obtain materially different results. Basically, don't use it. I'm going to write my own backwards stepwise regression code using his template. – Michael Corley MBA LSSBB Apr 27 '19 at 12:38
  • 4
    Forward and backward stepwise regressions are by no means guaranteed to converge to the same solution. And if you noticed a bug in my solution, please attach the code to reproduce it. – David Dale Apr 27 '19 at 14:18
  • Is it possible to change it into stepwise selection for Logistic Regresion? I have tried changing sm.OLS into sm.Logit but it produced an error – AAAA Sep 02 '20 at 08:01
  • This example is no longer working with Python 3.7: the Series needs a dtype arg (e.g. pd.Series(index=excluded, dtype='float64')), and the OLS call needs to be updated. – Ibrahim.H Jan 26 '21 at 12:02
  • 1
    @Ibrahim.H When I replace argmax with idxmax, the code works with the new versions of python and libraries. – David Dale Jan 26 '21 at 14:00
  • How come you are implementing both backward and forward stepwise regression? Doesn't it just have to be one, not both? – Maths12 Feb 10 '21 at 10:00
  • My implementation is a general stepwise selection: you start with some list of variables, and you try to include new significant features and exclude insignificant features, until the algorithm converges. If you start with an empty list, this will be forward selection. If you start with the list of all existing variables, this is a backward selection. But you can start with something in between as well. – David Dale Feb 10 '21 at 10:05
11

Sklearn DOES have a forward selection algorithm, although it isn't called that in scikit-learn. The feature selection method called f_regression in scikit-learn will sequentially include features that improve the model the most, until there are K features in the model (K is an input).

It starts by regressing the labels on each feature individually, and then observing which feature improves the model the most using the F-statistic. Then it incorporates the winning feature into the model. Then it iterates through the remaining features to find the next feature that improves the model the most, again using the F-statistic or F-test. It does this until there are K features in the model.

Notice that remaining features that are correlated with features already incorporated into the model will probably not be selected, since they do not correlate with the residuals (although they might correlate well with the labels). This helps guard against multicollinearity.
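
For reference, f_regression is typically used through SelectKBest. A minimal sketch (the data and K are illustrative assumptions):

    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SelectKBest, f_regression

    X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                           noise=10, random_state=0)

    # f_regression scores each feature with a univariate F-test;
    # SelectKBest keeps the K highest-scoring features
    selector = SelectKBest(score_func=f_regression, k=5)
    X_new = selector.fit_transform(X, y)
    print(selector.get_support())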

makansij
  • 1
    Be aware, however: http://www.stata.com/support/faqs/statistics/stepwise-regression-problems/ – makansij May 30 '16 at 20:52
  • 1
  • Yes. Also, you should read this: https://stats.stackexchange.com/questions/204141/difference-between-selecting-features-based-on-f-regression-and-based-on-r2 – makansij May 21 '17 at 21:50
  • 2
    That's sort of forward selection. But it's not generic: it is specific to a linear regression model, whereas typically forward selection can work with any model (model agnostic), as can RFE, and can handle classification or regression problems. But I suspect most people are looking for this use case and it's certainly good to mention it here. – Simon Sep 09 '17 at 20:38
  • 4
    This is not a STEPWISE selection, because each p-value is calculated for a univariate regression, independently of all the other covariates. – David Dale Nov 07 '17 at 12:17
11

As of version 0.24, it does!

Announcement, documentation
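
A minimal usage sketch (the estimator, data, and n_features_to_select are illustrative assumptions):

    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression

    X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                           noise=10, random_state=0)

    # greedy forward selection, scoring each candidate feature by cross-validation
    sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                    direction='forward', cv=5)
    sfs.fit(X, y)
    print(sfs.get_support())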

Ben Reiniger
3

In fact there is a nice algorithm called "Forward_Select" that uses statsmodels and allows you to set your own metric (AIC, BIC, adjusted R-squared, or whatever you like) to progressively add variables to the model. The algorithm can be found in the comments section of this page: scroll down and you'll see it near the bottom.

https://planspace.org/20150423-forward_selection_with_statsmodels/

I would add that the algorithm also has one nice feature: you can apply it to either classification or regression problems! You just have to tell it.

Try it and see for yourself.
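
The core idea can be condensed as follows; this is a rough sketch of forward selection by adjusted R-squared with statsmodels, not the exact code from the linked page:

    import statsmodels.formula.api as smf

    def forward_select(data, response):
        """Greedy forward selection on a pandas DataFrame `data`,
        maximizing adjusted R-squared of an OLS fit on `response`."""
        remaining = set(data.columns) - {response}
        selected = []
        current_score = float('-inf')
        while remaining:
            # score every candidate model that adds one more variable
            scores = []
            for candidate in remaining:
                formula = '{} ~ {}'.format(response, ' + '.join(selected + [candidate]))
                scores.append((smf.ols(formula, data).fit().rsquared_adj, candidate))
            best_score, best_candidate = max(scores)
            if best_score <= current_score:
                break  # no candidate improves adjusted R-squared any more
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_score
        return selected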

2

Actually sklearn doesn't have a forward selection algorithm, though a pull request with an implementation of forward feature selection has been waiting in the scikit-learn repository since April 2017.

As an alternative, there is forward and one-step-ahead backward selection in mlxtend. You can find its documentation under Sequential Feature Selector.
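
A minimal sketch of both directions (the estimator and k_features are illustrative placeholders):

    from mlxtend.feature_selection import SequentialFeatureSelector as SFS
    from sklearn.linear_model import LinearRegression

    # forward=True adds one feature per step; forward=False starts from all
    # features and removes one per step (sequential backward selection)
    forward_sfs = SFS(LinearRegression(), k_features=5, forward=True)
    backward_sfs = SFS(LinearRegression(), k_features=5, forward=False)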

Ynjxsjmh
  • I used it (from mlxtend), works very well and integrates nicely with scikit-learn, recommended! – dolphin Jun 24 '20 at 11:53
1

Yes

sklearn.feature_selection.SequentialFeatureSelector

https://scikit-learn.org/0.24/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html

also

http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/

from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt

# note: load_boston was removed in scikit-learn 1.2
boston = load_boston()
X, y = boston.data, boston.target

lr = LinearRegression()

# forward selection over all 13 features, scored by 10-fold CV MSE
sfs = SFS(lr, k_features=13, forward=True, floating=False,
          scoring='neg_mean_squared_error', cv=10)
sfs = sfs.fit(X, y)

# plot the CV score against the number of selected features
fig = plot_sfs(sfs.get_metric_dict(), kind='std_err')
plt.title('Sequential Forward Selection (w. StdErr)')
plt.grid()
plt.show()

thistleknot
0

I developed this repository (link). My stepwise selection classes (best subset, forward stepwise, backward stepwise) are compatible with sklearn, and you can use them with Pipeline and GridSearchCV.
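
The pattern such sklearn-compatible classes enable looks roughly like this, illustrated here with sklearn's own SequentialFeatureSelector standing in for the repository's classes:

    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                           noise=10, random_state=0)

    pipe = Pipeline([
        ('select', SequentialFeatureSelector(LinearRegression(), direction='forward')),
        ('model', LinearRegression()),
    ])

    # the number of selected features can be tuned like any other hyperparameter
    grid = GridSearchCV(pipe, {'select__n_features_to_select': [3, 5, 7]}, cv=5)
    grid.fit(X, y)
    print(grid.best_params_)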

HE Xin