How to use the $\chi^2$ test to select features that are strings or categorical?

Question

I want to use statistical tests to select the features that have the strongest relationship with the output variable.

Thanks to this article, I learned that the scikit-learn library provides the SelectKBest class, which can be used with a range of statistical tests to select a specific number of features.

Here is my dataframe:

    Do you agree    Gender  Age     City     Urban/Rural  Output
0   Yes             Female  25-34   Madrid   Urban        Will buy
1   No              Male    18-25   Valencia Rural        Won't
2   ...             ...     ...     ...      ...          Undecided
....

The output is one of 'Will buy', 'Won't' or 'Undecided'.

I then tried the chi-squared statistical test, which requires non-negative features, to select the 10 best features:

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range
#apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  #print 10 best features

But some columns contain strings, so I get the following traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-59-e64d61febefd> in <module>
      1 bestfeatures = SelectKBest(score_func=chi2,k=10)
----> 2 fit = bestfeatures.fit(X,y)
      3 dfscores = pd.Dataframes(X.columns)
      4 #concat two dataframes for better visualization
      5 featuresScores = pd.concat([dfcolumns,dfscores], axis = 1)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_selection\univariate_selection.py in fit(self, X, y)
    339         self : object
    340         """
--> 341         X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
    342 
    343         if not callable(self.score_func):

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    754                     ensure_min_features=ensure_min_features,
    755                     warn_on_dtype=warn_on_dtype,
--> 756                     estimator=estimator)
    757     if multi_output:
    758         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    565         # make sure we actually converted to numeric:
    566         if dtype_numeric and array.dtype.kind == "O":
--> 567             array = array.astype(np.float64)
    568         if not allow_nd and array.ndim >= 3:
    569             raise ValueError("Found array with dim %d. %s expected <= 2."

ValueError: could not convert string to float: 'Yes'

score 0 · Accepted Answer · answered Feb 05 '20 at 14:15

You can only compute chi2 between two numerical arrays. You are getting that error because your features contain strings, which cannot be converted to floats. I am also not sure whether it works for multiclass classification.

from sklearn.preprocessing import LabelEncoder
df = df.apply(LabelEncoder().fit_transform)  # encode each column as integer codes

This will solve the problem for you. However, there are many ways to encode features, and others will likely work better for your data.
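As a minimal sketch of how the encoding step could feed into SelectKBest, assuming a small made-up dataframe (only the column names come from the question; the rows and the choice of k=2 are invented for illustration):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder

# Invented rows for illustration; only the column names come from the question.
df = pd.DataFrame({
    "Do you agree": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
    "Age": ["25-34", "18-25", "25-34", "18-25", "35-44", "35-44"],
    "City": ["Madrid", "Valencia", "Madrid", "Valencia", "Madrid", "Valencia"],
    "Urban/Rural": ["Urban", "Rural", "Urban", "Rural", "Urban", "Rural"],
    "Output": ["Will buy", "Won't", "Will buy", "Undecided", "Will buy", "Won't"],
})

# Encode every column as non-negative integer codes (the encoder is refit per column).
encoded = df.apply(LabelEncoder().fit_transform)

X = encoded.drop(columns="Output")  # features
y = encoded["Output"]               # target with three classes

# Score each feature against the target and keep the 2 highest-scoring ones.
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

feature_scores = pd.DataFrame({"Specs": X.columns, "Score": selector.scores_})
print(feature_scores.nlargest(2, "Score"))
```

On the multiclass doubt: as far as I know, scikit-learn's chi2 binarizes the target internally, so a three-class output like this one is accepted. Keep in mind that label encoding assigns arbitrary integer magnitudes to categories and chi2 treats feature values as counts, so one-hot encoding (for example with pd.get_dummies) may give more meaningful scores.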

Carlos Mougan

Thanks for your answer. Do you know how I can handle a TypeError when calling fit_transform? When applying it to a larger dataset, I got: TypeError: ("'<' not supported between instances of 'NoneType' and 'str'", 'occurred at index Segment Cluster') – Revolucion for Monica Feb 05 '20 at 15:28
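That TypeError most likely comes from missing values: LabelEncoder sorts the unique values of a column, and None cannot be compared with a string. One possible workaround, sketched here on a hypothetical two-column dataframe (the values and the "missing" placeholder are assumptions, only the "Segment Cluster" name comes from the error message), is to fill the gaps with an explicit category before encoding:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data with missing entries; values are invented for illustration.
df = pd.DataFrame({
    "Segment Cluster": ["A", None, "B", "A"],
    "Gender": ["Female", "Male", None, "Male"],
})

# Replace missing values with an explicit placeholder category,
# then cast to str so every column holds a single comparable type.
cleaned = df.fillna("missing").astype(str)

# Encoding now works because all values in a column can be sorted.
encoded = cleaned.apply(LabelEncoder().fit_transform)
print(encoded)
```

Whether a placeholder category, dropping the rows, or imputing the most frequent value is the better choice depends on how much data is missing and why.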
