
I am trying to do ordinal encoding using:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

I will try to explain my problem with a simple dataset.

X = pd.DataFrame({'animals':['low','med','low','high','low','high']})
enc = OrdinalEncoder()
enc.fit_transform(X.loc[:,['animals']])

array([[1.],
       [2.],
       [1.],
       [0.],
       [1.],
       [0.]])

It is labelling alphabetically, but if I try:

enc = OrdinalEncoder(categories=['low','med','high'])
enc.fit_transform(X.loc[:,['animals']])

Shape mismatch: if n_values is an array, it has to be of shape (n_features,).

I do not understand this error. I would like to be able to decide how the labelling is done.

I considered doing this:

level_mapping={'low':0,'med':1,'high':2}
X['animals'] = X['animals'].replace(level_mapping)

However, I have a large number of features in my dataset that share similar categories, so mapping each one by hand is impractical.

Thanks.

Ethan
Ayush Ranjan

1 Answer


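I'm not sure whether you ever figured this out, but I was looking for an answer to this exact question and couldn't find a good one. I finally worked it out. OrdinalEncoder can encode multiple columns of a dataframe at once, so the categories parameter expects one list of values per column. A flat list such as ['low', 'med', 'high'] is read as three per-column entries, which does not match the single column being encoded, hence the shape mismatch error. So, when you instantiate OrdinalEncoder(), give the categories parameter a list of lists: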

enc = OrdinalEncoder(categories=[list_of_values_cat1, list_of_values_cat2, etc])

Specifically, in your example above, you would just put ['low', 'med', 'high'] inside another list:

enc = OrdinalEncoder(categories=[['low', 'med', 'high']])
enc.fit_transform(X.loc[:,['animals']])
>>array([[0.],
         [1.],
         [0.],
         [2.],
         [0.],
         [2.]])
# Now 'low' is mapped to 0, 'med' to 1, and 'high' to 2, as intended
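
If you want to double-check which order the encoder actually used, the following is a minimal sketch, assuming the same X as in the question. It reads the fitted categories_ attribute and round-trips the codes with inverse_transform; the variable names learned_order, encoded, and decoded are just illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same toy data as in the question
X = pd.DataFrame({'animals': ['low', 'med', 'low', 'high', 'low', 'high']})

# One inner list per column, in the order the codes should follow
enc = OrdinalEncoder(categories=[['low', 'med', 'high']])
encoded = enc.fit_transform(X[['animals']])

# categories_ holds one array per column; position in the array is the code
learned_order = list(enc.categories_[0])

# inverse_transform maps the numeric codes back to the original labels
decoded = enc.inverse_transform(encoded)

print(learned_order)
print(decoded.ravel().tolist())
```

The position of each label in categories_[0] is the integer it is encoded as, so this is a quick way to confirm the mapping before using the encoded column.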

To see how you can encode multiple columns with their own individual ordinal values, try this:

# Sample dataframe with 2 ordinal categorical columns: 'temp' and 'place'
categorical_df = pd.DataFrame({'my_id': ['101', '102', '103', '104'],
                               'temp': ['hot', 'warm', 'cool', 'cold'], 
                               'place': ['third', 'second', 'first', 'second']})

In the 'temp' column, I want 'cold' to be 0, 'cool' to be 1, 'warm' to be 2, and 'hot' to be 3

In the 'place' column, I want 'first' to be 0, 'second' to be 1, and 'third' to be 2

temp_categories = ['cold', 'cool', 'warm', 'hot']
place_categories = ['first', 'second', 'third']

Now, when you instantiate the encoder, both of these lists go in one big categories list:

encoder = OrdinalEncoder(categories=[temp_categories, place_categories])

encoder.fit_transform(categorical_df[['temp', 'place']])
>>array([[3., 2.],
         [2., 1.],
         [1., 0.],
         [0., 1.]])
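
For the case mentioned in the question, where many features share the same levels, one possible approach is to repeat a single ordered list once per column. This is only a sketch under assumptions: the column names 'risk', 'priority', and 'severity' and the variables levels, ordinal_cols, and encoded are made up for illustration, and it assumes every listed column uses exactly the same set of labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical dataframe: three columns that all use the same levels
df = pd.DataFrame({
    'risk': ['low', 'med', 'high', 'low'],
    'priority': ['high', 'high', 'low', 'med'],
    'severity': ['med', 'low', 'low', 'high'],
})

levels = ['low', 'med', 'high']                 # desired order, shared by all columns
ordinal_cols = ['risk', 'priority', 'severity']

# categories needs one list per column, so repeat the shared list
encoder = OrdinalEncoder(categories=[levels] * len(ordinal_cols))

# Wrap the result in a DataFrame to keep the column names and index
encoded = pd.DataFrame(
    encoder.fit_transform(df[ordinal_cols]),
    columns=ordinal_cols,
    index=df.index,
)
print(encoded)
```

This avoids writing a separate mapping dictionary for each feature. If some columns have different levels, they would need their own inner list in the corresponding position.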

fugumagu
Thank you! list of lists data type for categories input was exactly what I was looking for. I agree: sklearn's documentation on this kinda sucks. You really helped me out! – EEE Feb 23 '21 at 18:43