LabelEncoding your features is a bad practice
You should avoid using LabelEncoder to encode your input features! Don't believe me? Here's what scikit-learn's official documentation for LabelEncoder says:
This transformer should be used to encode target values, i.e. y, and not the input X.
That's why it's called LabelEncoding: the "label" refers to the target y, not the features.
Why you shouldn't use LabelEncoder to encode features
This encoder simply maps a feature's unique values to integers. For example, let's say we want to encode a feature called shirt color, which represents the color of the shirt someone's wearing. This feature has values ['red', 'green', 'blue', ...]. If you encode these into integers, i.e. [1, 2, 3, ...], you might confuse your model, because you have now given these values relationships that don't exist in the real world, e.g. red < green < blue or red + green = blue. This type of feature is called nominal and should preferably be one-hot encoded.
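For instance, pandas can one-hot encode a column in one line. A minimal sketch (the shirt_color data is made up for illustration):
import pandas as pd
df = pd.DataFrame({'shirt_color': ['red', 'green', 'blue', 'green']})
# One binary column per color, with no implied ordering between them
df = pd.get_dummies(df, columns=['shirt_color'])
print(df.columns.tolist())  # ['shirt_color_blue', 'shirt_color_green', 'shirt_color_red']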
There are features, however, whose values you might want to map to integers. These are called ordinal. For example, the feature rating, which has values ['bad', 'good', 'excellent', ...]. By mapping these to integers you actually preserve the relationships these values hold in the real world, e.g. bad < good < excellent. There is a catch, however: to do the above, you need to map each value to a specific integer (e.g. we can't map 'good' -> 1, 'bad' -> 2, 'excellent' -> 3, because that doesn't preserve the real-world relationship of these values). LabelEncoder has no way of knowing which integer belongs to which value; it simply assigns integers to the unique values in sorted (alphabetical) order, so even on ordinal variables it most likely won't generate the correct encoding.
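To see the problem, here's what LabelEncoder actually does to these rating values: the classes get sorted alphabetically, so 'excellent' ends up with a smaller integer than 'good':
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# The unique values are sorted alphabetically before integers are assigned
print(le.fit_transform(['bad', 'good', 'excellent']))  # [0 2 1]
print(le.classes_)  # ['bad' 'excellent' 'good']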
How to properly encode ordinal features
The proper way of encoding ordinal variables is to choose the mapping manually. This requires more work and isn't as elegant as a one-liner that encodes all values, but it is the only correct way. Let's see how we can do this in pandas.
custom_mapping = {'bad': 1, 'good': 2, 'excellent': 3}
df['rating'] = df['rating'].map(custom_mapping)
Obviously this needs to be done for each ordinal feature.
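If you'd rather stay within scikit-learn, OrdinalEncoder lets you pass the category order explicitly through its categories parameter. A minimal sketch, reusing the rating example (the toy dataframe is made up for illustration):
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({'rating': ['good', 'bad', 'excellent']})
# Passing the categories explicitly preserves bad < good < excellent
encoder = OrdinalEncoder(categories=[['bad', 'good', 'excellent']])
df['rating'] = encoder.fit_transform(df[['rating']]).ravel()  # [1., 0., 2.]
Note that OrdinalEncoder starts counting at 0 and outputs floats, but the ordering, which is what matters, is preserved, and you can pass one category list per column to encode several ordinal features at once.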
At this point I think it's clear that I strongly recommend against using LabelEncoder, but if you still want to do it, at least do it correctly.
If you still want to use LabelEncoding
While both answers by @ggordon and @Anan Srivastava will do what you want, they don't have much value in practice. The problem is that by not binding the fitted LabelEncoder to a variable, you are losing the mapping from categories to numbers. If you want to predict on future data, you won't know which number to encode each category with.
Expanding upon @ggordon's answer
from sklearn.preprocessing import LabelEncoder

columns_to_be_encoded = [...] # list of column names you want encoded
# Instantiate one encoder per column, so each mapping is kept around
encoders = {column: LabelEncoder() for column in columns_to_be_encoded}
# Fit and apply each encoder to its column
for column in columns_to_be_encoded:
    df[column] = encoders[column].fit_transform(df[column])
This way you have a dictionary of fitted encoders so that you can reuse the same encoding if you wish.
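For example, when new data arrives at prediction time, you can apply the exact same mapping (new_df is a hypothetical dataframe with the same columns):
# Reuse the fitted encoders on future data
for column in columns_to_be_encoded:
    new_df[column] = encoders[column].transform(new_df[column])
Keep in mind that transform raises an error if it encounters a category that wasn't present during fitting, and inverse_transform lets you map the integers back to the original categories.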