I'm working on a sample project and one of the features is the job description of a person (categorical, for example: blue-collar, retired, unknown, unemployed, student, etc.). Since more job descriptions could appear in the future, I don't think one-hot encoding is the best approach. How would you encode it without using one-hot encoding?
- What's the goal of the project? If it's a sample project and you already have a dataset in hand, why worry about additional values in the future? – zachdj Mar 23 '20 at 16:49
3 Answers
Solution 1: Target Encoding Using Weight of Evidence
Weight of evidence (WoE) would be a good candidate for this scenario. When you have your training data, calculate the weight of evidence on it as follows:
Calculate the number of events and non-events in each group (bin).
Calculate the % of events and % of non-events in each group.
Calculate WoE by taking the natural log of the ratio: WoE = ln(% of non-events / % of events).
This also takes care of high cardinality. **Be very cautious while using weight of evidence, as it tends to lead to overfitting.** A minimal sketch of the computation follows.
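Here is one way it could look with pandas; the column names ("job", "y"), the binary 0/1 target, and the eps smoothing term are illustrative assumptions, not part of the original answer:

import numpy as np
import pandas as pd

def woe_encode(df, cat_col, target_col, eps=0.5):
    # Events and non-events per category; eps avoids log(0) on pure bins
    grouped = df.groupby(cat_col)[target_col].agg(["sum", "count"])
    events = grouped["sum"] + eps
    non_events = grouped["count"] - grouped["sum"] + eps
    # Share of all events / non-events falling in each category
    pct_events = events / events.sum()
    pct_non_events = non_events / non_events.sum()
    # WoE = ln(% of non-events / % of events)
    return np.log(pct_non_events / pct_events).to_dict()

df = pd.DataFrame({
    "job": ["student", "retired", "blue-collar", "student", "unknown"],
    "y":   [1, 0, 0, 1, 0],
})
woe_map = woe_encode(df, "job", "y")
df["job_woe"] = df["job"].map(woe_map)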
Later, when you receive new classes, since you don't know anything about them you can apply the average weight of evidence so that your prediction function doesn't fail, as sketched below.
Once you have collected enough data for the new classes, you can retrain your model to accommodate them.
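Continuing the sketch above (falling back to the mean WoE is one assumed reading of "average weight of evidence"):

# Map unseen categories to the mean WoE so prediction doesn't fail
default_woe = float(np.mean(list(woe_map.values())))
new_df = pd.DataFrame({"job": ["astronaut", "student"]})  # "astronaut" was never seen in training
new_df["job_woe"] = new_df["job"].map(woe_map).fillna(default_woe)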
Solution 2: Use Sentence Embeddings
You may encode the job description with an NLP model such as BERT or pretrained word vectors (e.g., trained on Wikipedia), mapping it into a feature space of, say, 50 dimensions. This also introduces some contextual understanding and works even for new classes. The only problem will be the increase in dimensionality.
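As a sketch, using the sentence-transformers library (the library choice and model name are my assumptions; any sentence-embedding model would work, and you could reduce the output to ~50 dimensions with PCA):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # outputs 384-dim vectors
jobs = ["blue-collar", "retired", "unknown", "unemployed", "student"]
# One fixed-size vector per category; a brand-new job title can be
# embedded the same way at prediction time, with no refit needed
embeddings = model.encode(jobs)  # shape: (5, 384)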

1) Try LabelEncoder (the fast way, but not the best way):
# Assuming your data is in a pandas DataFrame
def encode_to_num_df(df):
    from sklearn.preprocessing import LabelEncoder
    # fit_transform re-fits the encoder for each column,
    # replacing every string with an integer code
    df = df.apply(LabelEncoder().fit_transform)
    return df
2) Custom encoding for each label – the best way, but the hardest (credit to @Djib2011 for the detailed explanation):
my_mapping = {'bad': 1, 'worse': 2, 'worst': 3}
df['feature'] = df['feature'].map(my_mapping)
Source:

I would first start by using something like scikit-learn's LabelEncoder. Examples are in the documentation.
EDITED: With this option, you convert every string to an integer from 0 to k-1.
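A quick sketch (the job values are illustrative):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["student", "retired", "student", "unknown"])
# classes_ is sorted alphabetically: ['retired', 'student', 'unknown'],
# so codes == array([1, 0, 1, 2])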
You can also think about converting the strings into hash values.
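One way to do that (a sketch; scikit-learn's FeatureHasher is my choice here, not the answer's): hashing maps any string, including future job titles, to a fixed-size vector without refitting.

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=8, input_type="string")
# Each sample is an iterable of strings; an unseen value still hashes fine
X = hasher.transform([["student"], ["retired"], ["space-miner"]])
print(X.toarray())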

- LabelEncoder outputs values between 0 and k-1, not 0 and 1. The problem with this is that by giving some class a greater value than another class, the model might be more biased towards the bigger value, which is what I'm trying to avoid. – shulito Apr 29 '19 at 13:29
- The problem with LabelEncoder pointed out by @shulito is real. Apart from that, an algorithm could learn that the sum of two categories is the same as a third category, which would make absolutely no sense. I always treat label encoding with great care! It looks like a simple and efficient approach, but it can be dangerous! – 89f3a1c Oct 23 '19 at 22:38
- Yes, my bad; after a few months I don't know why I wrote 0 to 1. I will edit the answer. You are right that you might end up with a biased model, but I have had nice surprises in the past trying it anyway. – eetuko Oct 25 '19 at 13:46
- LabelEncoder is intended for the target column; OrdinalEncoder is the equivalent for features. However, as the name suggests, you're creating an artificial order between the levels of the feature that doesn't exist. – Blenz Feb 24 '22 at 15:25