I'm working with tree-based classifiers in scikit-learn (Decision Tree and Random Forest) for a classification task where the feature set is a mix of categorical (the majority) and numerical features. Since scikit-learn's tree models only accept numerical input, I used both LabelEncoder and OneHotEncoder from the framework to transform the categorical features into numerical ones. Comparing the performance metrics for the two, the results were similar, with the LabelEncoded data doing slightly better in processing time, resource consumption, and final accuracy.
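To show concretely what I did, here's a minimal sketch of the comparison (the data here is made-up toy data, and the column names are hypothetical, but the encoding steps mirror my pipeline):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy data: two categorical columns plus one numerical column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=200),
    "shape": rng.choice(["circle", "square"], size=200),
    "size": rng.normal(size=200),
})
y = rng.integers(0, 2, size=200)

# Variant 1: label-encode each categorical column in place
# (each category becomes an integer, which implies an order).
X_label = df.copy()
for col in ["color", "shape"]:
    X_label[col] = LabelEncoder().fit_transform(X_label[col])

# Variant 2: one-hot encode the categorical columns
# (each category becomes its own 0/1 indicator column).
ohe = OneHotEncoder()
X_onehot = np.hstack([
    ohe.fit_transform(df[["color", "shape"]]).toarray(),
    df[["size"]].to_numpy(),
])

clf = RandomForestClassifier(random_state=0)
print("label encoded :", cross_val_score(clf, X_label, y, cv=5).mean())
print("one-hot       :", cross_val_score(clf, X_onehot, y, cv=5).mean())
```

Note the difference in dimensionality: label encoding keeps 3 feature columns, while one-hot expands to 6 (3 colors + 2 shapes + 1 numeric), which is presumably why the label-encoded runs were faster for my mostly-categorical data.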
So my question is: is there anything fundamentally wrong with using LabelEncoder here? A number of posts online recommend against transforming categorical features with LabelEncoder, since it imposes an order on non-ordinal features. Even if it does, does that actually affect tree-based models in any way?
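To make the concern concrete, here's a small illustration of the implicit ordering: LabelEncoder assigns integers by sorted (alphabetical) order of the category labels, so a tree threshold split on the encoded column partitions categories by alphabet rather than by anything meaningful:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["red", "green", "blue", "green"])

# Classes are sorted alphabetically: blue -> 0, green -> 1, red -> 2
print(dict(zip(le.classes_, range(len(le.classes_)))))
print(codes)

# A tree split such as "color_encoded <= 1.5" would then group
# {blue, green} against {red}, a partition driven purely by the
# alphabetical integer assignment.
```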