When doing regression or classification with a categorical attribute that has $n$ possible values, there are two options:

  1. Feed this attribute directly into your model.
  2. Partition your data into $n$ pieces based on the categorical attribute and train a model for each separately. During inference choose the model appropriately based on the same attribute.

One of the advantages of approach #2 is that it allows you to do more specific feature engineering. E.g. if you are modeling property prices and you decide to build separate models for residential and industrial properties, you can choose features that are relevant to each.

Another advantage of approach #2 I can think of is that it can linearize otherwise non-linear relations. E.g. for a residential property, having a railroad track nearby almost always heavily reduces the property value, while for an industrial property it could be a massive value booster.
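To make the railroad example concrete, here is a minimal sketch on synthetic data. The numbers (a rail effect of -30 for residential and +40 for industrial) and the names `is_industrial`, `near_rail` and `fit_linear` are made up for illustration. It suggests that a single linear model without an interaction averages the two opposite effects, while either separate models or a single model with an interaction term recovers both.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
is_industrial = rng.integers(0, 2, size=n)  # 0 = residential, 1 = industrial
near_rail = rng.integers(0, 2, size=n)      # 1 = railroad track nearby

# Assumed synthetic effect: rail lowers residential prices, raises industrial ones.
rail_effect = np.where(is_industrial == 1, 40.0, -30.0)
price = 100 + 20 * is_industrial + rail_effect * near_rail + rng.normal(0, 1, size=n)


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, coefficients...]."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


# Approach #1, no interaction: a single rail coefficient has to serve both groups.
coef_plain = fit_linear(np.column_stack([is_industrial, near_rail]), price)

# Approach #2: one model per property type.
res = is_industrial == 0
coef_res = fit_linear(near_rail[res].reshape(-1, 1), price[res])
coef_ind = fit_linear(near_rail[~res].reshape(-1, 1), price[~res])

# Approach #1 with an interaction term: one model, group-specific rail effect.
X_int = np.column_stack([is_industrial, near_rail, is_industrial * near_rail])
coef_int = fit_linear(X_int, price)
rail_effect_res = coef_int[2]
rail_effect_ind = coef_int[2] + coef_int[3]

print("rail effect, single model without interaction:", coef_plain[2])
print("rail effect, separate models:", coef_res[1], coef_ind[1])
print("rail effect, single model with interaction:", rail_effect_res, rail_effect_ind)
```

With only these two binary features, the interaction model and the two separate models give the same rail effects, so in this toy setting the linearization does not strictly require partitioning the data.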

In general, what factors go into deciding between approach #1 and #2?

orlp

1 Answer

I've tried approach #2 several times, but it has never proved better than approach #1.

I think the reason is that the more data you feed to a model, the better. The disadvantage of approach #2 is that each of its models is trained on less data than the single model in approach #1.

In addition, some features might have the same effect regardless of the group. For instance, when modelling property prices, being in the city centre always increases the price, for both residential and industrial properties.
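As a rough illustration of the two points above, here is a small simulation on synthetic data. The coefficient values, the sample sizes and the `centrality` feature are assumptions chosen for the example. It estimates a shared effect many times, once with a pooled model that includes a group indicator and once with a model trained on one group only, and compares how much the estimates vary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_group, n_trials = 20, 500
n_total = 2 * n_per_group
pooled_est, split_est = [], []

for _ in range(n_trials):
    group = np.repeat([0, 1], n_per_group)  # 0 = residential, 1 = industrial
    centrality = rng.normal(size=n_total)   # hypothetical "closeness to centre" score
    # Assumed truth: centrality adds 5 per unit in BOTH groups.
    price = 50 + 10 * group + 5 * centrality + rng.normal(size=n_total)

    # Approach #1: one model on all rows, with a group indicator.
    X_pooled = np.column_stack([np.ones(n_total), group, centrality])
    pooled_est.append(np.linalg.lstsq(X_pooled, price, rcond=None)[0][2])

    # Approach #2: a model trained on the residential rows only.
    mask = group == 0
    X_split = np.column_stack([np.ones(n_per_group), centrality[mask]])
    split_est.append(np.linalg.lstsq(X_split, price[mask], rcond=None)[0][1])

pooled_std = float(np.std(pooled_est))
split_std = float(np.std(split_est))
print("spread of the centrality estimate, pooled model:", pooled_std)
print("spread of the centrality estimate, per-group model:", split_std)
```

Both estimates are centred on the true value, but the per-group estimate is noisier because it sees half the rows. This only holds as long as the effect really is shared between the groups.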

Let me discuss two of the main models used for tabular data:

  • Tree-based models will already do the feature engineering that you described in your first point. The model will split on residential/industrial if that contributes to the gain, and it will then keep making splits specific to each group.
  • Linear models: mixed models, which generalize ordinary linear models, do roughly what you mentioned in your second point while keeping some shared structure that lets them acknowledge that the city center is more expensive.
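The tree-based point can be sketched by hand with a variance-reduction split criterion, which is one common choice of gain and not necessarily what a given library uses. The data is synthetic and the helper names `sse` and `split_gain` are made up for this example. On this data the property-type split has by far the larger gain, and the rail effect then has opposite signs in the two child nodes.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
is_industrial = rng.integers(0, 2, size=n)  # 0 = residential, 1 = industrial
near_rail = rng.integers(0, 2, size=n)

# Assumed synthetic effect: rail is bad for residential, good for industrial.
rail_effect = np.where(is_industrial == 1, 40.0, -30.0)
price = 100 + 20 * is_industrial + rail_effect * near_rail + rng.normal(0, 1, size=n)


def sse(y):
    """Sum of squared errors around the mean (0 for an empty node)."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0


def split_gain(feature, y):
    """Reduction in SSE from splitting a node on a binary feature."""
    left, right = y[feature == 0], y[feature == 1]
    return sse(y) - sse(left) - sse(right)


# Root node: which binary feature gives the larger gain?
gain_type = split_gain(is_industrial, price)
gain_rail = split_gain(near_rail, price)

# Second level: within each property type, measure the effect of the rail split.
child_effects = {}
for g in (0, 1):
    in_node = is_industrial == g
    y, rail = price[in_node], near_rail[in_node]
    child_effects[g] = float(y[rail == 1].mean() - y[rail == 0].mean())

print("gain of the property-type split:", gain_type)
print("gain of the rail split:", gain_rail)
print("rail effect per child node:", child_effects)
```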

That being said, if you have very different categories, it might be worth splitting the dataset. It's just a matter of trying.

David Masip