
BACKGROUND: I have a dataset that includes Race (e.g., White, Black) and Ethnicity (e.g., Hispanic, Non-Hispanic) as observed variables. The dataset also includes Race_Ethnicity (e.g., Hispanic White, Non-Hispanic Black) as an engineered variable, if you will. I am wondering whether I should retain the observed variables in my supervised ML model.

The observed variables are obviously correlated with the engineered variable; in fact, Race_Ethnicity is fully determined by Race and Ethnicity together. If I am thinking about this correctly, that is an issue for ML (i.e., the multicollinearity problem), but please correct me if I'm wrong. However, it may be that Race interacts with yet a 4th variable, whereas Ethnicity does not. Thus, leaving out Race may be costing me an important boost in performance. (Race_Ethnicity may have a more "muddied" relationship with the 4th variable than Race alone.)
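To make the redundancy concrete, here is a minimal sketch on made-up toy rows (the data and the helper names `one_hot` and `is_interaction` are hypothetical, not from my real dataset). It assumes Race_Ethnicity is built by simply concatenating the two observed labels, and checks that each one-hot column of Race_Ethnicity is just the product of one Race dummy and one Ethnicity dummy.

```python
from itertools import product

# Hypothetical toy rows; the category labels follow the examples above.
races = ["White", "Black"]
ethnicities = ["Hispanic", "Non-Hispanic"]
rows = [{"Race": r, "Ethnicity": e} for r, e in product(races, ethnicities)] * 3

# Assumption: the engineered column is a plain concatenation of the observed ones.
for row in rows:
    row["Race_Ethnicity"] = f'{row["Ethnicity"]} {row["Race"]}'


def one_hot(values):
    """Return {level: [0/1 indicator per row]} for a list of category labels."""
    levels = sorted(set(values))
    return {lvl: [int(v == lvl) for v in values] for lvl in levels}


race_oh = one_hot([r["Race"] for r in rows])
eth_oh = one_hot([r["Ethnicity"] for r in rows])
combo_oh = one_hot([r["Race_Ethnicity"] for r in rows])


def is_interaction(combo_level):
    """Check that a Race_Ethnicity dummy equals a Race dummy times an Ethnicity dummy."""
    eth, race = combo_level.rsplit(" ", 1)
    expected = [a * b for a, b in zip(race_oh[race], eth_oh[eth])]
    return combo_oh[combo_level] == expected


# Every engineered dummy is exactly a Race x Ethnicity interaction term.
all_interactions = all(is_interaction(lvl) for lvl in combo_oh)

# Conversely, a Race dummy can be rebuilt by summing the engineered dummies
# that contain that race, so the observed column adds no new information.
white_rebuilt = [
    a + b
    for a, b in zip(combo_oh["Hispanic White"], combo_oh["Non-Hispanic White"])
]
race_recoverable = white_rebuilt == race_oh["White"]

print(all_interactions, race_recoverable)
```

If both checks hold on the real data too, then keeping all three variables gives a linear model exactly collinear columns, while a model that can learn interactions on its own (e.g., a tree ensemble) could in principle rebuild Race_Ethnicity from Race and Ethnicity.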

QUESTION: What to do, y'all? Should they (the observed variables) stay or should they go?

  • I would go with using the observed variables only. That way, the model can find and show in its output clearer relationships between features, generalise better, and be more interpretable. Adding the engineered ones would just be a nuisance. – 20-roso Dec 09 '22 at 13:57
  • @20roso, so the entire discipline of feature engineering is a waste? – Snehal Patel Dec 09 '22 at 19:11
  • In feature engineering you typically do feature selection or feature extraction (e.g. PCA). In your case the features are just concatenated, which is neither. By the same logic, why not concatenate all the features together and have just one? You get my point. Anyway, that's my opinion; in the end you will be the judge, it's your dataset and you know it best. – 20-roso Dec 10 '22 at 14:48
