BACKGROUND: I have a dataset that includes Race (e.g., White, Black) and Ethnicity (e.g., Hispanic, Non-Hispanic) as observed variables. The dataset also includes Race_Ethnicity (e.g., Hispanic White, Non-Hispanic Black) as an engineered variable, if you will. I am wondering whether I should retain the observed variables in my supervised ML model.
The observed variables are obviously correlated with the engineered variable. If I am thinking about this correctly, this is an issue for ML (i.e., the multicollinearity problem), but please correct me if I'm wrong. However, it is possible that Race interacts with yet another (4th) variable, whereas Ethnicity does not. Thus, leaving out Race may be costing me an important boost in performance. (Race_Ethnicity may have a more "muddied" relationship with the 4th variable than Race alone.)
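To make the redundancy concern concrete, here is a minimal sketch on toy data. The values are made up for illustration (not my real dataset), and one-hot encoding with pandas is an assumption about how the variables would be fed to the model. It checks whether the combined design matrix loses rank when Race, Ethnicity, and Race_Ethnicity are all included.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, not the real dataset.
df = pd.DataFrame({
    "Race": ["White", "White", "Black", "Black", "White", "Black"],
    "Ethnicity": ["Hispanic", "Non-Hispanic", "Hispanic",
                  "Non-Hispanic", "Hispanic", "Non-Hispanic"],
})

# Engineered variable: fully determined by the two observed variables.
df["Race_Ethnicity"] = df["Ethnicity"] + " " + df["Race"]

# One-hot encode the observed and engineered variables separately.
X_observed = pd.get_dummies(df[["Race", "Ethnicity"]], dtype=float)
X_engineered = pd.get_dummies(df[["Race_Ethnicity"]], dtype=float)
X_all = pd.concat([X_observed, X_engineered], axis=1)

# If the rank is lower than the number of columns, some columns are
# exact linear combinations of others (perfect multicollinearity).
n_cols = X_all.shape[1]
rank_all = np.linalg.matrix_rank(X_all.to_numpy())

print(f"columns: {n_cols}, rank: {rank_all}")
print("redundant columns present:", rank_all < n_cols)
```

On this toy example the rank comes out lower than the column count, because every Race and Ethnicity indicator is a sum of Race_Ethnicity indicators. This only demonstrates the redundancy; it does not by itself say whether dropping the observed variables helps or hurts a given model.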
QUESTION: What to do, y'all? Should they (the observed variables) stay or should they go?