
I am interested in learning what routine others use (if any) for Feature Reduction/Selection.

For example, if my data has several thousand features, I typically try two to four of the following steps right away, depending on circumstances (rough R sketches for each step follow the list).

  1. Zero variance/Near zero variance

    • Using the nearZeroVar (nzv) function from the R package caret (sketch 1 after the list).
    • I find a very small percentage of features are zero variance, and a few more are near zero variance.
    • Then, using nzv$percentUnique, I may remove the bottom quartile of features, depending on the range of percentUnique values.
  2. Correlation to find multicollinearity

    • I compute the correlation matrix and remove one feature from each pair with correlation > 0.75 (sketch 2 after the list).
    • I have seen others use cutoffs of 0.5 or 0.6, but I don't have any references for those choices.
  3. Boruta / Random Forest

    • I love the Boruta package, but it takes a while to run (sketch 3 after the list).
    • Here again, I follow up with forward feature selection on the surviving features.
  4. PCA

    • Depending on the nature of the data, I will try PCA last (sketch 4 after the list).
    • If the model must be explainable, then I skip this step.
    • I may use several criteria for how many components to keep: 80, 90, or 95% of variance explained.
    • Forward feature selection: look at roughly the first 3 to 10 orthogonal components.
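
Sketch 1, zero/near-zero variance: a minimal example of step 1 using caret's nearZeroVar. The data frame `df` is a made-up stand-in for real data, and the bottom-quartile cutoff on percentUnique is just one possible choice.

```r
library(caret)

## Hypothetical toy data: `df` stands in for a real predictor table
set.seed(42)
df <- data.frame(matrix(rnorm(1000 * 5), ncol = 5))
df$low_card <- sample(1:3, 1000, replace = TRUE)  # few unique values
df$constant <- 1                                  # zero variance

## saveMetrics = TRUE returns per-feature diagnostics rather than indices
nzv <- nearZeroVar(df, saveMetrics = TRUE)
head(nzv)  # columns: freqRatio, percentUnique, zeroVar, nzv

## Drop zero-/near-zero-variance features, then prune the bottom
## quartile of percentUnique among the rest
keep <- !nzv$zeroVar & !nzv$nzv &
  nzv$percentUnique >= quantile(nzv$percentUnique, 0.25)
df2 <- df[, keep]
```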
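
Sketch 2, correlation filter: caret's findCorrelation does the pairwise pruning in step 2. This assumes the `df2` left over from sketch 1; the 0.75 cutoff mirrors the threshold above.

```r
library(caret)

## Assumes `df2` from sketch 1 (all-numeric predictors)
corr_mat <- cor(df2)

## findCorrelation returns column indices to drop so that no remaining
## pairwise correlation exceeds the cutoff
drop_idx <- findCorrelation(corr_mat, cutoff = 0.75)
if (length(drop_idx) > 0) df2 <- df2[, -drop_idx]
```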
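
Sketch 3, Boruta: the outcome `y` here is invented purely for illustration. On real data this is the slow step, since Boruta refits a random forest many times.

```r
library(Boruta)

## Hypothetical outcome `y`; Boruta tests each feature's random-forest
## importance against shuffled "shadow" copies of the features
set.seed(42)
y <- factor(sample(c("yes", "no"), nrow(df2), replace = TRUE))

bor <- Boruta(x = df2, y = y, maxRuns = 100)
print(bor)

## Keep only features confirmed important; tentative ones are dropped here
confirmed <- getSelectedAttributes(bor, withTentative = FALSE)
df3 <- df2[, confirmed, drop = FALSE]
```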
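
Sketch 4, PCA: applied to the correlation-filtered set rather than Boruta's output, since I treat PCA as an alternative final step. The 90% threshold is one of the criteria listed above.

```r
## PCA on the retained features; always center and scale first
pca <- prcomp(df2, center = TRUE, scale. = TRUE)

## Cumulative proportion of variance explained by each component
cum_var <- summary(pca)$importance["Cumulative Proportion", ]

## Smallest number of components reaching the chosen threshold (e.g. 90%)
k <- which(cum_var >= 0.90)[1]
scores <- pca$x[, 1:k, drop = FALSE]  # use these as model inputs
```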

NOTE: I am not suggesting this is the best or worst routine, but I'm opening the floor to civil debate. If you need a definition of civil debate, see Wikipedia.

mccurcio
  • Honest question, but Stack Exchange sites like datascience.se are not for debating things but for objectively defined questions and answers – Nikos M. Aug 22 '20 at 06:30
  • @nikos-m After thinking about your post, I remembered this: https://datascience.stackexchange.com/questions/34357/why-do-people-prefer-pandas-to-sql Granted, it is not a code-related question like most on S.E., but it is on topic. I'm torn... – mccurcio Aug 22 '20 at 17:58
  • This older post discusses the merits of procedures: https://datascience.stackexchange.com/questions/14864/fix-first-two-levels-of-decision-tree The question becomes, 'Are people willing to broaden the scope here?' – mccurcio Aug 22 '20 at 18:13
