
I'm a newbie here and currently self-learning data science. I am working on a dataset that has both categorical and numerical (continuous and discrete) features (26 columns, 30244 rows). The target is numerical (1, 2, 3). I have several questions.

  1. I have not yet performed any encoding or scaling. As far as I know, since my categorical data are unordered, I have to perform one-hot encoding, right? As it will increase the number of columns, I am hoping to do that after feature selection. Is that okay?

  2. How can I perform feature selection for this dataset? (Because this has both numerical and categorical data) Should I first do one-hot encoding and then go for checking correlation or t-scores or something like that?

(I am currently focusing on EDA only. I don't have a model in my mind)

Any help is much appreciated. Thank you!

leahnanno
  • Do let me know whether you are satisfied with the answer. If not, I will do my best to improve it. Please consider accepting the answer if it answers your question. – Devashish Prasad Jun 03 '21 at 15:14
  • Hi @DevashishPrasad, thank you for taking the time to answer. I am satisfied with your answer, as it cleared up many of my doubts :) – leahnanno Jun 03 '21 at 20:14

1 Answer


I have to perform one-hot encoding, right?

Yes

As it will increase the number of columns, I am hoping to do that after feature selection. Is that okay?

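No. Do the basic preprocessing first (for example, dealing with missing values), then encode the categorical data, and only then run feature selection, because most selection methods expect numeric inputs. Beware of nominal vs ordinal features: one-hot encoding suits nominal (unordered) categories, while ordinal categories are usually better mapped to integers that preserve their order.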
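A minimal sketch of that order of operations, using only the standard library so that each step is visible. The toy rows, the column names (`color`, `size`), the mode imputation, and the helper functions are all hypothetical and only for illustration; in practice you would probably use a library such as pandas or scikit-learn for this.

```python
from collections import Counter

# Hypothetical toy rows: "color" is nominal, "size" is ordinal,
# and None marks a missing value.
rows = [
    {"color": "red", "size": "small"},
    {"color": "blue", "size": "large"},
    {"color": None, "size": "medium"},
    {"color": "red", "size": "large"},
]


def impute_mode(rows, column):
    """Replace missing values in `column` with the most frequent observed value."""
    observed = [r[column] for r in rows if r[column] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [dict(r, **{column: mode if r[column] is None else r[column]}) for r in rows]


def one_hot(rows, column):
    """Replace a nominal column with one 0/1 column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new_row[f"{column}_{c}"] = 1 if r[column] == c else 0
        encoded.append(new_row)
    return encoded


def ordinal(rows, column, order):
    """Map an ordinal column to integers that follow the given order."""
    rank = {level: i for i, level in enumerate(order)}
    return [dict(r, **{column: rank[r[column]]}) for r in rows]


# Step 1: handle missing values. Step 2: encode. Feature selection comes after.
clean = impute_mode(rows, "color")
encoded = ordinal(one_hot(clean, "color"), "size", ["small", "medium", "large"])

for row in encoded:
    print(row)
```

After this step every column is numeric, so each one-hot column can be treated as its own candidate feature during selection.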

How can I perform feature selection for this dataset?

There are many ways to perform feature selection. You can use the methods you mentioned, as well as many other methods, such as:

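  1. L1 and L2 regularization (note that L1 can shrink some coefficients to exactly zero, whereas L2 only shrinks them toward zero)
  2. Sequential feature selection
  3. Random forests (via their feature importances)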
  4. More techniques in the blog
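To make one item from the list concrete, here is a rough sketch of sequential (forward) feature selection in plain Python. The toy data, the function names, and the choice of leave-one-out 1-nearest-neighbour accuracy as the score are my own assumptions for illustration; any model and validation scheme could be plugged in as the score instead.

```python
def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the given feature indices."""
    correct = 0
    for i, xi in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        nearest = min(
            others,
            key=lambda j: sum((xi[f] - X[j][f]) ** 2 for f in features),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)


def forward_select(X, y, score):
    """Greedily add the feature that improves the score most;
    stop as soon as no remaining feature strictly improves it."""
    remaining = list(range(len(X[0])))
    selected, best = [], 0.0
    while remaining:
        top_score, top_feature = max(
            (score(X, y, selected + [f]), f) for f in remaining
        )
        if top_score <= best:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Hypothetical, already-encoded toy data: column 0 separates the classes,
# column 1 is noise, column 2 is a 0/1 dummy unrelated to the target.
X = [
    [0.0, 5.0, 1.0], [0.1, 1.0, 0.0], [0.2, 9.0, 1.0],
    [1.0, 4.0, 0.0], [1.1, 8.0, 1.0], [1.2, 2.0, 0.0],
    [2.0, 7.0, 1.0], [2.1, 3.0, 0.0], [2.2, 6.0, 1.0],
]
y = [1, 1, 1, 2, 2, 2, 3, 3, 3]

selected, best = forward_select(X, y, loo_1nn_accuracy)
print(selected, best)
```

On this toy data only column 0 is kept, because adding either of the other columns does not improve the score any further. Distance-based scores like this one are sensitive to feature scale, so on real data you would scale the features first.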

Should I first do one-hot encoding and then go for checking correlation or t-scores or something like that?

There is a great answer on this issue here.

Devashish Prasad
  • Thank you for your answer! – leahnanno May 30 '21 at 19:26
  • @leahnanno, you're welcome. If it answers your question, you can accept the answer :) – Devashish Prasad May 31 '21 at 07:00
  • I think your answer needs further clarification on how correlation is treated for categorical variables. I think your knowledge of categorical data is limited. Take a look at Agresti's book on categorical data analysis and you will realise how shallow your knowledge is. – mnm May 31 '21 at 13:33
  • Thanks for pointing that out @mnm. I will surely try to improve my answer. – Devashish Prasad May 31 '21 at 13:45
  • @DevashishPrasad, can you please take a look at https://stats.stackexchange.com/questions/587252/role-of-onehotencoding-in-feature-selection? I believe my question is similar to this one. – Encipher Aug 30 '22 at 19:00