
I have a dataset with 4,700 records for a classification problem. The two classes are in a 33%/67% proportion.

A few questions:

1) Does this proportion qualify the dataset as imbalanced?

2) Should I do cross-validation and then apply resampling (over-sampling, under-sampling, or SMOTE), or should I first balance my sample with these sampling techniques and then do cross-validation?

3) Why is propensity score matching used mainly in healthcare-related studies and not much in other applications?

4) How is propensity score matching different from other ML classification algorithms?

The Great

1 Answer


You should fit preprocessing transformers (imputation, scalers, encoders, and resampling) on the train set only, and then apply them to both the train and test sets. Your dataset is imbalanced, and you may see some improvement from resampling techniques, though you should always confirm that with cross-validation.
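
Here is a minimal sketch of that workflow, assuming scikit-learn and imbalanced-learn are installed; the synthetic `make_classification` dataset and the logistic-regression model are illustrative stand-ins for the asker's data, not part of the question:

```python
# Minimal sketch: resampling lives inside the imblearn pipeline, so during
# cross-validation SMOTE is fit only on each training fold, and each
# validation fold is scored untouched at its natural class proportion.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # note: imblearn's Pipeline, not sklearn's

# Illustrative stand-in for the 4,700-record, 33%/67% dataset
X, y = make_classification(n_samples=4700, weights=[0.67, 0.33], random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),          # fit on the training fold only
    ("smote", SMOTE(random_state=0)),      # applied to the training fold only
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")
print(scores.mean())
```

Using `imblearn.pipeline.Pipeline` rather than `sklearn.pipeline.Pipeline` is what lets the sampler participate in `fit` while being skipped at prediction time, so no resampled points ever leak into the folds used for scoring.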

Piotr Rarus
  • Hi, thanks for the response. Just to make sure I got it right: you are asking me to apply sampling (over, under, or SMOTE) to the full dataset to achieve balance, and once that is done, do cross-validation. Have I understood it right? – The Great Dec 09 '19 at 12:08
  • upvoted your answer – The Great Dec 09 '19 at 12:08
  • What do you mean by "only to train set and apply them to both train and test respectively" in the sentence above? – The Great Dec 09 '19 at 12:09
  • On the contrary: first you split, fit transformers on the train set, and then apply them to both train and test. That's basically why you need to use the imblearn pipeline instead of the sklearn one. – Piotr Rarus Dec 09 '19 at 12:11
  • Can we break this into simple points? I might be struggling to understand due to language proficiency. – The Great Dec 09 '19 at 12:42
  • 1) Have an imbalanced dataset in a CSV 2) Do preprocessing work like encoders and standard scalers 3) Apply resampling techniques to balance the dataset 4) Split the dataset into 70% train and 30% test 5) Train the model on the train set 6) Evaluate it on the test set – The Great Dec 09 '19 at 12:44
  • Have I got it right? – The Great Dec 09 '19 at 12:45
  • Nope ;P First split, then resample the train set. Don't resample the test set (see the sketch after these comments). – Piotr Rarus Dec 09 '19 at 13:18
  • Hi, okay. But since we aren't resampling the test set, will it contain an equal proportion of classes? Or is your idea that, since we have resampled the train data, the model has seen enough variation in both classes to distinguish them in the test data, irrespective of the class proportions there? Am I right? – The Great Dec 09 '19 at 21:12
  • The over-sampled train set will be balanced. Training a model on imbalanced data results in overfitting to the over-represented class. I'd use stratified splits as well. What metrics will you use to score your models? – Piotr Rarus Dec 10 '19 at 09:47
  • Can you help me with this? https://datascience.stackexchange.com/questions/64756/how-to-select-features-for-a-ml-model – The Great Dec 13 '19 at 10:12
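
To make the corrected order of operations from these comments concrete (split first, resample the train set only, leave the test set untouched), here is a minimal sketch under the same assumptions as above; the synthetic data and the model choice are again illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Illustrative stand-in for the 4,700-record, 33%/67% dataset
X, y = make_classification(n_samples=4700, weights=[0.67, 0.33], random_state=0)

# 1) Split first; stratify so both sets keep the original 33%/67% proportion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 2) Fit preprocessing on the train set only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# 3) Resample the train set only; the test set stays at its natural proportion
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train_s, y_train)

# 4) Train on the balanced train set, evaluate on the untouched test set
model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(model.score(X_test_s, y_test))
```

The test set is never resampled, so the reported score reflects performance at the class proportions the model will actually see in practice.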