Cleaning NaNs with averages pre or post split?

Question

I have a column with some NaNs in it and I want to replace those NaNs with the average/median/mode.

Technically, the validation/test data is supposed to be unseen, so how could I include it in the average? That would bias the values.

Do I "fit" the average to my training data only, just like scaling? Or do I take the average using the entire dataset?

ha. just tried to close as duplicate, but won't let me reference the stats.stackexchange link — Kermit, Jan 06 '21 at 01:06

score 2 · Answer 1 · answered Jan 05 '21 at 21:08

Imputing all missing values before the split would cause data leakage: statistics computed with test-set rows would influence the values filled into the training set.

I had a similar problem recently. I ended up using only complete cases of the source data for the test set, and then imputing the training data using medians calculated by class.
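A rough sketch of that approach, in case it helps. The toy rows, the column layout, and the "every third complete row" test selection are all made up for illustration; only the idea (test set from complete cases, per-class medians computed on the training rows) comes from what I described.

```python
import math
from collections import defaultdict
from statistics import median

nan = float("nan")

# Hypothetical toy data: (feature value, class label); NaN marks a missing value.
rows = [
    (1.0, "a"), (3.0, "a"), (nan, "a"), (5.0, "a"),
    (10.0, "b"), (nan, "b"), (14.0, "b"), (12.0, "b"),
]

# Test set: drawn only from complete cases.
# Taking every third complete row is an arbitrary, deterministic choice for the demo.
complete_idx = [i for i, (value, _) in enumerate(rows) if not math.isnan(value)]
test_idx = set(complete_idx[::3])
test = [rows[i] for i in sorted(test_idx)]
train = [row for i, row in enumerate(rows) if i not in test_idx]

# Per-class medians, computed from the observed training values only.
observed = defaultdict(list)
for value, label in train:
    if not math.isnan(value):
        observed[label].append(value)
class_median = {label: median(values) for label, values in observed.items()}

# Fill each missing training value with the median of its own class.
train_imputed = [
    (class_median[label] if math.isnan(value) else value, label)
    for value, label in train
]

print(class_median)
print(train_imputed)
```

Note that imputing by class needs the label, so it cannot be applied as-is to rows whose class is unknown.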

Otherwise, you might split the data first and then impute the test set using the average/median/mode computed on the training set only, the same way you would fit a scaler.
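A minimal sketch of the split-then-impute order, using only the standard library. The helper names `fit_median` and `impute`, the toy column, and the positional split are assumptions for illustration, not anything from your data.

```python
import math
from statistics import median

nan = float("nan")

def fit_median(values):
    """'Fit' step: median of the non-missing values."""
    return median(v for v in values if not math.isnan(v))

def impute(values, fill):
    """'Transform' step: replace each NaN with the given fill value."""
    return [fill if math.isnan(v) else v for v in values]

# Hypothetical column with missing values.
column = [2.0, nan, 4.0, 6.0, nan, 8.0, 100.0, nan]

# Split first. A positional split keeps the demo deterministic;
# real code would normally shuffle before splitting.
split = 6
train, test = column[:split], column[split:]

# Fit on the training part only, then apply the same value to both parts.
fill = fit_median(train)
train_clean = impute(train, fill)
test_clean = impute(test, fill)

# For comparison: fitting on the whole column lets the test value 100.0 shift the median.
leaky_fill = fit_median(column)

print(fill, leaky_fill)
print(test_clean)
```

Here the training-only median is 5.0, while the median over the whole column is 6.0, which is the leakage in miniature. In scikit-learn the same pattern would be `SimpleImputer(strategy="median")` with `fit` on the training data and `transform` on both sets.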
