I have a doubt: should I perform outlier analysis and normalization even on the target variable, which is continuous?
Kaggle competitions are won by doing precise outlier analysis of y compared to X. For example, we calculate y_hat based on LightGBM. Then we delete all y points more than two standard deviations from y_hat and retrain. Repeat several times. – keiv.fly Mar 03 '19 at 11:09
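The iterative trimming described in the comment above can be sketched as follows. This is a hedged illustration, not the commenter's actual code: `iterative_outlier_trim` is a hypothetical helper, and a plain linear fit stands in for LightGBM.

```python
import numpy as np

def iterative_outlier_trim(X, y, n_rounds=3, n_sigma=2.0):
    """Repeatedly fit a model, drop points whose residual from y_hat
    exceeds n_sigma standard deviations, and refit on the rest."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    for _ in range(n_rounds):
        # A simple linear fit stands in for LightGBM here.
        slope, intercept = np.polyfit(X, y, 1)
        y_hat = slope * X + intercept
        residuals = y - y_hat
        keep = np.abs(residuals) < n_sigma * residuals.std()
        if keep.all():
            break
        X, y = X[keep], y[keep]
    return X, y
```

With a single injected outlier, the first round removes it and later rounds trim only the tails of the ordinary noise.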
1 Answer
You should do outlier analysis of your target variable to prepare your training data for the model. Most models perform better on less noisy data, as outliers might skew the model's findings in one direction.
Generally, there is no need to normalize the target variable for model performance or accuracy (though it might be useful to do some analysis on the target variable to get useful business insights out of it).
The reasons for normalizing input variables are as follows:
1) Feature scaling improves the convergence of steepest-descent algorithms
2) It helps avoid situations where variables with large magnitudes dominate the other variables
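A minimal sketch of both points: z-scoring puts every column on a comparable scale, so no single feature dominates by magnitude. The `standardize` helper below is illustrative, not from any particular library.

```python
import numpy as np

def standardize(X):
    """Z-score each column so features with large units (e.g. income in
    dollars) no longer dominate small-unit features (e.g. age in years)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two features on wildly different scales...
X = np.array([[1.0, 100_000.0],
              [2.0, 200_000.0],
              [3.0, 300_000.0]])

# ...end up with mean 0 and standard deviation 1 in every column.
Z = standardize(X)
```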
If you normalize the target variable, the MSE is simply rescaled in turn, so there is no impact on the results.
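A quick numeric check of this claim, using min-max normalization and made-up values: the normalized MSE is just the original MSE divided by the square of the scaling factor, so the ranking of models (and the minimizer) is unchanged.

```python
import numpy as np

y = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 310.0])

mse = np.mean((y - y_hat) ** 2)  # MSE in original units

# Min-max normalize both arrays with the same parameters (range of y).
scale = y.max() - y.min()
mse_norm = np.mean(((y - y.min()) / scale - (y_hat - y.min()) / scale) ** 2)

# The normalized MSE is the original MSE divided by scale**2.
assert np.isclose(mse_norm, mse / scale ** 2)
```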
Cases when you might choose to normalize the target variable:
Neural networks update node weights via backpropagation based on the error produced by the cost function; large errors can cause drastic weight changes and make the learning process unstable, in which case the optimizer may fail to settle at an optimal minimum.
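If you do standardize the target for a neural network, remember to invert the transform on the predictions. A minimal sketch, with hypothetical helper names:

```python
import numpy as np

def scale_target(y):
    """Standardize the target so initial losses stay small, and return
    the parameters needed to invert predictions later."""
    y = np.asarray(y, dtype=float)
    mu, sigma = y.mean(), y.std()
    return (y - mu) / sigma, mu, sigma

def unscale_predictions(y_pred_scaled, mu, sigma):
    """Map scaled-space predictions back to the original units."""
    return y_pred_scaled * sigma + mu
```

Training happens entirely in the scaled space; only the final predictions are mapped back.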
The other time you might want to normalize the target is the case of floating-point overflow. Sometimes a number is too large or too small for the floating-point representation to handle, so it turns into INF or wraps to the other extreme.
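A small demonstration of the overflow case, assuming the target is stored as `float32` (whose maximum magnitude is roughly 3.4e38):

```python
import numpy as np

# A raw target beyond float32's range overflows to inf.
with np.errstate(over="ignore"):
    raw = np.float32(1e38) * np.float32(10.0)  # exceeds float32 max -> inf

# Dividing the target by a constant first keeps it representable.
scale = 1e38
rescaled = np.float32(1e38 / scale) * np.float32(10.0)  # finite: 10.0
```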

Outlier analysis might help focus the model on the 'more interesting' business cases; it depends on the problem one is aiming to solve. It is not correct to state that outlier analysis is not needed. – yoav_aaa Mar 03 '19 at 09:19
Preet, the first sentence in your answer is still very misleading. There may be different reasons to perform outlier analysis (deciding on a modeling strategy, preprocessing, etc.). As a general practice it makes a lot of sense to perform such analysis. – yoav_aaa Mar 03 '19 at 09:26
Hi Preet, thanks for your inputs, but when I checked the link "https://stats.stackexchange.com/questions/111467/is-it-necessary-to-scale-the-target-value-in-addition-to-scaling-features-for-re", the last comment says that it's necessary to scale the target variable as well, so I'm a bit confused about which answer to conclude with. – Navneeth Mar 03 '19 at 12:54
This answer is incomplete. For example, if you use linear regression with OLS, a decision tree, or a decision tree ensemble, you do not have to scale your target variable (though nothing bad would happen if you scaled it). But if you do regression with a neural network, you definitely do need to normalize or standardize the target variable. Otherwise, the calculated loss will be far too high, which would cause very aggressive weight updates during backpropagation and would lead to an unstable model. – georg-un Jul 31 '19 at 11:45