I was fitting machine learning models to cleaned data (imputed missing values, removed unnecessary features, etc.). I didn't transform the features that are skewed. Before moving forward, I want to understand how important feature transformation is when fitting data to a model. Any opinions?

(I know what happens in Random Forest, but I can't work out what happens for other ML models.)

jdwins11

1 Answer

Even though some models are robust to how the features are scaled (like Random Forest), in general it is good practice to transform the features in order to get better performance from an ML model.

There are three reasons:

1) Numerical stability: computers cannot represent every number, because the electronics they are built from work in binary (zeros and ones), so they represent real numbers with floating-point arithmetic. In practice, this means that the numerical behavior in the range [0.0, 1.0] is not the same as in the range [1'000'000.0, 1'000'001.0]. Having two features with very different scales can therefore lead to numerical instability, and ultimately to a model that is unable to learn anything.
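
To see this concretely, here is a minimal NumPy sketch (mine, not part of the original answer): the gap between adjacent representable float64 values grows with their magnitude, so an increment that is meaningful on one feature's scale can be invisible on another's.

    import numpy as np

    # Gap between adjacent representable float64 values near 1.0 vs near 1e6:
    print(np.spacing(1.0))          # ~2.2e-16
    print(np.spacing(1_000_000.0))  # ~1.2e-10

    # An increment that is easily representable near zero vanishes near 1e6:
    print(0.0 + 1e-12 == 0.0)                  # False
    print(1_000_000.0 + 1e-12 == 1_000_000.0)  # True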

2) Control of the gradient: imagine one feature that spans the range [-1, 1] and another that spans the range [-1'000'000, 1'000'000]. The loss is far more sensitive to small variations in the weights associated with the second feature, because their gradients are scaled by the feature values themselves, so the gradient is much more variable in the direction described by that feature. This leads to other instabilities: a learning rate (LR) that suits one feature's weights can be too small for the other's (so convergence is slow) or too big (so you jump over the optimal values). At the end of the training process you are left with a sub-optimal model.
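
A toy sketch of this effect (the quadratic loss and the constants are illustrative assumptions, not from the question): gradient descent on a loss whose curvature differs by a factor of a million between two weight directions has no single learning rate that suits both.

    import numpy as np

    # Toy loss 0.5 * sum(c_i * w_i**2); the curvature c_i plays the role of
    # the squared feature scale, so c = [1, 1e6] mimics two features whose
    # ranges differ by a factor of ~1000.
    c = np.array([1.0, 1e6])

    def grad(w):
        return c * w  # gradient of the toy loss

    w = np.array([1.0, 1.0])
    lr = 1e-6  # any lr > 2e-6 makes the second coordinate diverge
    for _ in range(1000):
        w = w - lr * grad(w)

    print(w)  # the second weight reached ~0, the first barely moved from 1.0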

3) Control of the variance of the data: if you have skewed features and you don't transform them, you risk the model simply ignoring the elements in the tails of the distributions. And in some cases, the tails are much more informative than the bulk of the distributions.
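
A small illustration of this last point (the log-normal data and scipy's skewness measure are illustrative choices of mine):

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(0)
    x = rng.lognormal(mean=0.0, sigma=1.5, size=10_000)  # heavily right-skewed

    # The raw feature has a long right tail; a log transform makes it roughly
    # symmetric, so tail elements stop looking like extreme outliers.
    # (np.log is safe here because log-normal samples are strictly positive.)
    print(skew(x))          # large positive skewness
    print(skew(np.log(x)))  # approximately 0

In practice, scikit-learn's PowerTransformer (Box-Cox or Yeo-Johnson) automates this kind of variance-stabilizing transform.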

Vincenzo Lavorini
  • I get the first 2 points you mentioned. Suppose I just scale the data and don't normalize it. If I am using a random forest, it will simply split the feature based on entropy or the Gini index. How will a skewed feature matter here? – jdwins11 Jun 21 '18 at 14:01
  • Indeed, Random Forest & co. are robust against those problems, because they work on clear separations (e.g. "is this feature bigger than this value?"), not on algebraic operations (e.g. matrix multiplications, as in neural networks). So you don't need to scale and/or normalize the features. Sorry if that was not clear in the answer – Vincenzo Lavorini Jun 21 '18 at 14:37
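
To back up that comment with a quick check (a sketch of mine; the dataset and the transform are illustrative, not from the thread), a Random Forest's predictions are unchanged under a strictly monotonic transform of the features:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)

    # A strictly monotonic transform preserves the ordering of values within
    # each feature, so the trees, which only ask "is feature j > threshold?",
    # can induce exactly the same partitions of the data.
    X_mono = np.exp(X)

    rf = RandomForestClassifier(random_state=0).fit(X, y)
    rf_mono = RandomForestClassifier(random_state=0).fit(X_mono, y)

    # Same predictions on the raw and the transformed features:
    print((rf.predict(X) == rf_mono.predict(X_mono)).mean())  # 1.0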