Is it good practice to convert columns with a number to a range between 0 and 1?

Question

I'm relatively new to data science. I heard something about converting columns that contain integers into a range between 0 and 1. I think the reasoning was that all the columns would then be more similar in their range. I think there might also have been a step of removing outliers (very high integers) so that they wouldn't cause all the other values to be squeezed into low fractions.

Is this accurate?

If yes, is there an easy command to make it happen with a Pandas dataset?

score 3 · Accepted Answer · answered Mar 05 '20 at 17:46

This transformation is called min-max scaling. It is often referred to as normalization (not to be confused with standardization, which subtracts the mean and divides by the standard deviation).

scikit-learn provides MinMaxScaler for this (see the scikit-learn documentation). Here is an example adapted from "Introduction to Machine Learning with Python" by Müller and Guido:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()

# Split first, so the scaler never sees the test data
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

scaler = MinMaxScaler()

# Learn each feature's min and max from the training data only
scaler.fit(X_train)

# Apply the same training-derived scaling to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

(Side note: remember to fit the scaler only on the training data, not on the test data!)
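For the pandas part of the question, here is a minimal sketch of the same min-max formula applied directly to a DataFrame. The data and column names are made up for illustration, and the sketch assumes every column is numeric and not constant (a constant column would divide by zero and produce NaN).

```python
import pandas as pd

# Hypothetical example data; the column names are made up for illustration.
df = pd.DataFrame({
    "age": [20, 35, 50, 65],
    "income": [20000, 40000, 60000, 100000],
})

# Min-max scaling by hand: (x - min) / (max - min), computed per column.
col_min = df.min()
col_max = df.max()
df_scaled = (df - col_min) / (col_max - col_min)

print(df_scaled)
```

If you have a train/test split, compute col_min and col_max on the training rows only and reuse them for the test rows, for the same reason as in the side note.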

In the book "Python Machine Learning", Raschka provides a brief, pragmatic comparison of min-max scaling (normalization) and standardization (the latter means subtracting the mean and dividing by the standard deviation):

Although normalization via min-max scaling is a commonly used technique that is useful when we need values in a bounded interval, standardization can be more practical for many machine learning algorithms. The reason is that many linear models, such as the logistic regression and SVM, [...] initialize the weights to 0 or small random values close to 0. Using standardization, we center the feature columns at mean 0 with standard deviation 1 so that the feature columns take the form of a normal distribution, which makes it easier to learn the weights. Furthermore, standardization maintains useful information about outliers and makes the algorithm less sensitive to them in contrast to min-max scaling, which scales the data to a limited range of values.

If you make all your training data have values between 0 and 1 and then the test data has a value of 500 in the same column, won't that be a problem? — Philip Kirkbride, Mar 05 '20 at 18:09
@PhilipKirkbride Yes, but that is not what I meant: you need to apply the Scaler on both, the train and test data. But it is important to fit it only on the train data. That is why in the above example it says scaler.fit(X_train) (fitting on train data) and then X_train_scaled = scaler.transform(X_train) (applying to train data) and X_test_scaled = scaler.transform(X_test) (applying to test data). But there is nothing like scaler.fit(X_test). But nevertheless you are correct that the test data will not be scaled to have a min of 0 and a max of 1. — Jonathan, Mar 05 '20 at 18:14
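To make the point from these comments concrete, here is a small sketch with made-up numbers: the scaler is fitted on training values between 0 and 100, so a test value of 500 lands well outside the 0-1 range. The clip=True option shown at the end is available in recent scikit-learn versions; whether clipping is appropriate depends on your problem, so treat it as one option rather than a recommendation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data: training values span 0 to 100.
X_train = np.array([[0.0], [50.0], [100.0]])
X_test = np.array([[25.0], [500.0]])

# Fit on the training data only, then transform the test data.
scaler = MinMaxScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled.ravel())  # 25 maps inside [0, 1]; 500 maps far above 1

# Optionally clip transformed values to the fitted range.
clipping_scaler = MinMaxScaler(clip=True).fit(X_train)
X_test_clipped = clipping_scaler.transform(X_test)
print(X_test_clipped.ravel())
```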

score 3 · Answer 2 · answered Mar 05 '20 at 17:57

I think there is some confusion with the quantile transformer: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html#sklearn.preprocessing.QuantileTransformer It also scales the values to a 0-1 range, but the goal is to get a uniform distribution. It serves a different purpose than the min-max scaler, in my opinion. Notably, it can be helpful if you have outliers.

For a broader overview of what the different scalers do: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py
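As a rough illustration of the difference, the sketch below runs both transformers on a made-up column with one extreme outlier. The data and the choice of n_quantiles (set to the number of samples here) are assumptions for the example only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer

# Made-up data: nine ordinary values and one extreme outlier.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

# Min-max scaling: the outlier becomes 1 and everything else is squeezed near 0.
minmax = MinMaxScaler().fit_transform(x).ravel()

# Quantile transform: values are mapped by rank, so they spread over [0, 1].
qt = QuantileTransformer(n_quantiles=len(x), output_distribution="uniform",
                         random_state=0)
quantile = qt.fit_transform(x).ravel()

print(np.round(minmax, 3))
print(np.round(quantile, 3))
```

The min-max output keeps the relative distances (so the outlier dominates), while the quantile output keeps only the ordering.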

score 1 · Answer 3 · answered Mar 05 '20 at 17:19

Firstly, the way that you decide to transform your variables should be dependent on the purpose that you are using them for.

Generally, I would not recommend doing what you stated.

However, something that is commonly done to deal with the problem of variables not being in a 'similar range' is normalization (this particular form is more precisely called standardization, or z-scoring). To normalize, you subtract the mean from each value and divide by the standard deviation. This results in variables that all have mean 0 and standard deviation 1.

fractalnature


Does that mean you end up with negative values? Since if you subtract from a value of 0 you'd have a negative? – Philip Kirkbride Mar 05 '20 at 17:20
Yep, you'd end up with negative values. – fractalnature Mar 05 '20 at 17:21
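A short pandas sketch of this mean/standard-deviation scaling, using made-up data and column names. One assumption to note: ddof=0 (the population standard deviation) is used so the result matches scikit-learn's StandardScaler; pandas' own default is ddof=1.

```python
import pandas as pd

# Hypothetical example data; the column names are made up for illustration.
df = pd.DataFrame({
    "age": [20, 35, 50, 65],
    "income": [20000, 40000, 60000, 100000],
})

# Subtract each column's mean and divide by its standard deviation.
# ddof=0 is the population standard deviation (pandas defaults to ddof=1).
df_std = (df - df.mean()) / df.std(ddof=0)

print(df_std)
print(df_std.mean())       # approximately 0 for every column
print(df_std.std(ddof=0))  # approximately 1 for every column
```

Values below the column mean come out negative, as discussed in the comments above.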

score 1 · Answer 4 · answered Mar 06 '20 at 05:52

Whether you want to transform your data really depends on the algorithm you are using. Tree-based algorithms (decision trees, random forests, gradient boosting) are scale invariant and thus will not benefit from the transformation. For k-nearest neighbors, on the other hand, you probably want to scale your features; otherwise, features with larger values get disproportionate influence.
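Here is a toy sketch of that difference. The data is made up so that one small-range column separates the classes and one large-range column is noise; it is only meant to illustrate the effect, not to benchmark anything.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data: column 0 (range 0-1) separates the classes,
# column 1 (range 0-800) is noise on a much larger scale.
X_train = np.array([[0.0, 0.0], [0.2, 800.0], [1.0, 500.0], [0.8, 300.0]])
y_train = np.array([0, 0, 1, 1])
X_query = np.array([[0.1, 450.0]])

# Fit the scaler on the training data and apply it to both arrays.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_query_scaled = scaler.transform(X_query)


def predict(model, X, X_new):
    """Fit the model on X (with y_train) and return its prediction for X_new."""
    return model.fit(X, y_train).predict(X_new)[0]


tree_raw = predict(DecisionTreeClassifier(random_state=0), X_train, X_query)
tree_scaled = predict(DecisionTreeClassifier(random_state=0),
                      X_train_scaled, X_query_scaled)

knn_raw = predict(KNeighborsClassifier(n_neighbors=1), X_train, X_query)
knn_scaled = predict(KNeighborsClassifier(n_neighbors=1),
                     X_train_scaled, X_query_scaled)

print("tree:", tree_raw, tree_scaled)
print("1-NN:", knn_raw, knn_scaled)
```

With this data, the tree predicts class 0 both times, while the 1-nearest-neighbor prediction flips from class 1 (raw, dominated by the noisy large-range column) to class 0 (scaled).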
