
I read, for example, in the answer "Does the performance of GBM methods profit from feature scaling?" that scaling doesn't affect the performance of any tree-based method: not LightGBM, XGBoost, CatBoost, or even a plain decision tree.

When I compare the RMSE of an XGBoost model with and without min-max scaling, I get a better RMSE value with the scaling. Here is the code:

from math import sqrt

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

data = pd.read_excel(r'C:...path.xlsx')
X = data.drop(['colA'], axis=1)
y = data['colA']

scaler = MinMaxScaler()
scaler.fit(X)
minmax_scaled_X = scaler.transform(X)

y = np.array(y).reshape(-1, 1)
scaler.fit(y)  # the same scaler is refit on y, so the target is min-max scaled too
minmax_scaled_y = scaler.transform(y)

xtrain, xtest, ytrain, ytest = train_test_split(
    minmax_scaled_X, minmax_scaled_y, test_size=0.3, random_state=0, shuffle=True)

xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.7,
                          learning_rate=0.05, max_depth=8, min_child_weight=4,
                          n_estimators=600, subsample=0.7)

xg_reg.fit(xtrain, ytrain)
preds = xg_reg.predict(xtest)
rmse = sqrt(MSE(ytest, preds))
print(rmse)

The RMSE with min-max scaling is about 0.003, while without it the RMSE is about 3.8. I did the same with a simple decision tree and always got a better result with min-max scaling.

Where is my mistake? In other posts, like the one linked above, the answers say that scaling is not useful when using trees. Can I conclude that min-max scaling has a positive effect on the RMSE for my data?

martin

1 Answer


You're also scaling $y$, so of course you're getting a lower error. That question was about scaling $X$.

The same model will report very different error metrics when the units of $y$ change: if I multiply all $y$ values by 100, the RMSE becomes 100 times larger; if I divide all $y$ values by 100, the RMSE shrinks by a factor of 100. Min-max scaling squeezes $y$ into $[0, 1]$, so the RMSE shrinks by roughly the range of $y$, which is exactly what you observe. The model itself is not better, only the units are smaller.
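Here is a minimal sketch of this effect on synthetic data (the data, hyperparameters, and variable names are made up for illustration, not taken from your dataset). Fitting the same model on a min-max-scaled target makes the RMSE look tiny, but inverse-transforming the predictions back to the original units recovers an error comparable to the unscaled fit:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 100 * X[:, 0] + rng.normal(size=500)  # target with a wide range

xtrain, xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

# Fit on the original target
model = XGBRegressor(objective='reg:squarederror', n_estimators=100)
model.fit(xtrain, ytrain)
rmse_original = mean_squared_error(ytest, model.predict(xtest)) ** 0.5

# Fit on the min-max-scaled target (scaler fit on the training fold only)
y_scaler = MinMaxScaler()
ytrain_s = y_scaler.fit_transform(ytrain.reshape(-1, 1)).ravel()
model_s = XGBRegressor(objective='reg:squarederror', n_estimators=100)
model_s.fit(xtrain, ytrain_s)
preds_s = model_s.predict(xtest)

# RMSE in scaled units looks tiny...
ytest_s = y_scaler.transform(ytest.reshape(-1, 1)).ravel()
rmse_scaled_units = mean_squared_error(ytest_s, preds_s) ** 0.5

# ...but mapped back to original units it is comparable again
preds_back = y_scaler.inverse_transform(preds_s.reshape(-1, 1)).ravel()
rmse_back = mean_squared_error(ytest, preds_back) ** 0.5

print(rmse_original, rmse_scaled_units, rmse_back)

The fair comparison is between rmse_original and rmse_back, both in the original units of $y$; rmse_scaled_units is small only because the target was squeezed into $[0, 1]$.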

David Masip
  • Thank you very much. When and why do you scale X and y, or only X? Is there a rule, or could you recommend some literature to read? – martin Jul 09 '20 at 11:50
  • The standard thing to do is scale X, as you can see in the sklearn docs: https://scikit-learn.org/stable/modules/preprocessing.html. I haven't ever seen a reason to scale y, but there are reasons to transform it (see the sketch below). – David Masip Jul 09 '20 at 12:06
  • Thank you, I got it. – martin Jul 12 '20 at 20:49
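To illustrate the comment above (a sketch, not from the original thread): in sklearn, scaling X is usually wrapped in a Pipeline, so the scaler is fit only on the training folds, and transforming y can be handled by TransformedTargetRegressor, which maps predictions back to the original units automatically. Ridge and the log transform are purely illustrative choices here; for tree-based models, scaling X would be unnecessary, as the thread says.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Scale X inside a Pipeline so the scaler never sees test data
pipe = make_pipeline(MinMaxScaler(), Ridge())

# Transform y (here log1p, which requires y > -1) rather than scale it;
# predict() returns values in the original units via inverse_func
model = TransformedTargetRegressor(
    regressor=pipe,
    func=np.log1p,
    inverse_func=np.expm1,
)
# model.fit(X, y); model.predict(X)  # predictions come back in original units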