
I want to train a regression model with a tree-based algorithm like XGBoost. Suppose there are five features x1, x2, x3, x4, x5 and a target y, and some experts say that x2 minus x3 is highly correlated with y. Should I put x2-x3 into the model as a sixth feature, or will XGBoost learn it automatically if I just put x1~x5 into the model?

As far as I know, a linear model can learn a formula from the features; can tree-based methods do the same? And if they can, does the size of the data matter?

PoCheng.Lin
  • I believe the model uses the features as they are and does not combine them. So if you think that a combination of features can help, you have to compute it and add it to the dataset. Then use this new dataset for training and testing. – Inuraghe Nov 18 '21 at 11:31

1 Answer


XGBoost will not learn "interactions" like $x_2 - x_3$ on its own: trees split on one feature at a time, so they can only approximate such a relationship through many axis-aligned splits, and how well they do so depends on how much data you have. Feature generation is therefore often used to enhance the explanatory power of $X$; differences $x_n - x_k$ and ratios $x_n / x_k$ are commonly checked and used. There are also tools for automated feature generation, e.g. "Featuretools" for Python.
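As a minimal sketch (synthetic data, assuming the `xgboost` and `scikit-learn` packages are installed), this is how one might check whether an engineered $x_2 - x_3$ column actually helps:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# Synthetic data, just to make the example runnable:
# y is driven by x2 - x3, mirroring the expert's claim.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                                     # x1..x5
y = 3.0 * (X[:, 1] - X[:, 2]) + rng.normal(scale=0.5, size=2000)

# Engineered sixth feature: x2 - x3.
X_eng = np.column_stack([X, X[:, 1] - X[:, 2]])

for name, features in [("raw x1..x5", X), ("with x2 - x3", X_eng)]:
    X_tr, X_te, y_tr, y_te = train_test_split(features, y, random_state=0)
    model = XGBRegressor(n_estimators=200, max_depth=3).fit(X_tr, y_tr)
    print(name, "test MSE:", mean_squared_error(y_te, model.predict(X_te)))
```

On data like this the engineered column typically lowers the test MSE noticeably, and the gap shrinks as the training set grows, which is exactly the data-size effect mentioned above.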

To find out which interactions have the most explanatory power, you can fit trees with only a few splits (three or so) on each possible interaction separately (one shallow model per interaction) and check the prediction error (e.g. MSE, MAE) for each case, such as:

$$ y(x_1-x_2),\; y(x_1/x_2),\; \ldots,\; y(x_1-x_n),\; y(x_1/x_n),$$ $$ y(x_2-x_1),\; y(x_2/x_1),\; \ldots,\; y(x_2-x_n),\; y(x_2/x_n),$$ $$\vdots$$

You could then keep only those interactions which have "high" explanatory power, to avoid ending up with a massive number of features in the model. A sketch of this screening loop follows below.
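Here is a hedged sketch of that screening, assuming `scikit-learn`; the `screen_interactions` helper and the small epsilon guarding against division by zero are my own illustrative choices:

```python
import numpy as np
from itertools import permutations
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def screen_interactions(X, y, max_depth=2):
    """Fit one shallow tree per candidate interaction, rank by CV MSE."""
    scores = {}
    for i, j in permutations(range(X.shape[1]), 2):
        candidates = [
            (f"x{i+1}-x{j+1}", X[:, i] - X[:, j]),
            (f"x{i+1}/x{j+1}", X[:, i] / (X[:, j] + 1e-9)),  # epsilon avoids 0-division
        ]
        for label, feat in candidates:
            tree = DecisionTreeRegressor(max_depth=max_depth)
            mse = -cross_val_score(tree, feat.reshape(-1, 1), y,
                                   scoring="neg_mean_squared_error", cv=5).mean()
            scores[label] = mse
    # Lowest MSE first, i.e. highest explanatory power first.
    return sorted(scores.items(), key=lambda kv: kv[1])

# Usage: keep only the few best-scoring interactions, e.g.
# top5 = screen_interactions(X, y)[:5]
```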

Peter