Can a gradient boosting library like XGBoost or LightGBM be used on a very large dataset? I have an 820 GB CSV file containing one billion observations, each with 650 features.
Is this too much data for XGBoost? I have searched all over the internet for a solution for when the data won't fit into RAM, to no avail. I read about external-memory training for XGBoost, but the documentation is sparse. Can someone point me in the right direction? Thank you!
Distributed training on Spark is the usual answer at this scale:

- LightGBM on Spark (mmlspark): https://github.com/Azure/mmlspark/blob/master/docs/lightgbm.md
- XGBoost4J-Spark tutorial: https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html
- Spark MLlib classification and regression: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html
Otherwise, is it absolutely necessary to train on all one billion observations? Depending on how hard your problem is, the learning curve likely flattens out after a certain number of observations, perhaps a million or even fewer; a quick way to check this is sketched below.
– aranglol Jun 15 '19 at 22:31