Can a gradient boosting library like XGBoost or LightGBM be used on a very large dataset? I have an 820 GB CSV file containing one billion observations, each with 650 features.
Is this too much data for XGBoost? I have searched all over the internet for a solution for when the data won't fit into RAM, to no avail. I read about external-memory training for XGBoost, but the documentation is sparse. Can someone point me in the right direction? Thank you!
Distributed training on Spark is the usual answer at this scale:

- LightGBM on Spark (mmlspark): https://github.com/Azure/mmlspark/blob/master/docs/lightgbm.md
- XGBoost4J-Spark tutorial: https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html
- Spark MLlib classification and regression: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html
Otherwise, is it absolutely necessary to train on all one billion observations? Depending on how hard your problem is, the learning curve likely flattens out after a certain number of observations, perhaps a million or even fewer; a quick way to check this is sketched below.
– aranglol Jun 15 '19 at 22:31