Large sparse dataset in Catboost

Question

I have a large sparse data matrix (bag of words, with a large number of entries). I can easily pass it as a sparse matrix to sklearn models such as RandomForest. But if I want to use CatBoost, I need to convert it to a dense matrix. Is there an efficient way to work with CatBoost that avoids this? For example, a built-in feature for loading batches, like TensorFlow's TFRecords?

See this issue: https://github.com/catboost/catboost/issues/1 — OmG, Nov 01 '17 at 04:14

score 1 · Answer 1 · answered Mar 30 '23 at 08:50


This is an old question, but as of catboost 0.17, sparse matrices are supported.

A pandas.SparseDataFrame or a scipy.sparse.spmatrix can be passed as the features X, as described in the updated documentation.

Hope this helps!


Dudelstein


score 0 · Answer 2 · answered Jun 01 '21 at 08:59

What is the source of the sparsity? Did you use a one-hot encoder, for example? If so, you did not need to: boosting algorithms can work with the original data, so go back and feed the booster your original features.
You can always use an auto-encoder in TensorFlow to turn your sparse matrix into a dense one and run a boosting algorithm on the result. There are two pitfalls in this approach: 1) boosting algorithms are not good with continuous values, which are what the auto-encoder produces; 2) the auto-encoder is an approximate method, which of course adds to the model's error.
Consider designing your own network architecture that combines boosting and an auto-encoder: for example, a few layers to densify your sparse matrix, followed by a boosted-tree classifier similar to https://www.tensorflow.org/tutorials/estimator/boosted_trees. If you do so, please update this answer.
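To make the "densify first" idea in the second point concrete, here is a rough sketch that swaps in a simpler technique: a truncated SVD from scipy instead of a TensorFlow auto-encoder. It is only a linear stand-in for the compression step, not the auto-encoder itself; the synthetic data and the choice of 8 components are assumptions for illustration.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Synthetic sparse matrix standing in for the bag-of-words features.
X = sp.random(200, 50, density=0.05, format="csr", random_state=0)

# Truncated SVD keeps the k strongest linear directions of the data.
# k = 8 is an arbitrary choice; it must be smaller than min(X.shape).
U, s, Vt = svds(X, k=8)

# Dense low-dimensional representation, one row per document.
# This is the matrix a boosting model would then be trained on.
Z = U * s

# New sparse rows can be projected into the same space with Vt.
Z_again = X @ Vt.T
```

As with the auto-encoder, the compressed features are continuous and approximate, so the two pitfalls listed above apply here as well.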
