I have a large sparse data matrix (bag of words over a large number of entries). I can easily treat it as a sparse matrix in sklearn models such as RandomForest. But if I want to use Catboost, I need to turn it into a dense matrix. Is there an efficient way to work with Catboost that avoids this conversion? For example, a built-in feature similar to TensorFlow's TFRecords for loading batches.

See this issue: https://github.com/catboost/catboost/issues/1 – OmG Nov 01 '17 at 04:14
2 Answers
This is an old question, but as of catboost 0.17, sparse matrices are supported.
A pandas.SparseDataFrame or a scipy.sparse.spmatrix can be passed as the features X, as described in the updated documentation.
Hope this helps!
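A minimal sketch of what this could look like, assuming catboost 0.17 or later, numpy, and scipy are installed. The random matrix stands in for a real bag-of-words matrix, and the helper names make_bag_of_words and fit_catboost_on_sparse are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy import sparse


def make_bag_of_words(n_docs=200, n_terms=1000, density=0.01, seed=0):
    """Build a toy CSR matrix standing in for a bag-of-words matrix (illustrative data only)."""
    rng = np.random.default_rng(seed)
    X = sparse.random(
        n_docs, n_terms, density=density, format="csr",
        random_state=seed, dtype=np.float32,
    )
    y = rng.integers(0, 2, size=n_docs)  # random binary labels
    return X, y


def fit_catboost_on_sparse(X, y, iterations=20):
    """Fit a small CatBoost classifier directly on a scipy sparse matrix.

    Assumes catboost >= 0.17, where sparse input is supported.
    """
    from catboost import CatBoostClassifier, Pool

    train_pool = Pool(data=X, label=y)  # no .toarray() call needed
    model = CatBoostClassifier(iterations=iterations, verbose=False)
    model.fit(train_pool)
    return model


if __name__ == "__main__":
    X, y = make_bag_of_words()
    model = fit_catboost_on_sparse(X, y)
    print(model.predict(X[:5]))
```

The matrix stays in CSR form from construction through training, so the dense copy the question wants to avoid is never created by this code.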

What is the source of the sparsity? Did you use a one-hot encoder, for example? If so, you don't need one with a boosting algorithm, so go back and feed the booster your original data.
You can always use an auto-encoder in TensorFlow to turn your sparse matrix into a dense one and run a boosting algorithm on the result. There are two pitfalls in this approach: 1) boosting algorithms are not good with continuous values, which are what the auto-encoder produces; 2) the auto-encoder is an approximate method, which of course adds to the model's error.
Consider designing your own network architecture that combines boosting and an auto-encoder. For example, a few layers to densify your sparse matrix, followed by a boosted tree classifier similar to https://www.tensorflow.org/tutorials/estimator/boosted_trees. If you do so, please update this answer.
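On the first point (skipping the one-hot encoder), here is a minimal sketch of passing raw categorical columns to CatBoost through its cat_features argument, assuming catboost is installed. The toy rows, labels, and helper names are hypothetical and only for illustration; this does not apply to a bag-of-words matrix, only to data whose sparsity came from one-hot encoding.

```python
def make_raw_categorical_data():
    """Return hypothetical toy rows with two raw categorical columns and one numeric column."""
    rows = [
        ["red", "S", 1.0],
        ["blue", "M", 2.5],
        ["green", "L", 0.3],
        ["red", "L", 4.1],
        ["blue", "S", 3.3],
        ["green", "M", 0.9],
    ]
    labels = [0, 1, 0, 1, 1, 0]
    cat_feature_indices = [0, 1]  # column positions holding category strings
    return rows, labels, cat_feature_indices


def fit_without_one_hot(rows, labels, cat_feature_indices, iterations=10):
    """Fit CatBoost on the raw categories; no one-hot encoding is done in this code."""
    from catboost import CatBoostClassifier

    model = CatBoostClassifier(iterations=iterations, verbose=False)
    model.fit(rows, labels, cat_features=cat_feature_indices)
    return model


if __name__ == "__main__":
    rows, labels, cat_idx = make_raw_categorical_data()
    model = fit_without_one_hot(rows, labels, cat_idx)
    print(model.predict(rows[:2]))
```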
