I have a large sparse data matrix (bag of words, with a large number of entries). I can easily pass it as a sparse matrix to sklearn models such as RandomForest, but to use CatBoost I need to turn it into a dense matrix. Is there an efficient way to work with CatBoost that avoids this conversion? For example, is there a built-in feature, similar to TensorFlow's TFRecords, for loading the data in batches?

Stephen Rauch

2 Answers

This is an old question, but as of catboost 0.17, sparse matrices are supported.

A pandas.SparseDataFrame or a scipy.sparse.spmatrix can be passed as the feature matrix X, as described in the updated documentation.
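As a rough illustration, the sketch below builds a small bag-of-words matrix in SciPy's CSR format and hands it to CatBoost without densifying it. The helper names `build_bow_matrix` and `fit_catboost_on_sparse` are made up for this example, the whitespace tokenizer is a stand-in for a real vectorizer, and the training part assumes catboost 0.17 or later is installed.

```python
import numpy as np
from scipy import sparse


def build_bow_matrix(docs):
    """Build a CSR bag-of-words count matrix from whitespace-tokenized docs.

    Hypothetical helper for illustration; a real pipeline would more likely
    use a proper vectorizer.
    """
    vocab = {}
    rows, cols = [], []
    for row, doc in enumerate(docs):
        for token in doc.lower().split():
            # Assign each new token the next free column index.
            col = vocab.setdefault(token, len(vocab))
            rows.append(row)
            cols.append(col)
    data = np.ones(len(rows), dtype=np.float32)
    # Duplicate (row, col) pairs are summed during the COO -> CSR conversion,
    # which turns the ones into per-document token counts.
    X = sparse.coo_matrix(
        (data, (rows, cols)), shape=(len(docs), len(vocab))
    ).tocsr()
    return X, vocab


def fit_catboost_on_sparse(X, y):
    """Train a small CatBoost classifier directly on a SciPy sparse matrix.

    Assumes catboost >= 0.17; imported lazily so the matrix-building part
    can be used without catboost installed.
    """
    from catboost import CatBoostClassifier, Pool

    train_pool = Pool(data=X, label=y)  # X stays sparse, no .toarray()
    model = CatBoostClassifier(iterations=20, verbose=False)
    model.fit(train_pool)
    return model
```

Calling `X, vocab = build_bow_matrix(docs)` followed by `fit_catboost_on_sparse(X, labels)` should train on the sparse data as-is; the iteration count here is arbitrary and only keeps the toy run short.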

Hope this helps!

Dudelstein

  1. What is the source of the sparsity? Did you use a one-hot encoder, for example? If so, you do not need to do that for a boosting algorithm, so go back and feed the booster your original data.

  2. You can always use an autoencoder in TensorFlow to turn your sparse matrix into a dense one and run a boosting algorithm on the result. There are two pitfalls in this approach: 1) boosting algorithms are not good with continuous values, which are what the autoencoder produces; 2) the autoencoder is an approximate method, which adds to the model's error.

  3. Consider designing your own network architecture that combines boosting and an autoencoder: for example, a few layers to densify your sparse matrix, followed by a boosted-tree classifier similar to https://www.tensorflow.org/tutorials/estimator/boosted_trees. If you do this, please update this answer.

user702846