Is it correct to create topic models using both train and test data?

Question

I have a dataset of text documents split into train and test sets. My task is binary classification: labeling each document as either 1 or -1. I have already computed some features using TF-IDF and n-grams and tested my model. Now I want to add features from topic models (LDA and LSA) to see whether they improve the F1-score and overall performance of my model.

My question is fairly simple: should I create my topic models using only my train set? Or, since topic models are not built from a label (the target or dependent variable), would it be correct to use both the train and test sets to create the topic model?

score 3 · Answer 1 · answered Oct 29 '18 at 20:54

You should use only your training set in this context, though it may help to hold out an additional cross-validation set for feature selection. If you use test data in your feature selection step, you will optimistically bias your model's evaluation and expose it to overfitting because of data leakage. A similar scenario is also described in this post.
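As a minimal sketch of what "training set only" can look like in practice, assuming scikit-learn is the library in use (the question does not say), the topic models can be placed inside a pipeline so they are fitted on the training documents and merely applied to the test documents. The toy documents, the labels, the choice of `LogisticRegression`, and `n_components=2` are all hypothetical placeholders, not values from the question.

```python
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union

# Hypothetical toy data standing in for the real train/test split.
train_docs = [
    "the match ended with a late goal",
    "the striker scored twice in the final",
    "the team won the league title",
    "the central bank raised interest rates",
    "stock markets fell after the rate decision",
    "inflation figures worried investors",
]
train_labels = [1, 1, 1, -1, -1, -1]
test_docs = [
    "the goalkeeper saved a penalty in the final",
    "investors expect the bank to cut rates",
]

# LDA is fitted on raw term counts; LSA is TruncatedSVD applied to TF-IDF.
lda_topics = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
)
lsa_topics = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
)

# Concatenate the existing TF-IDF n-gram features with both topic features.
features = make_union(
    TfidfVectorizer(ngram_range=(1, 2)),
    lda_topics,
    lsa_topics,
)
model = make_pipeline(features, LogisticRegression(max_iter=1000))

# fit() sees only the training documents: vocabularies, IDF weights,
# and topic models are all learned from the training set.
model.fit(train_docs, train_labels)

# predict() only transforms the test documents with the fitted models.
test_pred = model.predict(test_docs)
print(test_pred)
```

Because every preprocessing step lives inside the pipeline, passing the same `model` to a cross-validation helper would refit the topic models on each training fold, which keeps the held-out fold out of the feature construction as well.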

kevins_1

Since topic modeling is unsupervised and has nothing to do with the labels in the dataset, I thought creating the topic model from the whole data (train and test sets) would be OK. But apparently I should use only the training set. – Pedram Oct 29 '18 at 21:26
