I have close to 50,000 documents in plain text format.

Is there a way I can group similar documents together? Similarity here mostly means content similarity.

Will transforming the text into vectors (using TF-IDF) and running K-Means (unsupervised learning) on top of that help? Are there any better approaches that could be used?
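If it helps, the TF-IDF + K-Means idea can be sketched with scikit-learn. The toy corpus and the cluster count `n_clusters=2` below are illustrative assumptions; with 50,000 real documents you would pick k with something like the silhouette score:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus standing in for the real 50,000 documents
docs = [
    "cats cats dogs pets",
    "dogs pets animals cats",
    "stocks market prices trading",
    "trading stocks market earnings",
]

# Turn each document into a sparse TF-IDF vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Cluster the vectors; k=2 is an assumption for this toy example
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
```

`labels` then gives one cluster id per document, which is the grouping you are after.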

Ethan
praneeth

2 Answers

I did something similar a while ago. We wanted to classify several types of PDF documents.

  • We first extracted the text from the documents.
  • We created NLP features from the text.
  • We then added PDF metadata: file size, number of pages, document name...
  • We built a classification model with a few labeled samples and used Active Learning.

I guess you could also do unsupervised learning, but I prefer supervised learning whenever labels are available.
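A rough sketch of the pipeline above: the texts, metadata values, and labels below are made up for illustration, and `scipy.sparse.hstack` is one way to combine sparse text features with dense metadata columns:

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical extracted text from four PDFs
texts = [
    "invoice total amount due",
    "meeting notes agenda items",
    "invoice payment overdue notice",
    "project agenda discussion notes",
]
# Hypothetical PDF metadata: [file size in KB, number of pages]
meta = np.array([[120, 2], [300, 5], [110, 1], [280, 6]], dtype=float)
y = [0, 1, 0, 1]  # 0 = invoice, 1 = meeting notes (labeled samples)

# NLP features from the text
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(texts)

# Append the metadata columns to the sparse text features
X = hstack([X_text, csr_matrix(meta)])

# Classification model trained on the few labeled samples
clf = LogisticRegression().fit(X, y)
preds = clf.predict(X)
```

In an Active Learning loop you would then score the unlabeled pool with `clf.predict_proba`, label the least-confident documents by hand, and retrain.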

Carlos Mougan

A common approach for this is LDA (Latent Dirichlet Allocation), which not only gives you the groups, but also a way to identify the topics of the groups by giving you the most common or distinctive words for each topic.

Schnipp