I need to extract relevant key phrases from a single document. Since I don't have a lot of documents, TF-IDF doesn't really work.

Currently I'm using TextRank. It produces okay-ish results: some really good phrases along with a lot of garbage.

Is there a better algorithm to use for this? Can anyone give me a rundown of available options?

Real-world use case: I'm developing a help desk app that comes with a Knowledge Base (a bunch of articles, think of it as an FAQ). When a user writes a new support ticket, I want to extract key phrases and find the most relevant KB articles. Overall, there is not enough data to train a model. I think I need to compare sets of key phrases.

1 Answer

A useful search term for your case is "single document keyword extraction". A good paper on this topic has the following abstract:

We present a new keyword extraction algorithm that applies to a single document without using a corpus. Frequent terms are extracted first, then a set of cooccurrence between each term and the frequent terms, i.e., occurrences in the same sentences, is generated. Co-occurrence distribution shows importance of a term in the document as follows. If probability distribution of co-occurrence between term a and the frequent terms is biased to a particular subset of frequent terms, then term a is likely to be a keyword. The degree of biases of distribution is measured by the $\chi^2$-measure. Our algorithm shows comparable performance to tfidf without using a corpus.

You can find the paper here.

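In short, the paper ranks candidate keywords by this $\chi^2$-measure: the more a term's co-occurrence with the frequent terms is biased toward a particular subset of them, the higher it ranks.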
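To make the idea concrete, here is a minimal Python sketch of this kind of scoring. It is my own simplified reading of the abstract, not the authors' implementation. For each term $w$ it computes roughly $\chi^2(w) = \sum_{g \in G} (freq(w, g) - n_w p_g)^2 / (n_w p_g)$, where $G$ is the set of frequent terms, $freq(w, g)$ is the number of sentences in which $w$ and $g$ both appear, $n_w$ is the total co-occurrence count of $w$, and $p_g$ is the share of all co-occurrences that involve $g$. The function name `chi2_keywords`, the `top_frequent` and `stopwords` parameters, the naive regex tokenizer and the way $p_g$ is estimated are all assumptions made for illustration, and any further refinements the paper describes are left out.

```python
import re
from collections import Counter, defaultdict


def chi2_keywords(text, top_frequent=10, stopwords=frozenset()):
    """Rank terms of a single document by a chi-square co-occurrence score.

    Simplified sketch: sentence splitting and tokenization are naive, and
    the expected probabilities are estimated from the co-occurrence counts.
    """
    # Split into sentences, then into lowercase tokens without stopwords.
    sentences = []
    for raw in re.split(r"[.!?]+", text):
        tokens = [t for t in re.findall(r"[a-z0-9]+", raw.lower())
                  if t not in stopwords]
        if tokens:
            sentences.append(tokens)

    # The frequent terms G are the most common terms in the document.
    term_freq = Counter(t for s in sentences for t in s)
    frequent = [t for t, _ in term_freq.most_common(top_frequent)]

    # cooc[w][g]: number of sentences containing both term w and frequent term g.
    cooc = defaultdict(Counter)
    for s in sentences:
        present = set(s)
        for w in present:
            for g in frequent:
                if g != w and g in present:
                    cooc[w][g] += 1

    total = sum(sum(row.values()) for row in cooc.values())
    if total == 0:
        return []

    # p[g]: share of all co-occurrences that involve frequent term g.
    p = {g: sum(cooc[w][g] for w in cooc) / total for g in frequent}

    scores = {}
    for w, row in cooc.items():
        n_w = sum(row.values())  # total co-occurrences of w with frequent terms
        chi2 = 0.0
        for g in frequent:
            expected = n_w * p[g]
            if g != w and expected > 0:
                chi2 += (row[g] - expected) ** 2 / expected
        scores[w] = chi2

    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    ticket = ("The printer is offline. The printer driver fails. "
              "Reset the router password. The router drops wifi. "
              "Printer driver update fixes printer.")
    for term, score in chi2_keywords(ticket, top_frequent=3,
                                     stopwords={"the", "is"}):
        print(f"{term}\t{score:.3f}")
```

For the help desk use case you could run something like this on the ticket and on each KB article and then compare the top-ranked terms. Very short texts give noisy counts, so expect to need a stopword list and some minimum-frequency filtering.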

OmG