
I have bag-of-words data for a collection of documents. The data has 3 columns: {document number, word number, count of the word in the document}. I am supposed to generate frequent itemsets of a particular size.

I thought I would make a list of all the words that appear in each document, build a table from these lists, and then generate frequent itemsets using Mlxtend or Orange. However, this approach does not seem to be efficient.
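
For reference, here is roughly what I had in mind with Mlxtend (just a sketch; the column names doc and word are placeholders for my data):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Raw data with one row per (document, word, count) record
data = pd.read_csv("bag_of_words.csv")

# One transaction per document: the list of words it contains
transactions = data.groupby("doc")["word"].apply(list).tolist()

# One-hot encode the transactions into a boolean document-word table
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets, then filter down to a particular size (here 3)
itemsets = apriori(onehot, min_support=0.01, use_colnames=True)
itemsets = itemsets[itemsets["itemsets"].apply(len) == 3]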


1 Answer


If the size is reasonable (i.e. not too many documents and not too many words in a document), you could try to build a map for each possible itemset, for instance like this:

// Assuming data is an array of size N containing all the documents
// clusters is a map associating each itemset with a set of documents
for i=0 to N-1
  for j=i+1 to N-1
    group = overlap(data[i], data[j])
    add data[i] to the set clusters[group]
    add data[j] to the set clusters[group]
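
A rough Python sketch of that idea, assuming each document is already represented as a set of word ids (the names docs and build_clusters are just placeholders):

from collections import defaultdict

def build_clusters(docs):
    # clusters maps each itemset (a frozenset of words) to the set of
    # document indices that share it
    clusters = defaultdict(set)
    n = len(docs)
    for i in range(n):
        for j in range(i + 1, n):
            group = frozenset(docs[i] & docs[j])  # overlap of the two documents
            if group:
                clusters[group].add(i)
                clusters[group].add(j)
    return clusters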

An alternative version, if the number of distinct values and the size of the sets allow it, and/or if it's possible to precompute the itemsets of interest:

for i=0 to N-1
  for every subset S of data[i]
    add data[i] to the set clusters[S] 
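
And a sketch of this alternative, assuming the itemsets of interest are all subsets of a fixed size (itemset_size is a placeholder parameter):

from collections import defaultdict
from itertools import combinations

def build_clusters_by_subset(docs, itemset_size):
    # clusters maps each candidate itemset to the set of documents containing it
    clusters = defaultdict(set)
    for doc_id, words in enumerate(docs):
        for subset in combinations(sorted(words), itemset_size):
            clusters[frozenset(subset)].add(doc_id)
    return clusters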

(adapted from https://datascience.stackexchange.com/a/60609/64377)

Erwan