Unsupervised Clustering for n-length word arrays

Question

I have a series of arrays

[Apple,Banana,Cherry,Date]
[Apple,Fig,Grape]
[Banana,Cherry,Date,Elderberry]
[Fig,Grape]

and I would like to build clusters that group the arrays based on their overlap:

Group1: Array1 and Array3 as they have 3 common words
Group2: Array2 and Array4 as they have 2 common words
etc.

I was thinking of k-means, but there is no real distance calculation here; the relevant measure is more like overlap.

Does anyone have any suggestions?

Thanks!

score 0 · Accepted Answer · answered Sep 23 '19 at 01:13

Assuming the dimensionality is reasonable, I would not use k-means or any other generic algorithm. Instead, I would write code that directly gives the exact result by building a map of the groups:

// Assuming data is an array of size N containing all the arrays
// clusters is a map associating each group (a set of shared words) with a set of arrays
// overlap(a, b) returns the set of words common to a and b
for i=0 to N-1
  for j=i+1 to N-1
    group = overlap(data[i], data[j])
    if group is not empty
      add data[i] to the set clusters[group]
      add data[j] to the set clusters[group]
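
For reference, here is a minimal Python sketch of the pairwise loop above, using the example arrays from the question. The function name `group_by_overlap` and the `min_shared` threshold are my own illustrative additions, not part of the pseudocode, and arrays are referred to by index rather than stored directly.

```python
from collections import defaultdict
from itertools import combinations

# Example arrays from the question.
data = [
    ["Apple", "Banana", "Cherry", "Date"],
    ["Apple", "Fig", "Grape"],
    ["Banana", "Cherry", "Date", "Elderberry"],
    ["Fig", "Grape"],
]


def group_by_overlap(arrays, min_shared=1):
    """Map each set of shared words to the indices of the arrays sharing it."""
    sets = [frozenset(a) for a in arrays]
    clusters = defaultdict(set)
    # Compare every unordered pair of arrays exactly once.
    for i, j in combinations(range(len(sets)), 2):
        shared = sets[i] & sets[j]
        # Assumption: pairs sharing fewer than min_shared words are ignored.
        if len(shared) >= min_shared:
            clusters[shared].update((i, j))
    return dict(clusters)


clusters = group_by_overlap(data, min_shared=2)
for shared, members in clusters.items():
    print(sorted(shared), sorted(members))
```

With `min_shared=2` this should pair arrays 0 and 2 (Array1 and Array3 in the question) and arrays 1 and 3 (Array2 and Array4). The loop is quadratic in the number of arrays, which is why it only suits a reasonable N.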

An alternative version, if the number of distinct values and the size of the sets allow it, and/or if the groups of interest can be precomputed:

for i=0 to N-1
  for every subset S of data[i]
    add data[i] to the set clusters[S]
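
A rough Python sketch of this subset-based variant follows. The name `index_by_subsets`, the `min_size` parameter, and the final filter that keeps only subsets shared by at least two arrays are assumptions added for illustration. An array with k distinct words has about 2^k subsets, so this only works for short arrays.

```python
from collections import defaultdict
from itertools import combinations

# Example arrays from the question.
data = [
    ["Apple", "Banana", "Cherry", "Date"],
    ["Apple", "Fig", "Grape"],
    ["Banana", "Cherry", "Date", "Elderberry"],
    ["Fig", "Grape"],
]


def index_by_subsets(arrays, min_size=2):
    """Map each word subset to the indices of the arrays that contain it."""
    index = defaultdict(set)
    for i, words in enumerate(arrays):
        unique = sorted(set(words))
        # Enumerate every subset with at least min_size words.
        for size in range(min_size, len(unique) + 1):
            for subset in combinations(unique, size):
                index[frozenset(subset)].add(i)
    # Assumption: only subsets shared by two or more arrays are of interest.
    return {s: members for s, members in index.items() if len(members) > 1}


shared_index = index_by_subsets(data)
for subset, members in shared_index.items():
    print(sorted(subset), sorted(members))
```

Unlike the pairwise version, this makes one pass over the arrays, at the cost of enumerating the subsets of each one.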

score 0 · Answer 2 · answered Oct 23 '19 at 04:23

The Jaccard index is the size of the intersection divided by the size of the union of two sets of discrete elements. The Jaccard distance, one minus the Jaccard index, could serve as the metric for clustering those groups of words.
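
As a small illustration, the Jaccard distance can be computed in plain Python as sketched below, using the arrays from the question. The helper name `jaccard_distance` and the choice to treat two empty sets as distance 0 are assumptions of this sketch.

```python
# Example arrays from the question.
data = [
    ["Apple", "Banana", "Cherry", "Date"],
    ["Apple", "Fig", "Grape"],
    ["Banana", "Cherry", "Date", "Elderberry"],
    ["Fig", "Grape"],
]


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Full pairwise distance matrix (symmetric, zeros on the diagonal).
distances = [[jaccard_distance(x, y) for y in data] for x in data]
for row in distances:
    print([round(d, 3) for d in row])
```

A matrix like this could then be passed to any clustering method that accepts precomputed pairwise distances, such as hierarchical clustering, which avoids the need for coordinates that k-means requires.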

Brian Spiering
