I have a series of arrays:

[Apple,Banana,Cherry,Date]
[Apple,Fig,Grape]
[Banana,Cherry,Date,Elderberry]
[Fig,Grape]

and I would like to build clusters that group the arrays by how much they overlap:

Group1: Array1 and Array3, as they have 3 words in common
Group2: Array2 and Array4, as they have 2 words in common
etc.

I was thinking of k-means, but there is not really a distance calculation here; it is more of an overlap measure.

Does anyone have any suggestions?

Thanks!

Ethan
Jamie Dixon

2 Answers

Assuming the dimensionality is reasonable, I would not use k-means or any generic algorithm. Instead, I would write code that directly gives the exact result by building a map of the groups:

// Assuming data is an array of size N containing all the arrays
// clusters is a map associating each group with a set of arrays
for i=0 to N-1
  for j=i+1 to N-1
    group = overlap(data[i], data[j])
    add data[i] to the set clusters[group]
    add data[j] to the set clusters[group]
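
For reference, the pseudocode above could be sketched in Python roughly as follows. This is only a minimal sketch under a few assumptions that are not in the original: `overlap` is taken to be plain set intersection, arrays are identified by their index, and the hypothetical `min_overlap` parameter skips pairs that share fewer words than that.

```python
from collections import defaultdict
from itertools import combinations


def cluster_by_overlap(data, min_overlap=1):
    """Map each pairwise overlap (a frozenset of words) to the indices
    of the arrays that share it.

    min_overlap is an illustrative assumption: pairs sharing fewer
    words than this are not grouped at all.
    """
    sets = [frozenset(arr) for arr in data]
    clusters = defaultdict(set)
    # Visit every unordered pair (i, j) with i < j, as in the pseudocode.
    for i, j in combinations(range(len(sets)), 2):
        group = sets[i] & sets[j]  # overlap taken as set intersection
        if len(group) >= min_overlap:
            clusters[group].update((i, j))
    return dict(clusters)


data = [
    ["Apple", "Banana", "Cherry", "Date"],
    ["Apple", "Fig", "Grape"],
    ["Banana", "Cherry", "Date", "Elderberry"],
    ["Fig", "Grape"],
]

clusters = cluster_by_overlap(data)
strict_clusters = cluster_by_overlap(data, min_overlap=2)

for group, members in strict_clusters.items():
    print(sorted(group), "->", sorted(members))
```

Note that with the default `min_overlap=1` the first two arrays also end up grouped under `{Apple}`; requiring at least two shared words leaves only the two groups described in the question (indices are 0-based here).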

An alternative version, usable if the number of distinct values and the size of the sets allow it and/or if the groups of interest can be precomputed:

for i=0 to N-1
  for every subset S of data[i]
    add data[i] to the set clusters[S] 
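
A possible Python sketch of this subset-based version is shown below. It enumerates every subset of every array, so the cost grows exponentially with array length and it is only practical for short arrays. Two details are assumptions added for illustration: the hypothetical `min_size` parameter ignores subsets smaller than that, and subsets contained in only one array are dropped at the end.

```python
from collections import defaultdict
from itertools import combinations


def cluster_by_subsets(data, min_size=2):
    """Map each subset of words to the indices of the arrays containing it.

    min_size and the final filtering step are illustrative assumptions,
    not part of the original pseudocode.
    """
    clusters = defaultdict(set)
    for i, arr in enumerate(data):
        items = sorted(set(arr))
        # Enumerate all subsets of this array with at least min_size words.
        for k in range(min_size, len(items) + 1):
            for subset in combinations(items, k):
                clusters[frozenset(subset)].add(i)
    # Keep only subsets that appear in at least two arrays.
    return {s: idx for s, idx in clusters.items() if len(idx) > 1}


data = [
    ["Apple", "Banana", "Cherry", "Date"],
    ["Apple", "Fig", "Grape"],
    ["Banana", "Cherry", "Date", "Elderberry"],
    ["Fig", "Grape"],
]

subset_clusters = cluster_by_subsets(data)

# Show the largest shared subsets first.
for group, members in sorted(subset_clusters.items(), key=lambda kv: -len(kv[0])):
    print(sorted(group), "->", sorted(members))
```

Unlike the pairwise version, this also reports the smaller shared subsets (for example `{Banana, Cherry}`), which may or may not be wanted depending on which groups are of interest.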
Erwan

The Jaccard index is the size of the intersection divided by the size of the union of two sets of discrete elements. The Jaccard distance, one minus the Jaccard index, could serve as the metric for clustering those groups of words.
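
As a rough illustration, the Jaccard distance 1 - |A ∩ B| / |A ∪ B| can be computed with plain Python sets. The sketch below uses the arrays from the question; the function names are made up for this example, and treating two empty sets as having distance 0 is a convention assumed here.

```python
def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_jaccard(data):
    """Build the full symmetric matrix of Jaccard distances."""
    n = len(data)
    return [[jaccard_distance(data[i], data[j]) for j in range(n)]
            for i in range(n)]


data = [
    ["Apple", "Banana", "Cherry", "Date"],
    ["Apple", "Fig", "Grape"],
    ["Banana", "Cherry", "Date", "Elderberry"],
    ["Fig", "Grape"],
]

distances = pairwise_jaccard(data)
for row in distances:
    print(["%.2f" % d for d in row])
```

Since k-means needs to average points into centroids, a matrix like this is probably better suited to a method that works directly on pairwise distances, such as hierarchical clustering.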

Brian Spiering