
I am new to this area and have been searching for some time, only to find several different possible approaches but nothing concrete.

Suppose I have a word list such as email_addr, email, email_address, address or, more dissimilarly, first, first_name, firstName, christianName, christian_name, name. What would be the most suitable approach to map each of those lists to a single word, such as email or givenName respectively?

I've seen articles proposing Levenshtein distance, fuzzy matching, difference algorithms, and support vector machines, but I don't think any of them quite satisfies the requirement, unless I am missing something.

I would appreciate any links or pointers for further research.

Essentially, the objective is to categorize all column names in a data set so I can map them to a method for each type of column to generate mock data.
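To make that end goal concrete, here is a minimal sketch of what the final mapping might look like once the categories are known. Everything in it is hypothetical: the `ALIASES` table, the generator functions, and the sample values are illustrative assumptions, and the table is written by hand here, whereas building it automatically is the actual question. Ambiguous names such as address and name are deliberately left out.

```python
import random

# Hypothetical, hand-written alias table: raw column name -> category.
# Producing this table automatically is the open problem.
ALIASES = {
    "email": {"email_addr", "email", "email_address"},
    "givenName": {"first", "first_name", "firstName",
                  "christianName", "christian_name"},
}


def mock_email(rng):
    """Illustrative generator for the 'email' category."""
    return f"user{rng.randint(1, 9999)}@example.com"


def mock_given_name(rng):
    """Illustrative generator for the 'givenName' category."""
    return rng.choice(["Alice", "Bob", "Carol"])


# One generator method per category.
GENERATORS = {"email": mock_email, "givenName": mock_given_name}


def categorize(column):
    """Return the category for a column name, or None if it is unknown."""
    for category, names in ALIASES.items():
        if column in names:
            return category
    return None


def mock_value(column, rng):
    """Generate a mock value for a column, or None if it has no category."""
    category = categorize(column)
    return GENERATORS[category](rng) if category else None
```

With this in place, `mock_value("email_addr", random.Random(0))` would return a value from the email generator, while an unrecognised column returns `None`.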

1 Answer


Some ideas:

  1. "Cluster" words in a single list to find the "closest" matches. This could be useful since in email_addr, email, email_address, address the word address could be seen as an "outlier". You can use affinity propagation to cluster words if needed. However, I think this step is only needed if there is a lot of "variance" in the words.
  2. Once you have a reasonable list of words such as email_addr, email, email_address, you can compute the pairwise Levenshtein distance for each word pair and pick the $n$ "closest" matches ("pairs"). With three words (as above), keeping the single closest pair would likely yield: email_addr, email_address.
  3. Keep as a "truth" the common parts of the $n$ top matches, which could be email_addr or simply email in this case.
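Steps 2 and 3 can be sketched in plain Python as follows. This is only an illustration under some assumptions: the helper names (`levenshtein`, `closest_pairs`, `common_part`) are made up for this example, and "common parts" is interpreted as the shared leading characters, which works for the email example but would not capture a shared suffix such as in first_name and christian_name.

```python
from itertools import combinations
from os.path import commonprefix


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_pairs(words, n=1):
    """Return the n word pairs with the smallest edit distance."""
    return sorted(combinations(words, 2),
                  key=lambda pair: levenshtein(*pair))[:n]


def common_part(words):
    """Shared leading characters, with any trailing underscore removed.

    Assumption: 'common part' means common prefix.
    """
    return commonprefix(list(words)).rstrip("_")


words = ["email_addr", "email", "email_address"]
best = closest_pairs(words, n=1)[0]   # the single closest pair
label = common_part(best)             # "truth" taken from that pair
label_all = common_part(words)        # "truth" taken from all words
```

For this list the closest pair is email_addr and email_address (distance 3), whose common part is email_addr; taking the common part of all three words instead gives email.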

I have a similar problem at the moment and would appreciate any insights from your experience.

Peter
  • For the examples given, I would like all words in the list to "map to" a given word/category. A weight might also be necessary, as address could also be a street address in a different context. – click2install Feb 19 '22 at 12:21