How to process word similarity and categorize a group of words to a single word

Question

I am new to this area and have been searching for some time, only to find several different possible approaches but nothing concrete.

Suppose I have a word list such as email_addr, email, email_address, address or, more dissimilarly, first, first_name, firstName, christianName, christian_name, name. What would be the most suitable approach to classify each of those lists to a single word, like email or givenName respectively?

I've seen some articles proposing Levenshtein distance, fuzzy matching, difference algorithms, and support vector machines, but I don't think any of them quite satisfies the requirement, unless I am missing something.

Would appreciate any links or direction to research.

Essentially, the objective is to categorize all column names in a data set so I can map them to a method for each type of column to generate mock data.

Hey, did you find any solution or approach for this problem? — Gopi, Feb 10 '24 at 06:42

score 0 · Answer 1 · answered Feb 19 '22 at 12:07

Some ideas:

1. "Cluster" the words in a single list to find the "closest" matches. This could be useful because, in email_addr, email, email_address, address, the word address could be seen as an "outlier". You can use affinity propagation to cluster the words if needed. However, I think this step is only needed if there is a lot of "variance" in the words.
2. Once you have an okay list of words such as email_addr, email, email_address, you can compute the pairwise Levenshtein distance for each word pair and pick the $n$ "closest" matches ("pairs"). With three words (as above), keeping the two closest matches would likely yield the pair email_addr, email_address.
3. Keep as a "truth" the common part of the $n$ top matches, which could be email_addr or simply email in this case.
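
The pairwise-distance and common-part ideas above could be sketched roughly as follows. This is only a minimal illustration using the Python standard library: the function names (levenshtein, closest_pairs, common_part) are made up for this example, the edit distance is a plain dynamic-programming version, and "common part" is assumed here to mean the longest shared contiguous substring, which is just one possible reading.

```python
from difflib import SequenceMatcher
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def closest_pairs(words, n=1):
    """Return the n word pairs with the smallest edit distance (hypothetical helper)."""
    return sorted(combinations(words, 2), key=lambda pair: levenshtein(*pair))[:n]


def common_part(a: str, b: str) -> str:
    """Longest contiguous substring shared by a and b (one assumed notion of 'common part')."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


words = ["email_addr", "email", "email_address"]
top = closest_pairs(words, n=1)   # the single closest pair
label = common_part(*top[0])      # candidate "truth" for this list
print(top, label)
```

For the three example words, the closest pair is email_addr and email_address, and their shared part is email_addr. Getting down to simply email would need an extra rule, for example intersecting the common part with the remaining words in the list.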

I have a similar problem at the moment and would appreciate any insights from your experience.

For the examples given, I would like all words in the list to "map to" a given word/category. A weight might also be necessary, as address could also be a street address in a different context. — click2install, Feb 19 '22 at 12:21
