I'm sorry if the title is misleading, but I didn't really know how to describe what I am looking for. I have a dataset with two columns holding the names and surnames of a number of people. The same person may appear in multiple records. However, sometimes the name is entered in the surname field and vice versa, and there may also be typing mistakes. I was thinking about merging the two fields into a single string (NameSurname) in order to find similar records and fix the fields. I have looked at some string similarity metrics, but the most popular ones compare consecutive characters and would fail to recognize SurnameName and NameSurname as the same string. Is there any metric that is robust to this? Thanks a lot in advance.
You can always perform two searches, one for "NameSurname" and one for "SurnameName", and merge the two result sets. For robustness to typos, you can use fuzzy matching techniques (e.g. this) – noe Feb 22 '23 at 13:32
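A minimal sketch of the two-search idea above, assuming each record is a (name, surname) pair: score both field orders with a fuzzy ratio and keep the better one. The function names are hypothetical, and `difflib.SequenceMatcher` from the Python standard library is only one possible choice of fuzzy matcher, not necessarily the one linked in the comment.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] from difflib.SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def swap_tolerant_similarity(rec_a, rec_b) -> float:
    """Compare two (name, surname) records in both field orders.

    Hypothetical helper: the record layout is an assumption made
    for illustration, not something stated in the question.
    """
    a = " ".join(rec_a)
    straight = ratio(a, " ".join(rec_b))           # NameSurname vs NameSurname
    swapped = ratio(a, " ".join(reversed(rec_b)))  # NameSurname vs SurnameName
    return max(straight, swapped)


if __name__ == "__main__":
    print(swap_tolerant_similarity(("John", "Smith"), ("Smith", "John")))
    print(swap_tolerant_similarity(("John", "Smith"), ("Smith", "Jhon")))
    print(swap_tolerant_similarity(("John", "Smith"), ("Mary", "Jones")))
```

Taking the maximum of the two scores means a swapped record is never penalized, while typos still lower the score gradually. The threshold for treating two records as the same person would need tuning on the actual data.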
You might be interested in the different kinds of similarity measures here and there – Erwan Feb 23 '23 at 15:45
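As one illustration of a measure that ignores field order, here is a rough sketch using Jaccard similarity over character bigrams collected from each field separately. This is an assumption about what could fit the question, not necessarily one of the measures the links point to, and the function names are hypothetical.

```python
def bigrams(word: str) -> set:
    """Set of lowercase character bigrams in a single field."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def record_bigrams(record) -> set:
    """Union of the bigrams of every field; field order does not matter."""
    grams = set()
    for field in record:
        grams |= bigrams(field)
    return grams


def jaccard_similarity(rec_a, rec_b) -> float:
    """Jaccard index |A & B| / |A | B| of the two records' bigram sets."""
    a, b = record_bigrams(rec_a), record_bigrams(rec_b)
    if not a and not b:
        return 1.0  # two empty records are treated as identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print(jaccard_similarity(("John", "Smith"), ("Smith", "John")))
    print(jaccard_similarity(("John", "Smith"), ("Smyth", "John")))
    print(jaccard_similarity(("John", "Smith"), ("Mary", "Jones")))
```

Because the bigrams are taken per field and pooled into a set, swapping name and surname leaves the score unchanged. The trade-off is that short names produce few bigrams, so a single typo can lower the score noticeably.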