What tools are available for semi-automated matching of dirty columnar data

Question

Are there any automated or semi-automated tools for finding matching "similar" or data in two columnar data sets? The data I'm working with was collected (and handled) by different organizations. Some rows describe the same events and even carry the same unique identifier (about 70% of the smaller data set can be matched using the unique ids). For the unmatched cases, there are slight differences such as typos in the unique id, date differences (+/- 1-2 days), and slight differences in value variables... 997.5 v. 1000, etc.

If, by inspection, the value variables, date, and location are close, I call it a match. Sometimes it is even more obvious such when a one or more digits in the unique ID digit have been transposed or the wrong one has been duplicated.

Thus far, I've been doing this by hand. I've been using R for data cleaning and Excel for comparison and matching (I export comparable subsets, make a crosswalk of old-new unique IDs by hand, and re-import the crosswalk).

What I'd really like is a tool that automates this process by presenting me with information such as: "record 1 in data set A is really similar (for the variables that matter) to records 10, 20, and 25 in data set B." Then I could quickly examine the differences, make a choice (say record 20), and then move on.

I'd like it to feel very similar to the way Ancestry.com helps you match up birth, death, marriage, and census records. But, anything that simplifies and automates this process would be welcome.

Thoughts?

This sounds like a clustering or collaborative filtering problem. Which model to choose depends on your data. Think of it as a recommendation system problem: Which users are the most similar to user x? — Omri374, Mar 01 '16 at 07:44
@Omri374 That makes sense and I agree. I was hoping not to have to build a tool/interface myself if something was already out there. — D. Woods, Mar 01 '16 at 16:42

What tools are available for semi-automated matching of dirty columnar data

0 Answers0