
I had to move house recently and found it tedious to check every potential place's broadband coverage. For that reason, I want to build a website where you can check which broadband options are available at a specific address.

I have downloaded JSON documents describing broadband internet options for a place from three different providers. Each has their own JSON format. Currently they are stored in a MongoDB. I have roughly 3x 100k documents.

Example JSON files:

Provider A:

{
  "sid": "1734090001",
  "keyword_1": "HK",
  "keyword_2": "ABERDEEN",
  "keyword_3": "ABBA HOUSE ABERDEEN MAIN RD 225",
  "keyword_4": "BLK A"
}

Provider B:

{
  "area": "HK",
  "address": "BLOCK A ABBA HOUSE  ABERDEEN HK HK",
  "district": "ABERDEEN HK",
  "latitude": "22.24822",
  "address_short": "BLOCK A ABBA HOUSE ",
  "address_en": "BLOCK A ABBA HOUSE  ABERDEEN HK HK",
  "longitude": "114.152852"
}

Provider C:

{
  "AREA_CD": "HK",
  "NAME_EN": "ABBA HOUSE BLOCK A",
  "STREET_NAME_EN": "ABERDEEN MAIN ROAD",
  "STREET_NUM": "225",
  "area_desc_en": "HONG KONG",
  "district_desc_en": "ABERDEEN",
  "housing_addr_en": "ABBA HOUSE BLOCK A, 225 ABERDEEN MAIN ROAD",
  "lat": "22.248245",
  "lng": "114.15278"
}

For some addresses, matching is very easy because the description is exactly the same, down to the last letter. I just need to describe which field in each JSON format corresponds to which.
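For the easy cases, a plain field mapping per provider is enough. A minimal sketch (the common field names are my own choice, and `normalize` is just a small helper that upper-cases and collapses whitespace):

```python
import re


def normalize(s):
    """Upper-case and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", s.strip().upper())


def from_provider_a(doc):
    return {
        "area": normalize(doc["keyword_1"]),
        "district": normalize(doc["keyword_2"]),
        "address": normalize(doc["keyword_3"] + " " + doc["keyword_4"]),
    }


def from_provider_b(doc):
    return {
        "area": normalize(doc["area"]),
        "district": normalize(doc["district"]),
        "address": normalize(doc["address_en"]),
    }


def from_provider_c(doc):
    return {
        "area": normalize(doc["AREA_CD"]),
        "district": normalize(doc["district_desc_en"]),
        "address": normalize(doc["housing_addr_en"]),
    }
```

This already makes the three sample documents above comparable, but of course it only matches when the normalized strings happen to agree.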

I started by writing a Django app with an address model, and then wrote import routines for each JSON format. But I found that not to be very reliable due to different spellings, abbreviations, etc.

I have considered a few ways to improve the matching and did some manual testing, but found it hard to tell whether there was a real improvement.

Can you think of better ways to match the JSON documents? I am not sure if mine was the right approach or if there are better ways with confidence levels or something like that. How would you design an application like that? I read about Elasticsearch and that sounded interesting, but I have no knowledge about it yet.

Thanks!

btzs
  • 101

1 Answer


Your question doesn't feel like an architectural question so much as an implementation question.

Implementation-wise, all of the techniques you considered are valid ways to solve the high-level problem. Take your pick; if one doesn't fit, try another.

Which leads me to the architectural problem behind your implementation query: how do I map between data sets with no clear bridging key?

Which just so happens to be the study of ontology matching. This field goes pretty deep, and for the most part it is an unsolved problem: there are many specific solutions and heuristics, but no golden solution that, if followed, always works.

One heuristic we can observe is that most things are the only individual at a given intersection of sets of properties.

From this we can develop an algorithm that needs some human interaction.


To start with it would help to initialise some sets:

  • the set of postcodes
  • the set of suburbs
  • the set of street types
  • etc...

They don't have to be complete, just a reasonable number of sets and members (hopefully about addresses; cake recipes won't help much here).

Then group each record by which intersection of sets it lives in. Some of these intersections will be empty, some will have one record, others many.

Do this one file at a time, because each file has a reasonable chance of containing no duplicate records; the same cannot be said across files (particularly as they are meant to overlap).
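A minimal sketch of that grouping, assuming each record is an address string and a record's intersection is given by which (set, member) pairs its tokens hit (the seed sets here are made up for illustration):

```python
from collections import defaultdict

# Seed sets; they don't have to be complete to start.
SETS = {
    "area": {"HK", "KLN", "NT"},
    "street_type": {"RD", "ROAD", "ST", "STREET"},
    "district": {"ABERDEEN", "CENTRAL"},
}


def signature(address):
    """The (set name, member) pairs this address's tokens hit.

    Records with the same signature live in the same intersection.
    """
    tokens = address.upper().split()
    return frozenset(
        (name, token)
        for token in tokens
        for name, members in SETS.items()
        if token in members
    )


def group_file(records):
    """Group one file's records by their intersection."""
    groups = defaultdict(list)
    for record in records:
        groups[signature(record)].append(record)
    return groups
```

Note that with these seed sets, "ABBA HOUSE ABERDEEN MAIN RD 225 BLK A" and "BLOCK A ABBA HOUSE ABERDEEN HK HK" land in different intersections; refining the sets until such records converge is exactly what the manual steps below are for.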

Pick the largest group of unsorted records.

  • identify keywords that should be part of an existing set, and add them. eg: st, which belongs in street types
  • identify new sets (like street names, place names, etc...) eg: unit numbers (such as U1, U2, etc...)

Repeat until each intersection has one record (or you have identified those records as being identical).

Add each file, until all are added. Ensure that records are marked as to which file they come from.

Once all the files have been added, it's time to verify the quality of the sorting.

  • An intersection with a record from each file is probably good. In terms of manual verification, check these last.
  • An intersection lacking one or two records, but still having a lot of records, is probably okay; check these manually with medium priority.
  • An intersection with one or two records is probably not a good match, given how few agreements it has. There is a chance that the data was over-fitted, meaning that:
    • the sets contain a synonym: a keyword with two forms (eg: st and street, or rd and road).
    • two of the sets are not orthogonal and overlap somehow.
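Those priorities can be turned into a rough confidence rating, assuming you tagged each record with a `source` naming the file it came from (the thresholds here are arbitrary):

```python
def review_priority(records, n_files=3):
    """Rank an intersection for manual review by how many source files agree."""
    sources = {record["source"] for record in records}
    if len(sources) == n_files:
        return "low"      # a record from every file: probably a good match
    if len(records) > 2:
        return "medium"   # missing a file or two, but plenty of agreement
    return "high"         # one or two records: check for over-fitting first
```

This is also roughly the shape a "confidence level", as you asked about, would take: a score derived from how many independent sources ended up in the same intersection.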

You can make life significantly easier for yourself if you can find a canonical set of addresses.

This would allow you to set up those sets more easily, as every address is guaranteed to be unique, and it would serve as a check to verify whether you are missing information, or whether a record doesn't make sense (because it doesn't map to a known address).

You can also make your life easier by setting up canonical formats to extract information from an address for lookup in these sets. eg: 12 river st is in the format number name kind. This will work even better if you tailor the formats to the specific input files.
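That number name kind format can be sketched as a regex, with the kind alternatives drawn from the street-type set (the names here are made up for illustration):

```python
import re

STREET_TYPES = ["ST", "STREET", "RD", "ROAD"]

# "number name kind", eg: "12 RIVER ST"
NUMBER_NAME_KIND = re.compile(
    r"^(?P<number>\d+)\s+(?P<name>.+?)\s+(?P<kind>" + "|".join(STREET_TYPES) + r")$"
)


def parse(address):
    """Extract number/name/kind from an address, or None if it doesn't fit."""
    m = NUMBER_NAME_KIND.match(address.strip().upper())
    return m.groupdict() if m else None
```

A record that parses gives you clean values to look up in the sets; one that doesn't parse against any of your formats is itself a useful signal that it needs manual attention.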

Kain0_0
  • 16,154