I recently had to move house and found it tedious to check the broadband coverage of every potential place. For that reason, I want to build a website where you can check which broadband options are available at a specific address.
I have downloaded JSON documents describing the broadband internet options for a place from three different providers. Each provider has its own JSON format. The documents are currently stored in MongoDB. I have roughly 100k documents from each of the three providers.
Example JSON files:
Provider A:
{
"sid":"1734090001",
"keyword_1":"HK",
"keyword_2":"ABERDEEN",
"keyword_3":"ABBA HOUSE ABERDEEN MAIN RD 225",
"keyword_4":"BLK A"
}
Provider B:
{
"area":"HK",
"address":"BLOCK A ABBA HOUSE ABERDEEN HK HK",
"district":"ABERDEEN HK",
"latitude":"22.24822",
"address_short":"BLOCK A ABBA HOUSE ",
"address_en":"BLOCK A ABBA HOUSE ABERDEEN HK HK",
"longitude":"114.152852"
}
Provider C:
{
"AREA_CD":"HK",
"NAME_EN":"ABBA HOUSE BLOCK A",
"STREET_NAME_EN":"ABERDEEN MAIN ROAD",
"STREET_NUM":"225",
"area_desc_en":"HONG KONG",
"district_desc_en":"ABERDEEN",
"housing_addr_en":"ABBA HOUSE BLOCK A, 225 ABERDEEN MAIN ROAD",
"lat":"22.248245",
"lng":"114.15278"
}
For some addresses, matching is very easy because the description is exactly the same, down to the last letter. In those cases I just need to declare which field in one JSON format corresponds to which field in the others.
I started by writing a Django app with an address object and then wrote import routines for each JSON format. But I found this approach not very reliable because of different spellings, abbreviations, etc.
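To make the field-mapping idea concrete, here is a minimal sketch of how the three formats could be projected onto one common record. The common field names (`area`, `district`, `address`, `lat`, `lng`), the `FIELD_MAPS` table, and the `to_common` helper are assumptions made up for illustration; this is not the import code from the Django app.

```python
# Hypothetical common schema; the field names are illustrative only.
COMMON_FIELDS = ("area", "district", "address", "lat", "lng")

# For each provider, list the source fields that feed each common field.
# Several source fields are joined with a space (Provider A keeps the
# block in a separate keyword).
FIELD_MAPS = {
    "A": {"area": ["keyword_1"], "district": ["keyword_2"],
          "address": ["keyword_3", "keyword_4"]},
    "B": {"area": ["area"], "district": ["district"],
          "address": ["address_short"],
          "lat": ["latitude"], "lng": ["longitude"]},
    "C": {"area": ["AREA_CD"], "district": ["district_desc_en"],
          "address": ["housing_addr_en"],
          "lat": ["lat"], "lng": ["lng"]},
}


def to_common(provider, doc):
    """Project one provider document onto the common schema."""
    record = {}
    for field in COMMON_FIELDS:
        sources = FIELD_MAPS[provider].get(field, [])
        parts = [doc[s].strip() for s in sources if doc.get(s)]
        # Fields a provider does not supply (e.g. coordinates for A) stay None.
        record[field] = " ".join(parts) if parts else None
    return record


record_a = to_common("A", {
    "sid": "1734090001",
    "keyword_1": "HK",
    "keyword_2": "ABERDEEN",
    "keyword_3": "ABBA HOUSE ABERDEEN MAIN RD 225",
    "keyword_4": "BLK A",
})
```

A mapping like this only handles the easy case where the values are already identical; it does nothing about differing spellings or abbreviations.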
I have considered improving the matching with:
- PyPostal, a fast statistical parser/normalizer for street addresses anywhere in the world (https://github.com/openvenues/pypostal)
- Google Maps' Geocoding API or GeoPy (https://geopy.readthedocs.io/en/stable/)
I did some manual testing and found it hard to tell if there was a real improvement.
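One way to put a number on "is this better?" could be a simple similarity score computed before and after each normalization step. The sketch below uses only the standard library: it expands a few abbreviations and compares addresses as token sets (Jaccard similarity). The `ABBREVIATIONS` table and the `tokens` and `jaccard` helpers are hypothetical and deliberately tiny; this is a stand-in for illustration, not what PyPostal or a geocoder does.

```python
import re

# Assumed abbreviation table; a realistic one would be much larger.
ABBREVIATIONS = {"BLK": "BLOCK", "RD": "ROAD", "ST": "STREET"}


def tokens(text):
    """Upper-case, split on non-alphanumerics, expand known abbreviations."""
    words = re.findall(r"[A-Z0-9]+", text.upper())
    return {ABBREVIATIONS.get(word, word) for word in words}


def jaccard(a, b):
    """Token-set similarity in [0, 1]; word order is ignored."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


addr_a = "ABBA HOUSE ABERDEEN MAIN RD 225 BLK A"       # Provider A, keywords 3 + 4
addr_b = "BLOCK A ABBA HOUSE ABERDEEN HK HK"            # Provider B, address
addr_c = "ABBA HOUSE BLOCK A, 225 ABERDEEN MAIN ROAD"   # Provider C, housing_addr_en

score_ac = jaccard(addr_a, addr_c)  # identical token sets after expansion
score_bc = jaccard(addr_b, addr_c)  # B lacks street name and number
```

With the sample documents, A and C end up with identical token sets once `RD` and `BLK` are expanded, while B scores lower because it carries no street name or number. Scores like these could be tracked over a hand-labelled set of known matches to see whether a change actually helps.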
Can you think of better ways to match the JSON documents? I am not sure whether my approach was the right one, or whether there are better ways, for example ones that yield a confidence level for each match. How would you design an application like that? I read about Elasticsearch and it sounded interesting, but I have no experience with it yet.
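To illustrate what I mean by a confidence level: Providers B and C both ship coordinates, so a match score could blend text similarity with geographic distance. The following is only a sketch under assumptions of my own choosing: the 50 m cut-off, the 0.6 text weight, and the `confidence` function are made up, and the text similarity is passed in as a number between 0 and 1 from whatever comparison is used.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two points given in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def confidence(text_similarity, distance_m, max_distance_m=50.0, text_weight=0.6):
    """Blend text similarity (0..1) with a distance score.

    The distance score falls linearly from 1 at 0 m to 0 at max_distance_m.
    Both the cut-off and the weight are arbitrary assumptions.
    """
    distance_score = max(0.0, 1.0 - distance_m / max_distance_m)
    return text_weight * text_similarity + (1.0 - text_weight) * distance_score


# Coordinates from the Provider B and Provider C sample documents.
distance = haversine_m(22.24822, 114.152852, 22.248245, 114.15278)
score = confidence(text_similarity=0.5, distance_m=distance)
```

For the two sample documents the coordinates are only a few metres apart, so the distance term would push the score up even when the address strings differ noticeably. Provider A has no coordinates, so it would need geocoding first or a text-only score.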
Thanks!