Build and deploy the "free-text addresses ETL" #490
Comments
For information, here's @JeniT's work that I was talking about: https://github.com/theodi/parse-uk-addresses. We could combine this with the work we've already done in distiller to create a gem, which we can then expose through a web service. |
Just to note that I've heard criticism of that work and it might be worth discussing with experts rather than adopting it (or not - the criticism wasn't concrete iirc). |
About the algorithm used for parsing, my current hypothesis is to use the one that Fusion built into the Companies House ETL, not https://github.com/theodi/parse-uk-addresses. Of course we can frankenstein the more recent algorithm and the data sources it requires into the gem + web service approach. |
Ah, OK. It should be relatively easy to transfer that logic into Ruby. I'll have a little poke at it. |
I've done a skeleton app here: https://github.com/OpenAddressesUK/sorting_office. The hardest part of this is going to be extracting address parts from bare strings. Getting postcodes is easy (just a regex) and posttowns are pretty simple too (there are only a limited number), but localities and streets may be a little trickier. I can't see anything in the Companies House ETL that extracts address elements from bare strings, as my understanding is that the addresses are already split (albeit into Address 1, Address 2 etc.), but I could be wrong (my Python knowledge is pretty patchy too). |
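To illustrate the "easy" parts mentioned above, here is a minimal Ruby sketch of pulling a postcode out with a regex and matching a post town against a fixed list. This is not the code in sorting_office: the `POSTCODE` pattern is a simplified assumption that does not cover every valid UK postcode format, `POST_TOWNS` is a tiny hypothetical sample standing in for a real reference table, and the method names are made up for illustration.

```ruby
# Simplified UK postcode pattern (an assumption, not the regex used by sorting_office).
# Captures the outward code (e.g. "EC2A") and the inward code (e.g. "4JE").
POSTCODE = /\b([A-Z]{1,2}[0-9][0-9A-Z]?)\s*([0-9][A-Z]{2})\b/i

# Hypothetical sample; a real list would be loaded from a post town reference table.
POST_TOWNS = ["LONDON", "MANCHESTER", "NEWCASTLE UPON TYNE"].freeze

# Returns [normalised_postcode, address_without_postcode],
# or [nil, address] when no postcode is found.
def extract_postcode(address)
  match = address.match(POSTCODE)
  return [nil, address] unless match

  postcode = "#{match[1]} #{match[2]}".upcase
  remainder = (match.pre_match + match.post_match).strip
  [postcode, remainder]
end

# Returns the first known post town found in the address, or nil.
# Longer names are tried first so that multi-word towns are preferred.
def extract_post_town(address)
  upper = address.upcase
  POST_TOWNS.sort_by { |town| -town.length }.find do |town|
    upper =~ /\b#{Regexp.escape(town)}\b/
  end
end

postcode, rest = extract_postcode("65 Clifton Street, London EC2A 4JE")
puts postcode                 # the normalised postcode
puts extract_post_town(rest)  # the matched post town, if any
```

Localities and streets are deliberately left out here, since they need matching against much larger reference data and are the trickier part discussed in this thread.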
Presumably https://github.com/OpenAddressesUK/common-ETL/blob/master/address_lines.py is the relevant code in common-ETL. Looks like it's matching against the database of posttowns, OS locator, etc. |
Do we actually want to port this to Ruby, or shall we implement a wrapper in Python? What would porting give us, apart from being in a language we know better? If we port, then we won't be able to get updates from @MurrayData... |
One of the advantages would be that we can plug in the Mongoid models we have already and (assuming we're connected to the same database) return the URI of each address part too, which would be really useful. Also, it would be easier for us to maintain long term. |
Yeah, that's true. I wonder what @giacecco thinks. |
have pinged @MurrayData to see if he has any test data: OpenAddressesUK/common-ETL#10 |
Rough translation of the relevant parts of https://github.com/OpenAddressesUK/common-ETL/blob/master/CH_Bulk_Extractor.py: https://gist.github.com/Floppy/ac87f1e53142cea445a1 |
Now live and done at https://sorting-office.openaddressesuk.org/ |
Sweet :-D @pezholio I believe we need to do two more things I did not think of specifying before, though: a) all addresses being submitted should also be saved for our own use through #506 when it's ready (@peterkwells I guess we need to state that we're doing that in the instruction page one sees when calling the sorting office without parameters?), and b) we should return the provenance of the reference tables we use for the normalisation; by specifying some noprov option the user could turn that off. Do you need me to open a new issue for that? How many points? |
Yeah, I think we'll need a new issue for this to be honest. I think 2 points should be enough. |
Even before the new features, I put this back in testing because it looks too easy to break; check out OpenAddressesUK/sorting_office#4. |
If we were to automatically submit addresses to the platform because they were searched for then we'd need to present the user with the submissions guidelines as per the submit form. (If via API then we'd need that guidance to go up as part of our rules for implementing the API.) For our own website that feels like a thing where we need to consider the UX. How to get the submission guidelines in without damaging the search flow? Note that the search page already has a loop via the text (didn't find what you were looking for? why not submit an address) |
I've now fixed the breaking requests. I've moved this to done, and will now start looking at the extra features. |
The additional reqs are now at #509 . @peterkwells has a very good point, see how I managed that in the new specs. |