This repository has been archived by the owner on May 28, 2024. It is now read-only.

Build and deploy the "free-text addresses ETL" #490

Closed
5 tasks
giacecco opened this issue Jan 5, 2015 · 18 comments

Comments

@giacecco

giacecco commented Jan 5, 2015

  • Rewrite the algorithm that interprets free-text addresses, currently part of the Companies House ETL, as a reusable software component.
  • Make it available internally to OA...
  • ... and externally as an API, e.g. to be used by the upcoming pure JavaScript ETL (AKA the "pizza delivery" scenario: lightweight client-side integration).
  • Link the OA website to the new component, so that addresses submitted interactively by users are actually processed rather than simply stored for later.
  • After the Turbot decision is taken, modify the Companies House ETL to use the new component
@giacecco giacecco added this to the Sprint #39 milestone Jan 5, 2015
@giacecco giacecco changed the title Free-text ETL Build and deploy the "tree-text addresses ETL" Jan 5, 2015
@giacecco giacecco changed the title Build and deploy the "tree-text addresses ETL" Build and deploy the "free-text addresses ETL" Jan 5, 2015
@giacecco giacecco added 3pt and removed 2pt labels Jan 5, 2015
@pezholio
Collaborator

pezholio commented Jan 9, 2015

For information, here's @JeniT's work that I was talking about:

https://github.com/theodi/parse-uk-addresses

We could use a combination of this and the work we've already done in distiller to create a gem, which we can then expose using a web service.

@JeniT
Member

JeniT commented Jan 9, 2015

Just to note that I've heard criticism of that work and it might be worth discussing with experts rather than adopting it (or not - the criticism wasn't concrete iirc).

@giacecco
Author

giacecco commented Jan 9, 2015

About the algorithm used for parsing, my current hypothesis is to use the one that Fusion built into the Companies House ETL, not https://github.com/theodi/parse-uk-addresses . Of course we can frankenstein the more recent algorithm and the data sources it requires into the gem + web service approach.

@pezholio
Collaborator

pezholio commented Jan 9, 2015

Ah, OK. It should be relatively easy to transfer that logic into Ruby. I'll have a little poke at it.

@pezholio
Collaborator

I've done a skeleton app here:

https://github.com/OpenAddressesUK/sorting_office

The hardest part of this is going to be extracting address parts from bare strings. Getting postcodes is easy (just a regex) and posttowns are pretty simple too (there are only a limited number), but localities and streets may be a little trickier. I can't see anything in the Companies House ETL that extracts address elements from bare strings, as my understanding is that the addresses are already split (albeit into Address 1, Address 2 etc), but I could be wrong (my Python knowledge is pretty patchy too)
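To illustrate the "easy" half of the problem described above, here is a minimal Python sketch that pulls a postcode out with a regex and a post town out with a lookup against a fixed list. It is not the code in sorting_office or common-ETL: the `POSTCODE_RE` pattern is a simplified assumption that ignores special-case postcodes, `POST_TOWNS` is a hypothetical three-entry sample standing in for the real reference table, and `extract_parts` is a name made up for this example. Whatever is left over in `rest` is the hard part (localities and streets).

```python
import re

# Simplified UK postcode pattern (assumption for illustration only;
# it does not handle every special-case postcode).
POSTCODE_RE = re.compile(
    r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\b", re.IGNORECASE
)

# Hypothetical sample; a real service would load the full post town list.
POST_TOWNS = ("LEEDS", "LONDON", "MANCHESTER")


def extract_parts(address):
    """Split a bare address string into postcode, post town and the rest."""
    postcode = None
    remainder = address

    match = POSTCODE_RE.search(remainder)
    if match:
        # Normalise to upper case with a single space between the two halves.
        postcode = f"{match.group(1)} {match.group(2)}".upper()
        remainder = remainder[:match.start()] + remainder[match.end():]

    town = None
    for candidate in POST_TOWNS:
        found = re.search(rf"\b{re.escape(candidate)}\b", remainder, re.IGNORECASE)
        if found:
            town = candidate
            remainder = remainder[:found.start()] + remainder[found.end():]
            break

    # Whatever remains (street, locality, house number) is left unparsed.
    rest = " ".join(remainder.replace(",", " ").split())
    return {"postcode": postcode, "town": town, "rest": rest}


if __name__ == "__main__":
    print(extract_parts("10 Example Street, Leeds LS1 4AP"))
```

A plain substring lookup like this would misfire on streets named after towns (for example a "London Road" outside London), which is one reason the remaining parts are trickier than the postcode.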

@Floppy

Floppy commented Jan 13, 2015

Presumably https://github.com/OpenAddressesUK/common-ETL/blob/master/address_lines.py is the relevant code in common-ETL. Looks like it's matching against the database of posttowns, OS locator, etc.

@Floppy

Floppy commented Jan 13, 2015

Do we actually want to port this to Ruby, or shall we implement a wrapper in Python? What would porting give us, apart from being in a language we know better? If we port, then we won't be able to get updates from @MurrayData...

@pezholio
Collaborator

One of the advantages would be that we can plug in the Mongoid models we have already and (assuming we're connected to the same database) return the URI of each address part too, which would be really useful. Also, it would be easier for us to maintain long term.

@Floppy

Floppy commented Jan 13, 2015

Yeah, that's true. I wonder what @giacecco thinks.

@Floppy

Floppy commented Jan 14, 2015

Have pinged @MurrayData to see if he has any test data: OpenAddressesUK/common-ETL#10

@Floppy

Floppy commented Jan 14, 2015

@pezholio
Collaborator

Now live and done at https://sorting-office.openaddressesuk.org/

@giacecco
Author

Sweet :-D

@pezholio I believe we need to do two more things I did not think of specifying before, though:

a) all addresses being submitted should also be saved for our own use through #506 when it's ready (@peterkwells I guess we need to state that we're doing that, in the instruction page one sees when calling the sorting office without parameters?), and

b) we should return the provenance of the reference tables we use for the normalisation; by specifying some noprov option the user could turn that off
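As a rough illustration of point (b), the Python sketch below shows one way a response could carry provenance by default and drop it when the caller asks. Everything here is hypothetical: `build_response`, the `noprov` keyword and the shape of the provenance records are made up for the example and are not the Sorting Office API.

```python
def build_response(parts, provenance, noprov=False):
    """Return the normalised address parts, plus provenance unless noprov is set.

    parts      -- dict of normalised address elements (hypothetical shape)
    provenance -- list of dicts describing the reference tables used
    noprov     -- when True, omit the provenance section entirely
    """
    response = {"address": dict(parts)}
    if not noprov:
        response["provenance"] = list(provenance)
    return response


if __name__ == "__main__":
    parts = {"postcode": "LS1 4AP", "town": "LEEDS"}
    # Placeholder provenance records; real ones would describe the actual tables.
    provenance = [{"table": "post_towns"}, {"table": "postcodes"}]
    print(build_response(parts, provenance))
    print(build_response(parts, provenance, noprov=True))
```

Making provenance opt-out rather than opt-in matches the wording of the request: it is returned unless the user turns it off.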

Do you need me to open a new issue for that? How many points?

@pezholio
Collaborator

Yeah, I think we'll need a new issue for this, to be honest. I think 2 points should be enough.

@giacecco
Author

Even before the new features, I put this back in testing because it looks too easy to break; check out OpenAddressesUK/sorting_office#4.

@peterkwells

If we were to automatically submit addresses to the platform because they were searched for then we'd need to present the user with the submissions guidelines as per the submit form. (If via API then we'd need that guidance to go up as part of our rules for implementing the API.)

For our own website that feels like a thing where we need to consider the UX. How to get the submission guidelines in without damaging the search flow?

Note that the search page already has a loop via the text (didn't find what you were looking for? why not submit an address)

@pezholio
Collaborator

I've now fixed the breaking requests. I've moved this to done, and will now start looking at the extra features.

@giacecco
Author

The additional reqs are now at #509 . @peterkwells has a very good point, see how I managed that in the new specs.
