fix ids for Swedish data #71

Closed
augusto-herrmann opened this issue Jan 19, 2016 · 10 comments · Fixed by #112

@augusto-herrmann
Collaborator

The Swedish data has blank cells in the 'id' column.
These should be filled with a code that combines the jurisdiction and a local identifier:

se/{local-id}

as in the example at http://data.okfn.org/data/okfn/public-bodies .

@todrobbins is working on generating slugs for the data.
Once #65 is solved, we can also add the official local ids, which @mattiasaxell mentioned having.

@augusto-herrmann
Collaborator Author

If you need help with the slug generation, just say so.

@augusto-herrmann added the Data (Data sources and ingestion automation) label Jan 19, 2016
@todrobbins
Contributor

@augusto-herrmann if you could take over, that would be great.

@augusto-herrmann
Collaborator Author

Ok, @todrobbins, I've got it for now.

Before considering this issue closed, @mattiasaxell please check if the generated ids are acceptable.

  • Slug generation maps each character to its closest Latin character, discarding diacritics.
  • For word tokenization I used Python's plain .split() function; I don't know whether this is acceptable for the Swedish language.
  • There are some duplicate ids (such as se/habo-kommun) that still need fixing. Is this correct in the source data, i.e., are these really two distinct public bodies?
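
For readers following along, the approach in the list above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for the commit: the function name `make_id` is hypothetical, and it assumes diacritics are dropped via Unicode decomposition and that tokens are joined with hyphens. Punctuation handling is left out.

```python
import unicodedata


def make_id(name, jurisdiction="se"):
    """Build an id such as 'se/habo-kommun' from an organization name.

    Hypothetical helper for illustration only.
    """
    # Decompose accented characters (e.g. 'å' -> 'a' + combining ring),
    # then drop the combining marks to keep the closest Latin letter.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Tokenize on whitespace with str.split() and join the words with hyphens.
    slug = "-".join(stripped.lower().split())
    return f"{jurisdiction}/{slug}"


print(make_id("Habo Kommun"))
print(make_id("Håbo Kommun"))
```

Both calls produce the same id, which is how a duplicate such as se/habo-kommun can arise from two differently spelled names.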

@mattiasaxell

@augusto-herrmann Great. I have checked, and I believe the split function is OK; it looks good to me at least. @peterk, do you know if Python's .split() function is OK for the Swedish language?

@augusto-herrmann They are correct. Duplicate ids like habo arise because there is both Habo Kommun (Municipality) and Håbo Kommun. I suggest changing Håbo to se/haabo-kommun.
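
One way the suggested fix could be expressed in code is a small override table applied before diacritics are stripped. This is only a sketch under assumptions: the names `OVERRIDES` and `make_id_with_overrides` are hypothetical, and only the 'å' to 'aa' mapping comes from the suggestion above; whether other letters need similar treatment is not settled in this thread.

```python
import unicodedata

# Only the 'å' -> 'aa' mapping is taken from the suggestion above;
# the table itself is a hypothetical illustration.
OVERRIDES = {"å": "aa", "Å": "Aa"}


def make_id_with_overrides(name, jurisdiction="se"):
    """Build an id, transliterating overridden letters before stripping diacritics."""
    # Compose first so that 'a' + combining ring is seen as a single 'å'.
    name = unicodedata.normalize("NFC", name)
    for src, dst in OVERRIDES.items():
        name = name.replace(src, dst)
    # Remaining accented letters fall back to their closest Latin letter.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return f"{jurisdiction}/" + "-".join(stripped.lower().split())


print(make_id_with_overrides("Habo Kommun"))
print(make_id_with_overrides("Håbo Kommun"))
```

With the override in place, the two municipalities no longer map to the same id.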

@todrobbins
Contributor

@augusto-herrmann @mattiasaxell 👍 This looks great. Thanks for solving this Augusto!

Pending @peterk's review, I think we're ready to merge.

@peterk

peterk commented Feb 6, 2016

@mattiasaxell @todrobbins split() is fine for word tokenization. Please note that you may end up with dupes if you do closest latin char substitution. Also I noticed there may be some need for data normalization in the url and org id fields.
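
A quick check for the duplicates mentioned here could look like the sketch below. It assumes the generated ids are available as a plain list of strings; the helper name `find_duplicate_ids` and the sample list are illustrative, not taken from the actual data file.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return the ids that occur more than once, sorted alphabetically."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Illustrative sample, not the real dataset.
sample_ids = ["se/habo-kommun", "se/habo-kommun", "se/rattshjalpsmyndigheten"]
print(find_duplicate_ids(sample_ids))
```

Running such a check after slug generation would flag collisions before they are committed.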

@todrobbins
Contributor

@augusto-herrmann @mattiasaxell @peterk I'm going to review the URL normalization and commit/merge accordingly.

@augusto-herrmann
Collaborator Author

One thing we should take into account when assigning ids based on an organization's name is how to keep track of changes in its name and structure (as discussed on #68). Just something to keep in mind. I'm having difficulties with this while trying to update the Brazilian data (#72).

@rufuspollock
Member

@augusto-herrmann good point. If the org name changes, is it still the same organization? My sense is that, in some sense, the answer is "no".

@augusto-herrmann
Collaborator Author

> @augusto-herrmann They are correct. Duplicate ids like habo arise because there is both Habo Kommun (Municipality) and Håbo Kommun. I suggest changing Håbo to se/haabo-kommun.

What about "Rättshjälpsmyndigheten", @mattiasaxell? I believe this line really is duplicated, considering that both entries share the same email address. They have different Boxes in the address field, but overall the second entry contains fewer empty fields. I propose we keep only the second line for now, until we find a way to update the data from other data sources.
