Instructions for data contributors #23

Closed
rufuspollock opened this issue May 4, 2013 · 22 comments
Comments

@rufuspollock
Member

This should probably go on the wiki once finished.

Fields

  • key names:
    • should be url suitable: alphanumeric + '-' only
    • use - rather than _
    • use abbreviations where appropriate
  • use iso formatted date / times
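The field rules above could be checked mechanically. What follows is only a minimal sketch: `make_key` and `is_valid_key` are hypothetical helper names, and lowercasing and accent stripping are assumptions, since the rules only ask for alphanumeric characters plus '-'.

```python
import re
import unicodedata
from datetime import datetime, timezone

# Lowercase alphanumeric groups separated by single hyphens (assumed convention).
KEY_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

def make_key(name: str) -> str:
    """Turn a body name into a URL-suitable key (hypothetical helper)."""
    # Decompose accented letters and drop the accents: "Ministère" -> "Ministere".
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Replace every run of other characters (spaces, '_', punctuation) with '-'.
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")

def is_valid_key(key: str) -> bool:
    """Check a key against the 'alphanumeric plus hyphen' rule."""
    return bool(KEY_PATTERN.match(key))

print(make_key("Ministry of Defence"))       # ministry-of-defence
print(is_valid_key("ministry_of_defence"))   # False: uses '_' rather than '-'

# ISO 8601 formatted timestamp for a date/time field.
print(datetime(2013, 5, 4, 12, 0, tzinfo=timezone.utc).isoformat())
```

Abbreviations are left to the contributor here; nothing in this sketch tries to shorten names automatically.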

To discuss

  • Do we need last modified and created?
  • Do we want both parent and parent_key?

What Public Bodies

  • National or local departments or agencies
  • (Probably) Not every school or fire station in existence.

Asides

  • Write up a description of the columns
@davidread

key - should be permanent, for humans (i.e. not hex)

last modified/created - I'm not sure what the use case for these is. I wonder if we can analyse git data to get these values automatically, rather than rely on editors to update them?

parent_key - I agree it seems to supersede parent. The presence of the 'parent' column may only be due to issues with the German data - e.g. "Heeresführungskommando (Deutschland)" being the parent of several departments, but not existing as an entry itself.

category - following the German example, this should be a 'type' of body in government, e.g. ministerial department, unitary authority, executive agency, NDPB, non-public. As mentioned elsewhere, it would be good to be able to list these for each country and give some definition. Since the threshold between public and non-public is often hazy, it would be good to include bodies in the grey area that, after consideration, have been classed as 'non-public'.

tags - are these for subjects, like 'health'? WDTK puts categories in here and I think we should use category instead.

jurisdiction - this is block filled with the country name. What's the thinking behind this?

email address - many bodies have a separate contact address for enquiries and FOI requests. If both are available, would the former be preferable?

source_url - ideally a unique URL, to allow matching in future. Can this be added for the WDTK data?

@rufuspollock
Member Author

@davidread like all of these. If you want to create an instructions patch to the README that would be swiftly merged. Still not sure of tags versus category versus type. I'd rather not have tags i think and just have e.g. type and category.

Happy to have source_url in but it should be real source url.

@jpmckinney

key - should be permanent, for humans (i.e. not hex)

Unfortunately, it's not possible for a public body identifier to be both permanent and for humans, because public bodies change names over time (unless you are willing to accept keys that no longer match the body's present name). There is unfortunately no property of a public body that does not change over time, so the only future-proof identification scheme is to use opaque, non-human-readable IDs.

@rufuspollock
Member Author

@jpmckinney agreed. In terms of opaque identifiers options are:

  • classic autoincrement (like it because short)
  • uuid (abbreviated?) easy to generate
    • full uuid is very long and somewhat offputting. Would like to limit it to 5-6 characters max (which maybe makes it less useful)
  • more complex: (along lines of entity id proposal cf Organisation identifiers (for discussion) #41) ocd:person:{jurisdiction}:{key}

wdyt?

@jpmckinney

  • Classic autoincrement requires more logic and makes distributed efforts more difficult as they would need to communicate to get the next identifier.
  • UUIDs are used to ensure uniqueness. Truncated UUIDs are much more likely to collide. A 6-character UUID gives you only ~16 million identifiers, which is less than the number of orgs in OpenCorporates.
  • In this case, something more complex like an entity resolution service is not needed. (Indeed, this repo is sort of acting as the "central authority" in that scenario.)
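To put rough numbers on the truncation point, here is a small sketch using the standard birthday-bound approximation. It assumes every hex character is uniformly random, which is a simplification (some bits of a real UUID are fixed), and the function names are made up for illustration.

```python
import math

def id_space(hex_chars: int) -> int:
    """Number of distinct identifiers with the given number of hex characters."""
    return 16 ** hex_chars

def collision_probability(n_ids: int, hex_chars: int) -> float:
    """Approximate chance that at least two of n_ids random IDs collide.

    Birthday-bound approximation: 1 - exp(-n(n-1) / (2N)).
    """
    space = id_space(hex_chars)
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))

print(id_space(6))  # 16777216, the "~16 million" mentioned above

# Collision risk for 6-character IDs at a few dataset sizes.
for n in (1_000, 10_000, 100_000):
    print(n, collision_probability(n, 6))
```

With only 10,000 six-character IDs, a collision is already more likely than not under this approximation, while the same calculation for 32 hex characters gives a vanishingly small value.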

A UUID is 36 characters long. I'd like to understand why identifiers must be short. Also, who is looking at these identifiers, and why would they find them offputting? Very few people should have to look at these identifiers, I think.

@davidread

It would be a great shame to lose human-readable URIs, because of the need for permanent URIs when coping with a name change. Perhaps just using redirects from the old name to the new name would solve the issue.

Human-readable URIs are a recommendation of this work, based on plenty of experience with maintaining URIs: https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/60975/designing-URI-sets-uk-public-sector.pdf

@jpmckinney

The trouble is that it's not clear that the "key" is going to be a URI. You cannot do redirects on strings. We would also need a commitment from OKF to offer the redirect service and to track the history of each organization. My proposal allows this project to keep the same scope as it currently has, without having to author any new database management tools.

Names are also not unique. How do you handle that?

The linked document has two principles, but no further content about these principles:

  • A URI structure MUST not contain anything that could change, such as session IDs
  • A URI path structure SHOULD be readable so that a human has a reasonable understanding of its contents

The first principle is a requirement, and seems to contradict using names in URIs, since names can change. The second is a recommendation. Yes, it's something to strive for, but there's an understanding that it's not always achievable.

@rufuspollock
Member Author

@jpmckinney short ids are easily usable in urls - which is a primary use case here. I understand the attraction of uuids but they are a real issue in urls from a ux perspective (which matters in my experience). Frankly I think one is flexible here so if e.g. the folks doing canada use uuids that would be fine but I think there is a reason for short and usable. I also wonder if say an org changes name we wouldn't want to create a new entry for them and mark the old entity as "inactive" or similar.

@jpmckinney

If an org changes name, there is no org that becomes inactive. That approach doesn't reflect reality. It's still the same org, and its identifier should not change. If the org changed logo, we wouldn't consider marking it inactive and creating a new one. The only reason that approach is under consideration is because there is an inclination towards using names in identifiers.

I don't see why an implementation (say, a web app) couldn't augment an organization's record with a human readable slug to be used in the URL. But I think in terms of coordinating efforts from various groups who care about these organizations, having a stable, unique, long-lasting identifier is preferable to all other options proposed thus far. At any rate, that's what Sunlight is moving forward with, and mySociety has adopted the same strategy.

I care about UX, but I don't see why it's important in terms of UX for the identifier for a record in a URL to be human-readable. Twitter uses 18-character machine IDs and yet millions of people use Twitter daily without getting incredibly confused and frustrated, e.g. https://twitter.com/OKFN/status/377804247284596737 Can you describe the issue?

@jpmckinney

For what it's worth, I tested my dataset for public bodies in Alberta, and the average length of a public body's name is 31 characters. The median is 27. So, brevity is definitely not the right criterion here (UUIDs are 36 chars, 32 if you choose to remove the dashes for brevity).

Update: In the current data in this repo, the average length key is 28 chars (median 26). I don't think we're arguing over +/- 5 characters.

@rufuspollock
Member Author

@jpmckinney it's not about brevity per se but about random-character brevity. 36-char (or 24-char) uuids are quite offputting.

To give some more context, the plan would have been to have:

{site}/{jurisdiction}/{id}/{name/title-as-hyphenated-name}

This is similar to stackoverflow or many other sites in having an id plus something readable ...
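A rough sketch of how such a pattern might be resolved, with the ID doing the lookup and the trailing name purely decorative. Everything concrete here is an assumption for illustration: the `gb` jurisdiction, the `1234` ID, the slugs, and the function names.

```python
from typing import NamedTuple, Optional

class BodyRef(NamedTuple):
    jurisdiction: str
    body_id: str
    slug: Optional[str]

# Hypothetical records: the ID is the stable key, the slug follows the current name.
CURRENT_SLUGS = {("gb", "1234"): "ministry-of-defence"}

def parse_path(path: str) -> BodyRef:
    """Split '/{jurisdiction}/{id}[/{slug}]' into its parts."""
    parts = [p for p in path.split("/") if p]
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected path: {path!r}")
    slug = parts[2] if len(parts) == 3 else None
    return BodyRef(parts[0], parts[1], slug)

def canonical_path(path: str) -> str:
    """Look up by ID only; whatever slug the request carried is ignored."""
    ref = parse_path(path)
    slug = CURRENT_SLUGS[(ref.jurisdiction, ref.body_id)]
    return f"/{ref.jurisdiction}/{ref.body_id}/{slug}"

# A link carrying an outdated (made-up) slug still resolves to the same record.
print(canonical_path("/gb/1234/war-office"))  # /gb/1234/ministry-of-defence
```

Under this design a name change only alters the decorative part of the URL, so a site could redirect old links to the canonical path without touching the ID.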

@jpmckinney

That URL pattern looks fine. I understand the desire for a short ID part, but I don't see why 36 chars is "offputting". Is there anything besides personal biases to justify a decision?

On any screen that would display a full URL to the user, a 36-char ID leaves plenty of room for readable strings.

(Except when the user is typing in a URL, mobiles and small tablets rarely display URLs in browsers, to save space and also because most URLs wouldn't fit on the screen anyway.)

@jpmckinney

[Image: screenshot, 2013-09-11 at 1:03 pm]

@rufuspollock
Member Author

@jpmckinney I understand what you say but just look at many "good" apps, whether it's trello, stackoverflow, datamapper etc. Short ids are attractive. I say this having myself frequently made your exact argument and implemented uuids in apps. My conclusion has been that it was generally a mistake and I would have preferred something shorter ...

@jpmckinney

Yes, but short IDs are also possible in those contexts. They have centralized database management/record creation - there is no need to guarantee the uniqueness of an identifier (which is the only reason UUIDs are being considered here). In "good" decentralized apps the IDs are very long, e.g. GitHub's 40-char commit IDs https://github.com/okfn/publicbodies/commit/24103cd2e17addab50d318a4fe62f9c63954d545

I don't know what apps you used UUIDs in, but if they were centralized, then yes, UUIDs were overkill.

@davidread

@jpmckinney The twitter example is not useful - of course a message doesn't have a readable slug - it has no title. And it's just for someone to click on - it is a URL not a URI. Indeed their usernames are readable.

You need to consider real use cases, such as someone putting together a CSV of say department budgets. e.g.

body,staff,budget
http://publicbodies.org/org/ministry-of-defence,164663,20800000000
http://publicbodies.org/org/department-of-justice,1035,235000000

It's just not best practice to use hex IDs when things have names. That's what we do in databases, and isn't useful when linking data. Yes readability requires a little thinking about lifecycle, but it's quite the norm for web data.

@jpmckinney

@davidread The Twitter example was just to show that brevity is not universally adopted when it comes to URIs. In any case, we've since established that URI-brevity is not the issue but rather random-character-brevity.

I would have assumed that you were already familiar, but just in case: it's not best practice to use something that changes as an ID. That's why people use autoincrement integers in database systems like MySQL as the primary key, instead of a person's name, for example, because a person's name can change.

A public body's name can also change. What do you do if it does? That's the question that so far has no good answers in the scenario where the name is used as the primary key. Responses to answers given so far:

  • Deleting and creating records due to a simple name change makes no more sense than deleting and creating records due to a simple logo change.

  • If you keep the same primary key when a name change occurs, then it is no longer in sync with the organization's name. I assume we want to use names in IDs so that people can (1) guess the ID of a public body and (2) read the ID and identify the public body. If the ID is not in sync with the name, then people (1) cannot guess the ID and (2) cannot read the ID and identify the public body.

    In other words, if you keep the same primary key, then in a matter of years those keys will be as uninformative as random strings - and certainly more confusing, as a person knows that a random string is not trying to communicate anything, but a person would try to guess what a human-readable string is trying to communicate.

  • If you change the primary key on the record itself when a name change occurs, then all foreign keys are broken. No one's proposed this yet, but just in case it was on anyone's mind.
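The cases above can be sketched with an in-memory SQLite database. The table layout, the budget amount and the name-based replacement key are invented for illustration, and the two names are borrowed from the Canadian example purely to show a rename; this is not the project's actual schema.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.execute("CREATE TABLE body (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE budget ("
    "body_id TEXT NOT NULL REFERENCES body(id), amount INTEGER NOT NULL)"
)

# An opaque primary key, with a record in another table pointing at it.
body_id = str(uuid.uuid4())
conn.execute("INSERT INTO body VALUES (?, ?)", (body_id, "Department of Natural Resources"))
conn.execute("INSERT INTO budget VALUES (?, ?)", (body_id, 1000))  # made-up amount

# A name change is an ordinary update; the key and every reference stay intact.
conn.execute("UPDATE body SET name = ? WHERE id = ?", ("Natural Resources Canada", body_id))
row = conn.execute(
    "SELECT body.name, budget.amount FROM budget JOIN body ON body.id = budget.body_id"
).fetchone()
print(row)  # ('Natural Resources Canada', 1000)

# Changing the primary key itself is refused while other rows still reference it.
try:
    conn.execute("UPDATE body SET id = ? WHERE id = ?", ("natural-resources-canada", body_id))
    key_change_rejected = False
except sqlite3.IntegrityError:
    key_change_rejected = True
print(key_change_rejected)
```

In a distributed CSV setting there is no database to refuse the change, so a renamed key would simply leave dangling references in other people's data.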

Linked data has no problem with non-human-readable URIs. I don't see what special challenge it creates. On the other hand, using names does have special challenges. For example, in Canada at the federal level, all major public bodies have four names, e.g.:

  • Department of Natural Resources
  • Ministère des Ressources naturelles
  • Natural Resources Canada
  • Ressources naturelles Canada

Now, maybe you arbitrarily decide that English is superior to French (however, Canada is in fact bilingual and no language takes precedence in the federal government). But you then still have two names. One is the actual legal name, which is rarely seen. The other is the name given by the Federal Identity Program, which is seen everywhere. Which name do you think people will assume matches the ID?

So, yes, I am thinking through real use cases and thinking about the long term.

@paultag

paultag commented Sep 12, 2013

So, just to speak a bit about why we used UUIDs for universal IDs:

Basically, it comes down to needing a way to create IDs (in a distributed way) with a very low chance of collision. The idea is that a given OCD ID shall be unique, but trying to get everyone to keep in sync with a central datastore is just untenable.

So, what do we have left? We have hashing enough things, and putting enough entropy into the hashing to end up with IDs that are as unique as we can get them - which is just re-inventing UUID.

UUIDs are fairly ubiquitous, and really easy to generate in most languages, so they were a pretty natural choice.

For things that can have a set of attributes that are unique enough (or centrally store-able, such as Geo-IDs), we use a non-opaque ID.

Keep in mind, you can still implement short IDs. In a complex world, base64-encode the UUID (or even hex-encode it; hex(int(uuid.uuid1())) isn't so bad!), drop the trailing == and add information in your slug. In a simple world, chop bits off (but that's not great). Either way, you can stuff whatever data you need into the slug for humans reading the URLs.
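A minimal sketch of the base64 idea, using only the standard library. The helper names `shorten` and `expand` are made up, and the URL-safe alphabet and random (version 4) UUIDs are assumptions, chosen so the result can sit in a path segment.

```python
import base64
import uuid

def shorten(u: uuid.UUID) -> str:
    """Encode the 16 raw bytes as URL-safe base64 and drop the '==' padding."""
    # 16 bytes always encode to 24 characters ending in '=='; stripping leaves 22.
    return base64.urlsafe_b64encode(u.bytes).decode("ascii").rstrip("=")

def expand(short: str) -> uuid.UUID:
    """Restore the padding and decode back to the original UUID."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(short + "=="))

u = uuid.uuid4()
short = shorten(u)
print(len(str(u)), len(u.hex), len(short))  # 36 32 22
print(expand(short) == u)                   # True: the short form is lossless
```

Unlike chopping bits off, this keeps the full identifier, so uniqueness is unaffected; the price is mixed-case strings that may also contain '-' and '_'.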

Anyway. I don't think the use of UUID is a huge concern, it provides a safe, well-tested (and well-adopted) way of generating a globally unique (read: 256 exabytes of space in the UUID range) ID.

@rufuspollock
Member Author

The base64 trick only gets you down to 24 chars (2/3, I agree, but still poor). I do hear the point about distributed generation, but I also know about the UX (which seems minor but ends up being major).

One alternative here is that groups get their own namespace (that's essentially what domain names do for uris / urls right ...). It would then be up to a given group to make sure they were unique in their namespace. So e.g. publicbodies would have:

/pb/{pb-id}

or if you want to go the url route we could use those as ids and have

publicbodies.org/id/...

@davidread

@jpmckinney I'm well aware of philosophy of primary keys at the database level - this I don't dispute. However, the web community simply prefers to expose real names. Yes of course there is a bit of faff - just look at Wikipedia's article name disambiguation, redirects etc - but the broad opinion is that readable names are worth it.

And choosing between French and English is hardly something I care much about - they are both pretty readable. Hex is a pain.

However, if you don't think it is something that you want us to add to this project, then I'll live with that. Hex IDs are better than no IDs.

@jpmckinney

@davidread Sure, you can sign OKF up for all that disambiguation, redirection, etc. I'd say it's too much "faff" for a simple project like this one; you say it's worth it. I understand that you don't care what language is chosen, but users from Canada will - I guess in your system you would just have more redirects to handle both languages.

@rgrp Can you point to any UX research regarding the point at which the number of random characters becomes a major UX issue? Also, I think we're just disagreeing as to the length of the random character string, not about whether to have opaque or readable URLs? (viz #23 (comment))

Anyway, I'll follow Rufus' suggestion and just use UUIDs for all of Canada.

@rufuspollock
Member Author

With the datapackage.json descriptions updated in #29 and likely no final consensus on ids ;-) I'm going to call this issue closed for the present. Feel free to re-open if you feel it is merited!


4 participants