CoL Identifier style #491

Closed
mdoering opened this issue Sep 23, 2019 · 22 comments
Comments

@mdoering
Member

mdoering commented Sep 23, 2019

For the primary objects (taxa, names, references) we should assign stable identifiers across released versions. Identifiers in the Clearinghouse and CoL are of type text/string and can in theory be anything we'd like them to be.

Integers offer a much smaller memory footprint and are useful for keeping all data in memory, e.g. when assembling. Identifiers will be considered unique within a dataset only and do not need to be globally unique like UUIDs or URIs. If a context requires them to be globally unique, they should either be prefixed by a col namespace, e.g. in a written publication, or be appended to a col base URI/resolver such as http://catalogueoflife.org/name/123456. The URI as such should not be used as the id in the strict sense, as that would prevent us from easily changing the URI/resolver/domain over time. http://catalogueoflife.org is a rather long domain already.

Having really short ids at hand is useful for humans to memorize and to print/render without taking up too much space. Encoding integers into a different numeral system with a higher base/radix would reduce string length, e.g. hexadecimal, 22 or 26 Latin characters plus 10 digits (i.e. latin32/latin36), or Base64, which is case sensitive and uses + and / to reach 64 unique characters. Examples:

| int (base10) | hex (base16) | latin29 (base29) | latin32 (base32) | latin36 (base36) | Base64 (base64) | proquint (32bit) |
|---|---|---|---|---|---|---|
| 18 | 12 | N | L | I | S | babab-babif |
| 1089 | 441 | 3BL | 343 | U9 | RB | babab-bidad |
| 1781089 | 1B2D61 | 4K2TW | 3QDD3 | 126AP | Gy1h | babir-fujod |
| 4781089 | 48F421 | 8S326 | 6KX33 | 2UH41 | SPQh | badam-zibod |
| 12781089 | C30621 | N43J8 | E83K3 | 7LXY9 | wwYh | bagag-bimod |
| 2147483647 | 7FFFFFFF | 5MQ9CB9 | 3ZZZZZZ | ZIK0ZJ | B///// | luzuz-zuzuz |

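
For illustration, the conversions in the table above can be reproduced with a minimal sketch like the following. The function name `encode` and the exact character order of `LATIN32` are assumptions inferred from the table rows, not taken from the CoL code base. Note that "Base64" here means positional notation over the 64-character Base64 alphabet, not the byte-oriented Base64 transfer encoding.

```python
import string

# Alphabets, lowest digit value first. The LATIN32 ordering is an assumption
# inferred from the example table (digits 2-9, then A-Z without I and O).
HEX = "0123456789ABCDEF"
LATIN32 = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ"
LATIN36 = string.digits + string.ascii_uppercase
BASE64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode(n: int, alphabet: str) -> str:
    """Write a non-negative integer in positional notation over `alphabet`."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    base = len(alphabet)
    if n == 0:
        return alphabet[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, base)
        digits.append(alphabet[rem])
    # Digits were collected least significant first, so reverse them.
    return "".join(reversed(digits))


if __name__ == "__main__":
    for name, alphabet in [("hex", HEX), ("latin32", LATIN32),
                           ("latin36", LATIN36), ("base64", BASE64)]:
        print(name, encode(1089, alphabet))
```

A larger alphabet shortens the string, but only slowly: going from base 32 to base 64 saves roughly one character in six.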
@timrobertson100
Contributor

timrobertson100 commented Sep 23, 2019

One issue to consider is that it can be difficult (impossible?) to distinguish between 0, O and 1, I, l etc. if encoded in a Latin character set, especially across different fonts. This can be avoided by removing certain characters (1, I, l, O, 0 etc.) from the palette.

There have been studies on readability for this kind of thing - personally I find it easiest with numbers (e.g. my bank account, IP addresses) rather than encoded versions (e.g. copying GBIF DOIs).

@mdoering
Member Author

I have added a Latin32 encoding that does not contain the ambiguous characters 1I0O.
Looks like a flight booking code now :)

@mdoering
Member Author

mdoering commented Sep 24, 2019

Pronounceable proquints are an interesting solution. The 32-bit version has a fixed length of two 5-character words, each consisting of a cvcvc sequence of consonants (c) and vowels (v).

Examples: lusab-babad, gutih-tugad, gutuk-bisog or mudof-sakat

A single 7-char word of the form cvcvcvc, e.g. gutukis, has 22 bits, i.e. about 4.2 million options; an 8-char word of the form cvcvcvcv, e.g. gutukiso, has 24 bits, enough for about 16.8 million.
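
As a rough sketch of how these numbers come about: the proquint scheme uses 16 consonants (4 bits each) and 4 vowels (2 bits each), so a cvcvc word carries 16 bits. The helper names `word16`, `proquint32` and `pattern_bits` below are made up for this illustration.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 16 consonants, 4 bits each
VOWELS = "aiou"                  # 4 vowels, 2 bits each


def word16(value: int) -> str:
    """Encode a 16-bit value as one cvcvc word (4+2+4+2+4 bits)."""
    if not 0 <= value < (1 << 16):
        raise ValueError("value must fit into 16 bits")
    return (
        CONSONANTS[(value >> 12) & 0xF]
        + VOWELS[(value >> 10) & 0x3]
        + CONSONANTS[(value >> 6) & 0xF]
        + VOWELS[(value >> 4) & 0x3]
        + CONSONANTS[value & 0xF]
    )


def proquint32(n: int) -> str:
    """Encode a 32-bit unsigned integer as two hyphen-separated words."""
    if not 0 <= n < (1 << 32):
        raise ValueError("value must fit into 32 bits")
    return word16(n >> 16) + "-" + word16(n & 0xFFFF)


def pattern_bits(pattern: str) -> int:
    """Number of bits carried by a pattern such as 'cvcvcvc'."""
    return sum(4 if ch == "c" else 2 for ch in pattern)


if __name__ == "__main__":
    print(proquint32(1089))
    for pattern in ("cvcvc", "cvcvcvc", "cvcvcvcv"):
        bits = pattern_bits(pattern)
        print(pattern, bits, 2 ** bits)
```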

@timrobertson100
Contributor

Suggest you plan for extensibility as 16M is not a lot.
<=5 letter groups are also easier to read than e.g. 7 character groups. Perhaps consider 2 groupings of 4 chars knowing you can grow to 2 groupings of 5 chars, and then 3 groupings of 4 etc?

@gdower
Contributor

gdower commented Sep 24, 2019

CoL has been criticized in the past for not having stable, resolvable IDs. I'd suggest adding a namespace prefix for the GSDs so that IDs are unique and could possibly be accessed by URL or with an ID resolver service.

I agree that it's important to not use ambiguous characters (0 O I 1, etc.) in case URLs are published in print publications.

I'd recommend not using proquints, because some of them could be confused with scientific names, and some shorter scientific names could even be coincidentally replicated as an ID for the wrong taxon (e.g. Biton velox as biton-velox). That will confuse people, especially if the wrong taxon page shows up as a Google search result because of the Biton velox keywords in the URL.

Matt suggested that we look at the PURL approach to decoupling IDs from resolvability. It also includes ID prefix name spaces.

@ayco-at-naturalis
Contributor

ayco-at-naturalis commented Sep 25, 2019 via email

@mdoering
Member Author

As much as I think PURLs do a good job compared to all the other global identifiers, I really do not want URLs to be the real identifier. We should definitely define a resolution service (the API is in fact one already), but the IDs themselves should be decoupled. As long as we have short and stable ids we can use them in any context and technology easily. We can provide a resolution service that returns JSON, HTML, LD, TXT or whatever comes next. But dealing with URLs locks you unnecessarily into a cage.

@ayco-at-naturalis the point of stable ids is that they do not change across versions unless they are really used for different things. @gdower I agree with you against proquints for that very reason. I was already wondering myself whether it's wise to have a well pronounceable string that ends up being used like names.

That, to me, leaves the options of classic integers or Latin32 with the ambiguous characters removed.

@mdoering
Member Author

@gdower obviously there is always the chance that some id will coincide with a real scientific genus name. ABIES exists in Latin32 and in any of the other encodings that include alphabetical characters.

@timrobertson100
Contributor

> I really do not want URLs to be the real identifier

+1
The key decision is what value should be held on the database table. That decision can be decoupled from external formatting of the ID (e.g. URL, URN etc), serialization formats (e.g. how content negotiation could/should be supported) and resolution (e.g. PURL, DOI, LSID).
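
A small sketch of that separation, assuming the stored key is the short string id: the class `NameId`, its method names and the `col` prefix are hypothetical, and the base URI is the example from the opening comment of this issue.

```python
from dataclasses import dataclass

# Example base URI taken from the opening comment; it may change over time,
# which is exactly why it is not part of the stored key.
BASE_URI = "http://catalogueoflife.org/name/"
PREFIX = "col"  # hypothetical namespace prefix for printed publications


@dataclass(frozen=True)
class NameId:
    """The value held in the database table: a short, stable string."""
    value: str

    def as_prefixed(self, prefix: str = PREFIX) -> str:
        """Render the id with a namespace prefix, e.g. for print."""
        return f"{prefix}:{self.value}"

    def as_uri(self, base: str = BASE_URI) -> str:
        """Render the id as a resolvable URI under the given base."""
        return base + self.value


if __name__ == "__main__":
    name_id = NameId("3BL")
    print(name_id.value)
    print(name_id.as_prefixed())
    print(name_id.as_uri())
```

Switching resolver or domain then only changes the `base` argument; the stored value stays the same.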

> I'd recommend not using proquints, because it seems like some of them could be confused as scientific names

Excellent point. Vernaculars too (blue-tit), and across languages (sol-sort, Danish for blackbird) etc.

@mdoering
Member Author

The latin32 encoding is the favorite at this point and will be implemented.

@mdoering
Member Author

We need to reopen the issue as there are two problems with using the latin32 charset to generate identifiers:

  1. we create offensive words like FUCK, ANAL, ARSE
  2. we generate organism names, notably genera, that are used as identifiers for other taxa, e.g. PUMA, CAREX, ABIES

Avoiding manually selected identifiers via a deny list is an option, but it will always miss some entries and is very difficult to maintain across languages. A simpler solution would be to drop all vowels, which are essential in any language to form words. Since I and O are already excluded from latin32, this removes only A, E and U, so we would end up with latin29 instead.

@mdoering mdoering reopened this Dec 11, 2020
@mdoering
Member Author

See also https://stackoverflow.com/questions/956556/is-it-irrational-to-sanitize-random-character-strings-for-curse-words

Apparently Microsoft omits the following from their product keys:

0 1 2 5 A E I O U L N S Z

@timrobertson100
Contributor

I read in various places that Microsoft drop 0 1 2 5 A E I O U L N S Z from their product keys. This may be safer than simply vowels.

@gdower
Contributor

gdower commented Dec 11, 2020

What if we put a number between every letter? I guess it's still potentially offensive though? 2F4U6C7K9 1F1U1C1K1

@mdoering
Member Author

Then things get much harder to en/decode. The beauty with just the alphabet is that you can easily convert back and forth to an integer. I want to keep that
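
To illustrate that round trip, here is a minimal sketch for latin32. The alphabet order is an assumption inferred from the example table at the top of this issue, and `encode`/`decode` are illustrative names rather than the actual CoL implementation.

```python
# Assumed latin32 alphabet: digits 2-9 followed by A-Z without I and O.
LATIN32 = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ"
INDEX = {ch: i for i, ch in enumerate(LATIN32)}
BASE = len(LATIN32)


def encode(n: int) -> str:
    """Integer to latin32 string."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = LATIN32[n % BASE]
    n //= BASE
    while n > 0:
        out = LATIN32[n % BASE] + out
        n //= BASE
    return out


def decode(text: str) -> int:
    """latin32 string back to integer; input is treated as case insensitive."""
    if not text:
        raise ValueError("empty identifier")
    n = 0
    for ch in text.upper():
        if ch not in INDEX:
            raise ValueError(f"character {ch!r} is not in the alphabet")
        n = n * BASE + INDEX[ch]
    return n


if __name__ == "__main__":
    for value in (18, 1089, 1781089, 2147483647):
        text = encode(value)
        print(value, text, decode(text))
```

Interleaving extra digits between the letters would break this simple positional mapping, since the string would no longer be a plain number in base 32.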

@mdoering
Member Author

> Microsoft drop 0 1 2 5 A E I O U L N S Z from their product keys

We already drop 0O and 1I in latin32. I reckon they decided to drop 5S and 2Z because they are hard to distinguish if a human with bad eyes needs to read the key. But LN? Dropping the vowels in addition to 0O and 1I, as in latin32, is safe and good enough.

@timrobertson100
Contributor

timrobertson100 commented Dec 11, 2020

Just to mention there are also IDs like PUMA, DUCK which are not ideal, and names like MATT that would also be removed by stripping vowels.

@olafbanki

Agree to stripping vowels. @mdoering if new IDs need to be re-issued this should be done at the earliest convenience before users start to use the new API more heavily.

@mdoering
Member Author

Agree. I will do this first thing on Monday then. Do you, @chantalhuijbers or @dhobern want to send out a quick communication that the IDs will have to be changed on Monday and should not be regarded as stable until then? (Temporary) blog post & API mailing list maybe? Or at least to Niels...

@olafbanki

Sounds good Markus, many thanks

@gdower
Contributor

gdower commented Dec 12, 2020

I might need to re-run conversion again to generate the new ID map. That would mean that the new ID mapping would be available sometime on Tuesday.

@mdoering
Member Author

mdoering commented Jan 21, 2021

We implemented what we now call LATIN29 in the code, i.e. LATIN32 minus the vowels, resulting in the following 29 case-insensitive chars:

23456789BCDFGHJKLMNPQRSTVWXYZ

Adding examples to the top list
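
As a quick sanity check, the 29 characters can be derived from the full 36-character set, and the names mentioned earlier in this thread can be shown to be unrepresentable. This is only a sketch; `expressible` is a made-up helper and not part of the CoL code.

```python
import string

BASE36 = string.digits + string.ascii_uppercase

# latin32: drop the ambiguous characters 0, 1, I and O.
LATIN32 = "".join(c for c in BASE36 if c not in "01IO")

# latin29: additionally drop the vowels. I and O are already gone,
# so only A, E and U are actually removed in this step.
LATIN29 = "".join(c for c in LATIN32 if c not in "AEIOU")


def expressible(word: str, alphabet: str = LATIN29) -> bool:
    """True if every character of `word` is part of `alphabet`."""
    return all(c in alphabet for c in word.upper())


if __name__ == "__main__":
    print(LATIN29, len(LATIN29))
    for word in ("PUMA", "CAREX", "ABIES", "DUCK", "MATT"):
        print(word, expressible(word))
```

Vowel-free strings such as abbreviations remain possible, so this reduces accidental words rather than ruling them out completely.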
