CoL Identifier style #491

mdoering · 2019-09-23T12:53:38Z

For the primary objects (taxa, names, references) we should assign stable identifiers across released versions. Identifiers in the Clearinghouse and CoL are of type text/string and can in theory be anything we'd like them to be.

Integers offer a much smaller memory footprint and are useful for keeping all data in memory, e.g. when assembling. Identifiers will be considered unique within a dataset only and do not need to be globally unique like UUIDs or URIs. If a context requires them to be globally unique, they should either be prefixed by a col namespace, e.g. in a written publication, or be appended to a col base URI/resolver like http://catalogueoflife.org/name/123456. The URI as such should not be used as the id in the strict sense, as that would prevent us from easily changing the URI/resolver/domain over time. http://catalogueoflife.org is a rather long domain already.
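A minimal sketch of keeping the stored id separate from its globally unique renderings. The function names and the `col:` prefix notation are assumptions for illustration; only the example URL pattern comes from the paragraph above.

```python
# Hypothetical helpers: the stored id stays a short local string,
# and the namespace prefix or resolver URL is only added on output.

RESOLVER_BASE = "http://catalogueoflife.org"  # from the example URL above; may change over time


def as_prefixed(local_id: str, prefix: str = "col") -> str:
    """Render an id with a namespace prefix, e.g. for print (notation assumed)."""
    return f"{prefix}:{local_id}"


def as_url(local_id: str, entity: str = "name", base: str = RESOLVER_BASE) -> str:
    """Render an id as a resolvable URL without storing the URL itself."""
    return f"{base.rstrip('/')}/{entity}/{local_id}"


print(as_prefixed("123456"))
print(as_url("123456"))
```

Because only the local id is stored, switching to another domain or resolver means changing `RESOLVER_BASE`, not the identifiers.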

Having really short ids at hand is useful for humans to memorize and to print/render without taking up too much space. Encoding integers in a numeral system with a higher base/radix reduces string length, e.g. hexadecimal, 22 or 26 latin characters plus 10 digits (i.e. latin 32/36), or BASE64, which is case sensitive and uses + and / to reach 64 unique characters. Examples:

int	hex	latin29	latin32	latin36	Base64	proquint
base10	base16	base29	base32	base36	base64	32bit
18	12	N	L	I	S	babab-babif
1089	441	3BL	343	U9	RB	babab-bidad
1781089	1B2D61	4K2TW	3QDD3	126AP	Gy1h	babir-fujod
4781089	48F421	8S326	6KX33	2UH41	SPQh	badam-zibod
12781089	C30621	N43J8	E83K3	7LXY9	wwYh	bagag-bimod
2147483647	7FFFFFFF	5MQ9CB9	3ZZZZZZ	ZIK0ZJ	B/////	luzuz-zuzuz
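The conversions in the table can be reproduced with plain positional encoding. The sketch below is not taken from the CoL code base; the function names are made up for illustration, and the 29-character alphabet is the one quoted in the final comment of this thread.

```python
# Positional (base-N) encoding of non-negative integers with a custom alphabet.

LATIN29 = "23456789BCDFGHJKLMNPQRSTVWXYZ"  # alphabet quoted later in this thread
HEX = "0123456789ABCDEF"


def encode(value: int, alphabet: str = LATIN29) -> str:
    """Encode a non-negative integer using the given alphabet as digits."""
    if value < 0:
        raise ValueError("only non-negative integers can be encoded")
    base = len(alphabet)
    digits = []
    while True:
        value, remainder = divmod(value, base)
        digits.append(alphabet[remainder])
        if value == 0:
            break
    return "".join(reversed(digits))


def decode(text: str, alphabet: str = LATIN29) -> int:
    """Decode a string back to its integer; upper-casing assumes a case-insensitive alphabet."""
    base = len(alphabet)
    value = 0
    for char in text.upper():
        value = value * base + alphabet.index(char)  # ValueError on unknown chars
    return value


for number in (18, 1089, 1781089):
    print(number, encode(number, HEX), encode(number))
```

The round trip is lossless, so integers can stay the internal representation while the short string is what gets exposed.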


timrobertson100 · 2019-09-23T13:24:08Z

One issue to consider is that it can be difficult (impossible?) to distinguish between 0, O and 1, I, l etc. when encoded in a latin character set, especially across different fonts. This can be avoided by removing certain characters (1, I, l, O, 0 etc.) from the palette.

There have been studies on readability for this kind of thing - personally I find it easiest with numbers (e.g. my bank account, IP addresses) rather than encoded versions (e.g. copying GBIF DOIs).

mdoering · 2019-09-24T11:36:23Z

I have added a Latin32 encoding that does not contain the ambiguous characters 1I0O.
Looks like a flight booking code now :)

mdoering · 2019-09-24T15:13:28Z

Pronounceable proquints are an interesting solution. The 32-bit version has a fixed length of two 5-character words, each a cvcvc sequence of consonants (c) and vowels (v).

Examples: lusab-babad, gutih-tugad, gutuk-bisog or mudof-sakat

A single 7-char word of the form cvcvcvc, e.g. gutukis, has 22 bits, i.e. about 4.2 million options. An 8-char word of the form cvcvcvcv, e.g. gutukiso, has 24 bits, enough for about 16 million.

timrobertson100 · 2019-09-24T16:01:55Z

Suggest you plan for extensibility as 16M is not a lot.
<=5 letter groups are also easier to read than e.g. 7 character groups. Perhaps consider 2 groupings of 4 chars knowing you can grow to 2 groupings of 5 chars, and then 3 groupings of 4 etc?
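To put numbers on the grouping idea, here is a rough sketch that splits an id into fixed-size groups and computes how many ids each layout can hold. The helper names are hypothetical, and a 32-character alphabet is assumed because that was the candidate at this point in the discussion.

```python
# Readable grouping of ids and the capacity of each grouping layout.


def group(identifier: str, size: int = 4, sep: str = "-") -> str:
    """Split an identifier into fixed-size groups, left to right."""
    return sep.join(identifier[i:i + size] for i in range(0, len(identifier), size))


def capacity(alphabet_size: int, groups: int, group_size: int) -> int:
    """Number of distinct ids for a given alphabet size and layout."""
    return alphabet_size ** (groups * group_size)


print(group("3QDD36KX"))   # two groups of four
print(capacity(32, 2, 4))  # 2 x 4 chars
print(capacity(32, 2, 5))  # 2 x 5 chars
print(capacity(32, 3, 4))  # 3 x 4 chars
```

Even the smallest layout, two groups of four characters, holds 32^8 (about 1.1 trillion) values, far beyond the 16 million of a single 24-bit word.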

gdower · 2019-09-24T22:22:01Z

CoL has been criticized in the past for not having stable, resolvable IDs. I'd suggest adding a namespace prefix for the GSDs so that IDs are unique and could possibly be accessed by URL or with an ID resolver service.

I agree that it's important to not use ambiguous characters (0 O I 1, etc.) in case URLs are published in print publications.

I'd recommend not using proquints, because it seems like some of them could be confused with scientific names, and some shorter scientific names could even be coincidentally replicated as an ID for the wrong taxon (e.g. Biton velox as biton-velox). That will confuse people, especially if the wrong taxon page shows up as a Google search result because of the Biton velox keywords in the URL.

Matt suggested that we look at the PURL approach to decoupling IDs from resolvability. It also includes ID prefix namespaces.

ayco-at-naturalis · 2019-09-25T06:40:30Z

We could make them (globally) unique across releases by including the release version in the ID


mdoering · 2019-09-25T07:53:23Z

As much as I think PURLs do a good job compared to all the other global identifiers, I really do not want URLs to be the real identifier. We should definitely define a resolution service (the API is in fact one already), but the IDs themselves should be decoupled. As long as we have short and stable ids we can use them in any context and technology easily. We can provide a resolution service that returns JSON, HTML, LD, TXT or whatever comes next. But dealing with URLs locks you unnecessarily into a cage.

@ayco-at-naturalis the point of stable ids is that they do not change across versions unless they are really used for different things. @gdower I agree with you against proquints for that very reason. I was already wondering myself whether it's wise to have a well pronounceable string that ends up being used like a name.

That to me leaves the options of classic integers or Latin32 with the ambiguous characters removed.

mdoering · 2019-09-25T07:55:13Z

@gdower obviously there is always the chance that some id will represent real scientific genera names. ABIES exists in Latin32 or any of the other ones that include alphabetical characters.

timrobertson100 · 2019-09-25T08:07:43Z

> I really do not want URLs to be the real identifier

+1
The key decision is what value should be held on the database table. That decision can be decoupled from external formatting of the ID (e.g. URL, URN etc), serialization formats (e.g. how content negotiation could/should be supported) and resolution (e.g. PURL, DOI, LSID).

> I'd recommend not using proquints, because it seems like some of them could be confused as scientific names

Excellent point. Vernaculars too (blue-tit) and across languages (sol-sort, Danish for blackbird) etc.

mdoering · 2020-02-24T10:04:27Z

The latin32 encoding is the favorite at this point and will be implemented

mdoering · 2020-12-11T16:56:45Z

We need to reopen this issue, as using the latin32 charset to generate identifiers has two problems:

- we create offensive words like FUCK, ANAL, ARSE
- we generate organism names, notably genera, that are used as identifiers for other taxa, e.g. PUMA, CAREX, ABIES

Skipping manually selected identifiers via a deny list is an option, but it will always miss some entries and is very difficult to maintain across languages. A simpler solution would be to drop all vowels, which are essential for forming words in any language. That is just 5 fewer chars, so we would end up with latin27 instead.

mdoering · 2020-12-11T16:58:59Z

See also https://stackoverflow.com/questions/956556/is-it-irrational-to-sanitize-random-character-strings-for-curse-words

Apparently Microsoft omits the following from their product keys:

0 1 2 5 A E I O U L N S Z

timrobertson100 · 2020-12-11T17:00:01Z

I read in various places that Microsoft drop 0 1 2 5 A E I O U L N S Z from their product keys. This may be safer than simply vowels.

gdower · 2020-12-11T17:05:09Z

What if we put a number between every letter? I guess it's still potentially offensive though? 2F4U6C7K9 1F1U1C1K1

mdoering · 2020-12-11T17:06:28Z

Then things get much harder to en/decode. The beauty with just the alphabet is that you can easily convert back and forth to an integer. I want to keep that

mdoering · 2020-12-11T17:07:57Z

> Microsoft drop 0 1 2 5 A E I O U L N S Z from their product keys

We already drop 0O and 1I in latin32. I reckon they decided to drop 5S and 2Z because they are hard to distinguish if a human with bad eyes needs to read the key. But LN? Dropping the vowels in addition to 0O and 1I, as in latin32, is safe and good enough.

timrobertson100 · 2020-12-11T17:28:32Z

Just to mention there are also IDs like PUMA, DUCK which are not ideal, and names like MATT that would also be removed by stripping vowels.

olafbanki · 2020-12-12T08:33:52Z

Agree to stripping vowels. @mdoering if new IDs need to be re-issued this should be done at the earliest convenience before users start to use the new API more heavily.

mdoering · 2020-12-12T11:51:41Z

Agree. I will do this first thing on Monday then. Do you, @chantalhuijbers or @dhobern want to send out a quick communication that the IDs will have to be changed on Monday and should not be regarded as stable until then? (temporary) blog post & API mailing list maybe? Or at least to Niels...

olafbanki · 2020-12-12T12:06:54Z

Sounds good Markus, many thanks

gdower · 2020-12-12T18:37:56Z

I might need to re-run conversion again to generate the new ID map. That would mean that the new ID mapping would be available sometime on Tuesday.

mdoering · 2021-01-21T10:09:12Z

We implemented what we now call LATIN29 in the code, i.e. LATIN32 minus the vowels, resulting in the following 29 case-insensitive chars:

23456789BCDFGHJKLMNPQRSTVWXYZ

Adding examples to the top list
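The alphabet above can be derived mechanically, which also makes it easy to check which words an alphabet is able to spell. This is only an illustrative sketch: the LATIN32 string is an assumption (digits and uppercase letters minus 0, 1, I and O, which is consistent with the latin32 column in the table at the top), and `can_spell` is a made-up helper.

```python
import string

# Start from the 36 digits and uppercase letters, then remove characters.
BASE36 = string.digits + string.ascii_uppercase
AMBIGUOUS = set("01IO")   # look-alike characters dropped for latin32
VOWELS = set("AEIOU")     # dropped in addition to avoid real words

LATIN32 = "".join(c for c in BASE36 if c not in AMBIGUOUS)  # assumed definition
LATIN29 = "".join(c for c in LATIN32 if c not in VOWELS)


def can_spell(word: str, alphabet: str) -> bool:
    """True if every character of the word is available in the alphabet."""
    return all(char in alphabet for char in word.upper())


print(LATIN29, len(LATIN29))
for word in ("PUMA", "CAREX", "DUCK", "MATT"):
    print(word, can_spell(word, LATIN32), can_spell(word, LATIN29))
```

Since I and O were already gone, removing the vowels only takes out A, E and U, which is why the result has 29 rather than 27 characters.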

mdoering added this to the Monthly Editions Live milestone Nov 5, 2019

mdoering modified the milestones: Monthly Editions Live, Extended Catalogue Build Dec 5, 2019

mdoering closed this as completed Feb 24, 2020

mdoering mentioned this issue Aug 23, 2020

Define objective rules for taxon concept identity CatalogueOfLife/general#6

Open

mdoering reopened this Dec 11, 2020

mdoering mentioned this issue Mar 11, 2021

LSIDs for taxonomic names live again tdwg/tnc#117

Open

mdoering closed this as completed Mar 11, 2021

mdoering mentioned this issue Mar 11, 2021

Create stable ids during CoL release #222

Closed
