CoL Identifier style #491
Comments
One issue to consider is that it can be difficult (impossible?) to distinguish between 0/O and 1/I/l etc. when encoded in a Latin character set, especially across different fonts. This can be avoided by removing certain characters (1, I, l, O, 0 etc.) from the palette. There have been studies on readability for this kind of thing. Personally I find it easiest with numbers (e.g. my bank account, IP addresses) rather than encoded versions (e.g. copying GBIF DOIs). |
I have added a |
Pronounceable proquints are an interesting solution. The 32-bit version has a fixed length of two 5-character words, each consisting of a cvcvc sequence of consonants (c) and vowels (v). Examples: A single 7-character word of the form cvcvcvc, e.g. |
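For reference, the two-word cvcvc scheme described above can be sketched in a few lines of Python. This is only an illustration, not anything CoL implements: the consonant and vowel tables are assumed from the published proquint proposal, and the function names `quint` and `proquint32` are made up here. The single-word cvcvcvc variant is not covered.

```python
# Letter tables assumed from the public proquint proposal (not from this thread).
CONSONANTS = "bdfghjklmnprstvz"  # 16 consonants, 4 bits each
VOWELS = "aiou"                  # 4 vowels, 2 bits each


def quint(word16: int) -> str:
    """Encode one 16-bit value as a consonant-vowel-consonant-vowel-consonant word."""
    return (
        CONSONANTS[(word16 >> 12) & 0xF]
        + VOWELS[(word16 >> 10) & 0x3]
        + CONSONANTS[(word16 >> 6) & 0xF]
        + VOWELS[(word16 >> 4) & 0x3]
        + CONSONANTS[word16 & 0xF]
    )


def proquint32(n: int) -> str:
    """Encode an unsigned 32-bit integer as two cvcvc words joined by a hyphen."""
    if not 0 <= n < 2**32:
        raise ValueError("expected an unsigned 32-bit integer")
    return quint(n >> 16) + "-" + quint(n & 0xFFFF)


if __name__ == "__main__":
    print(proquint32(0x7F000001))
```

Each word carries exactly 16 bits (4+2+4+2+4), which is why the 32-bit form always has two words of five letters.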
Suggest you plan for extensibility as 16M is not a lot. |
CoL has been criticized in the past for not having stable, resolvable IDs. I'd suggest adding a namespace prefix for the GSDs so that IDs are unique and could possibly be accessed by URL or with an ID resolver service. I agree that it's important not to use ambiguous characters (0 O I 1, etc.) in case URLs are published in print publications. I'd recommend not using proquints, because it seems like some of them could be confused with scientific names, and some shorter scientific names could even be coincidentally replicated as an ID for the wrong taxon (e.g. Biton velox as Matt suggested that we look at the PURL approach to decoupling IDs from resolvability. It also includes ID prefix namespaces. |
We could make them (globally) unique across releases by including the
release version in the ID
On Mon, 23 Sep 2019 at 14:53, Markus Döring wrote:
For the primary objects (taxa, names, references) we should assign as much
as possible stable identifiers across released versions. Identifiers in the
Clearinghouse and CoL are of type text/string and can in theory be anything
we'd like them to be.
Integers offer a much smaller memory footprint and are really useful for
keeping all data in memory. Identifiers will be considered unique within a
dataset only and do not need to be globally unique like UUIDs or URIs. If they
ought to be globally unique, they should either be prefixed by a fake
namespace col: if the context is clear, e.g. in a written publication, or
be added to a col base URI/resolver. The URI as such should not be used as
the id in the strict sense, so that we can change the
URI/resolver/domain over time easily. http://catalogueoflife.org is a
rather long domain already.
Having really short ids at hand is useful for humans to memorize and to
print/render without taking up too much space. Encoding integers into a
different numerical system with a higher base/radix would reduce string
length, e.g. hexadecimal, all 26 latin characters plus 10 numbers or
BASE64 <https://en.wikipedia.org/wiki/Base64#Base64_table> which is all
case sensitive latin chars plus + and /.
| int (base 10) | hex (base 16) | latin+num (base 36) | Base64 (base 64) |
| -- | -- | -- | -- |
| 1089 | 441 | U9 | h1 |
| 1781089 | 1B2D61 | 126AP | 6ORx |
| 12781089 | C30621 | 7LXY9 | MMox |
Recommendation is to use integers for internal calculations and expose
them as BASE64 strings using - and _ instead of + and / so they do not
need any URL encoding.
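A rough Python sketch of the radix idea above, in case it helps: one generic `encode` function and a few alphabets. The function and constant names are hypothetical, and the ordering of the URL-safe 64-character alphabet is an assumption (A-Z, a-z, 0-9, then - and _), so it is not expected to reproduce the Base64 column of the table, whose character ordering is not stated.

```python
import string


def encode(n: int, alphabet: str) -> str:
    """Write a non-negative integer in base len(alphabet) using the given digits."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    base = len(alphabet)
    if n == 0:
        return alphabet[0]
    digits = []
    while n:
        n, remainder = divmod(n, base)
        digits.append(alphabet[remainder])
    return "".join(reversed(digits))


HEX = "0123456789ABCDEF"
BASE36 = string.digits + string.ascii_uppercase
# Assumed ordering for a URL-safe 64-character alphabet; '-' and '_' replace '+' and '/'.
URLSAFE64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"

if __name__ == "__main__":
    for value in (1089, 1781089, 12781089):
        print(value, encode(value, HEX), encode(value, BASE36), encode(value, URLSAFE64))
```

The hex and base 36 results match the table; the higher the base, the shorter the string for the same integer.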
--
Kind regards,
Ayco Holleman
|
As much as I think PURLs do a good job compared to all the other global identifiers, I really do not want URLs to be the real identifier. We should definitely define a resolution service (the API is in fact one already), but the IDs themselves should be decoupled. As long as we have short and stable ids we can use them in any context and technology easily. We can provide a resolution service that returns JSON, HTML, LD, TXT or whatever comes next. But dealing with URLs locks you unnecessarily into a cage. @ayco-at-naturalis the point of stable ids is that they do not change across versions unless they are really used for different things. @gdower I agree with you against proquints for that very reason. I was already wondering myself whether it's wise to have a well pronounceable string that ends up being used like names. That to me leaves the options of classic integers or Latin32 with the ambiguous characters removed |
@gdower obviously there is always the chance that some id will coincide with a real scientific genus name. ABIES exists in Latin32 or any of the other encodings that include alphabetical characters. |
+1
Excellent point. Vernaculars too ( |
The latin32 encoding is the favorite at this point and will be implemented |
We need to reopen the issue as there are two issues with the latin32 charset to generate identifiers:
Avoiding identifiers from a manually maintained deny list is an option, but it will always miss some entries and is very difficult to maintain across languages. A simpler solution would be to drop all vowels, which are essential in any language to form words. That is just 5 characters fewer, so we would end up with |
Apparently Microsoft omits the following from their product keys:
|
I read in various places that Microsoft drop |
What if we put a number between every letter? I guess it's still potentially offensive though? |
Then things get much harder to en/decode. The beauty of using just the alphabet is that you can easily convert back and forth to an integer. I want to keep that. |
We already drop 0O and 1I in latin32. I reckon they decided to drop 5S and 2Z because they are hard to distinguish if a human with bad eyes needs to read the key. But LN? Dropping the vowels in addition to 0O and 1I as in latin32 is safe and good enough. |
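To illustrate the "convert back and forth to an integer" point with a vowel-free alphabet, here is a minimal Python sketch. The exact character set is an assumption made for this example (digits 2-9 plus the uppercase consonants, i.e. no 0, 1 or vowels); the thread does not spell out the final alphabet, and the names `to_id` and `from_id` are hypothetical.

```python
# Assumed alphabet for illustration only: no 0/O, no 1/I, no vowels.
ALPHABET = "23456789BCDFGHJKLMNPQRSTVWXYZ"
BASE = len(ALPHABET)
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def to_id(n: int) -> str:
    """Encode a non-negative integer key as a short vowel-free string."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, remainder = divmod(n, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def from_id(identifier: str) -> int:
    """Decode an identifier back to its integer; raises KeyError on foreign characters."""
    n = 0
    for ch in identifier.upper():
        n = n * BASE + INDEX[ch]
    return n


if __name__ == "__main__":
    key = 12781089
    identifier = to_id(key)
    print(identifier, from_id(identifier) == key)
```

Because the mapping is a plain positional number system, no lookup table or deny list is needed to go from id to integer and back.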
Agree to stripping vowels. @mdoering if new IDs need to be re-issued this should be done at the earliest convenience before users start to use the new API more heavily. |
Agree. I will do this first thing on Monday then. Do you, @chantalhuijbers or @dhobern, want to send out a quick communication that the IDs will have to be changed on Monday and should not be regarded as stable until then? A (temporary) blog post and the API mailing list maybe? Or at least to Niels... |
Sounds good Markus, many thanks |
I might need to re-run conversion again to generate the new ID map. That would mean that the new ID mapping would be available sometime on Tuesday. |
we implemented what we call now
Adding examples to the top list |
For the primary objects (taxa, names, references) we should assign stable identifiers across released versions. Identifiers in the Clearinghouse and CoL are of type text/string and can in theory be anything we'd like them to be.
Integers offer a much smaller memory footprint and are useful for keeping all data in memory, e.g. when assembling. Identifiers will be considered unique within a dataset only and do not need to be globally unique like UUIDs or URIs. If a context mandates them to be globally unique, they should either be prefixed by a `col` namespace, e.g. in a written publication, or be added to a col base URI/resolver like http://catalogueoflife.org/name/123456. The URI as such should not be used as the id in the strict sense, as that would prevent us from changing the URI/resolver/domain over time easily. http://catalogueoflife.org is a rather long domain already.
Having really short ids at hand is useful for humans to memorize and to print/render without taking up too much space. Encoding integers into a different numerical system with a higher base/radix would reduce string length, e.g. hexadecimal, 22 or 26 latin characters plus 10 numbers (i.e. latin 32/36) or BASE64, which is case sensitive and uses `+` and `/` to reach 64 unique characters. Examples: