The regular expression will match a string like "10/example" or "10/10" (as in ten out of ten, great, A+), which is not a valid DOI (the prefix does not start with `10.`).
In general, cleanDOI() returns the first matching substring if any. I'm not sure whether this is the intended behaviour that we expect the caller to depend on. Its doc says "Strip info:doi prefix and any suffixes from a DOI", which isn't very clear to me.
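To make the two observations above concrete, here is a minimal sketch. `LOOSE_DOI` is a hypothetical stand-in for a pattern in which the `.NNNN` part of the prefix is optional (it is not copied from the referenced lines), and `STRICT_DOI` and `firstMatch` are likewise illustrative names, not existing Zotero code.

```javascript
// Hypothetical stand-in for the pattern under discussion: the ".NNNN"
// registrant part is optional, so a bare "10/" prefix is accepted.
const LOOSE_DOI = /10(?:\.[0-9]{4,})?\/[^\s]*[^\s.,]/;

// Hypothetical stricter variant: the prefix must be "10." plus 4 or more digits.
const STRICT_DOI = /10\.[0-9]{4,}\/[^\s]*[^\s.,]/;

// Return the first substring matching the pattern, or null if there is none.
function firstMatch(pattern, text) {
	const m = text.match(pattern);
	return m ? m[0] : null;
}

// A rating is mistaken for a DOI by the loose pattern only.
const looseRating = firstMatch(LOOSE_DOI, "I'd rate it 10/10");    // "10/10"
const strictRating = firstMatch(STRICT_DOI, "I'd rate it 10/10");  // null

// Only the first candidate is returned; later ones are silently dropped.
const firstOnly = firstMatch(LOOSE_DOI, "see 10.1000/abc and 10.2000/def"); // "10.1000/abc"

// The "info:doi:" prefix and a trailing period are stripped as a side effect
// of taking the matching substring.
const stripped = firstMatch(STRICT_DOI, "info:doi:10.1000/xyz123."); // "10.1000/xyz123"
```

Whether callers rely on the "first match wins" behaviour or on the loose prefix is exactly what would need checking before tightening the pattern.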
In the translators repository, there are currently 36 translators making use of cleanDOI. I haven't checked the other Zotero components. I think we need to check how many of those calls depend on the current behaviour, before we go on to improve the regexp here.
My own thoughts
In general, the best that any code here can realistically do is provide some reasonable baseline accuracy. There are bound to be false positives and false negatives when trying to "parse" or clean a DOI, because DOIs don't obey a grammar. The DOI spec basically says they can be anything (printable Unicode graphic characters). A DOI is explicitly said to be an opaque string, and nothing is supposed to be deduced from features of the string alone.
So for a general-purpose utility function like this, we need to set realistic expectations and clearly document exactly what it does. We may have to leave it to the callers (especially translators) to implement their own further processing. The reason is that, given the opaque and arbitrary nature of DOIs and the diverse ways in which they can appear, the best improvements will come not from the sophistication of a generic filter but from domain-specific knowledge, which is what goes into the individual translators.
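As a rough illustration of that division of labour, the sketch below layers a caller-side step on top of a generic extractor. Everything here is an assumption made for the example: `extractDOICandidate` is a hypothetical stand-in for the shared utility, and `doiFromArticleURL` imagines a site whose article URLs append `/full`, `/pdf`, or `/abstract` and a query string after the DOI.

```javascript
// Hypothetical generic baseline: return the first DOI-looking substring.
// It cannot know where the DOI ends, because "/" and "?" are legal in DOIs.
function extractDOICandidate(text) {
	const m = text.match(/10\.[0-9]{4,}\/[^\s]*[^\s.,]/);
	return m ? m[0] : null;
}

// Hypothetical caller-side step for one imagined site. The knowledge that the
// query string and the trailing path segment are not part of the DOI is
// site-specific and would not be safe to build into the generic function.
function doiFromArticleURL(url) {
	const candidate = extractDOICandidate(decodeURIComponent(url));
	if (!candidate) return null;
	return candidate
		.replace(/[?#].*$/, "")                   // drop query string / fragment
		.replace(/\/(?:full|pdf|abstract)$/, ""); // drop the site's view suffix
}

const url = "https://example.org/doi/10.1234%2Fabc.5678/full?src=rss";
const generic = extractDOICandidate(decodeURIComponent(url)); // "10.1234/abc.5678/full?src=rss"
const specific = doiFromArticleURL(url);                      // "10.1234/abc.5678"
```

The generic result is a reasonable baseline but over-captures; only the caller knows enough about its source to trim it correctly.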
But we need to check how many of the translators calling cleanDOI agree with this...
Code referenced: utilities/utilities.js, lines 415 to 426 in b93f16d