Creating a blacklisting certain characters from variable and attribute names #323

larsbarring · 2024-05-31T09:55:25Z

larsbarring
May 31, 2024
Collaborator

Topic for discussion

In #237 it was suggested to substantially relax restrictions on which characters are allowed in variable and attribute names. The conversation is still ongoing and sprinkled in various comments there are examples of characters that should not be allowed, either because they have special meaning in the context of CF or netCDF as such, or otherwise identified as causing problems.

I suggest that we amend the text in section 2.3 to list which character and character ranges CF explicitly disallows, i.e. creating a blacklist. While it may not be possible to identify all characters that should be in such a list (it may even evolve over time) I think that it is helpful to identify those characters that we now know belong to such a list.

So, far I believe the following have been identified from the standard ASCII character set: <space>, control characters (decimal 0 ... 31, 127), /, :, \. This blacklist should probably be expanded to also include Unicode control and whitespace ~~and underscore~~ characters.

I addition, double underscores __ have special meaning in relation to OGC netCDF-LD, specifically for prefixes, and should be mentioned as reserved for that purpose to not create interoperability clashes.

davidhassell · 2024-05-31T11:19:13Z

davidhassell
May 31, 2024
Maintainer

This blacklist should probably be expanded to also include Unicode controls, whitespaces and underscore characters.

Surely we don't want to disallow underscores!

3 replies

larsbarring May 31, 2024
Collaborator Author

Of course not!!!! My mistake (now fixed)

efisher008 Jun 13, 2024
Maintainer

Are (en) dashes/hyphens - supported in variable and attribute names? Should this character be included in the list? Apparently em dashes — are not standard ASCII characters, so probably that does not need to be specified if names are ASCII-only by default.

sethmcg Jun 13, 2024
Collaborator

In Issue #477 we decided to allow ASCII period and ASCII hyphen in attribute names only.

So either there will need to be two lists, or the list will need to be structured to allow for differences in different contexts.

ChrisBarker-NOAA · 2024-06-13T18:39:08Z

ChrisBarker-NOAA
Jun 13, 2024
Collaborator

Hmm -- I like this idea. But first i think we should make clear what the (long term) goal is:

Unicode is very complex, with a lot of subtleties -- There are efforts to manage that with normalization (https://www.unicode.org/reports/tr15/), and categorization of code points. (General Category. Partition of the characters into major classes such as letters, punctuation, and symbols, and further subclasses for each of the major classes.) Etc.

So I think we have essentially three options:

Stick with ASCII -- and maybe add some extras (Latin1?) - this is not great -- really doesn't allow real internationalization -- I think there's general consensus not to do that.
Use the Unicode categorization to restrict allowable characters -- there are a manageable number of such categories (30-ish).
Allow any Unicode code point, except for a defined blacklist (that's what this discussion is about)

I think the whole point of this discussion is that we don't want to do (1) anymore.

for (2) -- it seems appealing, but there's a lot of complexity, e.g. (from the Unicode spec)

Similarly, characters whose General_Category identifies them primarily as a symbol or as a
mathematical symbol may function in other contexts as punctuation or even paired punctuation. The most obvious such case is for U+003C “<” less-than sign and U+003E “>”
greater-than sign. These are given the General_Category gc = Sm because their primary
identity is as mathematical relational signs. However, as is obvious from HTML and XML,
they also serve ubiquitously as paired bracket punctuation characters in many formal syntaxes.

So it can get messing. Nevertheless, there is precedent -- for instance, Python has the following rules:

https://docs.python.org/3/reference/lexical_analysis.html#identifiers

A bit messy, but do-able.

However, there are still a number of complications -- one is NFKC normalization, and another is that Python treats some different Unicode characters as equivalent (e.g. Blackboard Bold "B" U+1D539 is the same as capital B U+0042) -- but only in context where the normalization is done (e.g. processing source code, but not when meta-programming, like setattr()) (sorry can't find a reference at the moment).

Frankly, it's a bit of a mess if people really do use the broad range of allowable characters.

That being said, I think that the CF problem is easier than Python, as CF isn't providing normalization -- only enforcement.

I'm inclined (at the moment -- I haven't thought it through too carefully) to go with (3) -- allow any Unicode code point except a given blacklist. Note that I say Code Point, not character, as some characters can be represented by different code points (e.g. accented characters) If we simply do "Code Point", then there is no issue of normalization, or anything else.

(hmm, option 3(b) -- any code point, but a particular normalization?)

Though maybe that's too much a wild west?

0 replies

sethmcg · 2024-06-14T16:24:27Z

sethmcg
Jun 14, 2024
Collaborator

I'd like to course-correct the discussion a bit, if I may. This is not a proposal to expand the list of allowed characters in a wide-reaching way. That's what #237 is about, and a number of folks (including me and Lars) concluded that it would be unwise; there are a lot of security and interoperability concerns that make it important to consider any expansions of the list carefully and cautiously before adding them.

I believe what Lars is proposing is that we add an explicit, stand-alone listing of the sets of banned and allowed characters, rather than only having them defined implicitly in the text of section 3.2. I can see the value in that, but I think we shouldn't frame it as a list of banned characters, because that implies that anything not on the list is allowed, and as discussed in #237, there are important reasons that the default answer for whether a character is allowed should be "no". I think we should have an explicit list of allowed characters, with an accompanying list (maybe an extra column) of clarifications to cover the known disallowed characters that Lars suggests. So maybe something like:

Allowed	Clarification
`-`	`-` is the ASCII hyphen-minus, ASCII 45 / Unicode U+002D. Other dash characters (unicode en-dash, em-dash, minus sign, soft hyphen, non-breaking hyphen, etc.) are not allowed. This character is only allowed in attribute names, not variable names.

0 replies

ChrisBarker-NOAA · 2024-06-14T18:00:45Z

ChrisBarker-NOAA
Jun 14, 2024
Collaborator

@sethmcg: sorry about that -- I think it was me that expanded the conversation.

However, the reason I did that is that I don't see how we can talk about a blacklist without the context of what's allowed, so I was trying to get at that.

However, I think maybe I get it now -- this proposal for a "blacklist" is more internal to clearly define the rules now, and to guide any potential expansion in the future -- e.g.: whatever we do we won't allow THESE charactors :-)

I see the point of that, so carry on :-)

To that point:

"-" is the ASCII hyphen-minus, ASCII 45 / Unicode U+002D. Other dash characters (unicode en-dash, em-dash, minus sign, soft hyphen, non-breaking hyphen, etc.) are not allowed. This character is only allowed in attribute names, not variable names.

I find this odd to say -- are ANY other non-ascii charactors -- any number of other symbols, punctuation, etc allowed?

I think I get the point here, but it's a odd phrasing.

I think the point is that folks may be tempted to (or accidentally) use another symbol that "looks like" a dash.

In fact, I've had that issue in a totally different context, where something was copy and pasted from an application that had (helpfully) auto-changed an ascii dash to an endash.

So I don't see this as a blacklist so much as a "be cautious of these" list -- at least in that example.

Which I do think is good to document.

The real blacklist are the ones that will break other aspects of CF / netcdf (e.g. have special meaning in CDL)

-CHB

0 replies

sethmcg · 2024-06-14T18:57:45Z

sethmcg
Jun 14, 2024
Collaborator

@ChrisBarker-NOAA

I think the point is that folks may be tempted to (or accidentally) use another symbol that "looks like" a dash.
Yes, precisely. I think that would be a good addition to the conventions, and my impression is that that's what Lars is proposing, though I may be wrong.

I hadn't thought about compiling the list of characters that we definitely don't want to add for various technical reasons, just to have a consolidated reference for what they are and why they're banned. I agree that that would be a very useful thing to have, but I'm not sure about adding it to CF proper. I worry that people would see it and think of it as the complete list of all disallowed characters, and that everything else is allowed. Maybe we want to have that list, but make it an adjunct document of some kind, like the Guidelines for Constructing Standard Names? Or put it in an appendix?

0 replies

larsbarring · 2024-06-17T08:30:44Z

larsbarring
Jun 17, 2024
Collaborator Author

Wow, I was away from this issue for a few days while there have been a lot of activity and good points. When opening this discussion I had in mind was a rather modest extension to section 2.3, where the relevant part reads

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores. By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z. By the word digits we mean the standard ASCII digits 0 to 9, and similarly underscores means the standard ASCII underscore _. ... ... ... ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only.

Essentially this allows, as a recommendation, the US-ASCII (or their Unicode counterpart) letters and digits and underscore, as well as period and hyphen for attribute names. All other characters are implicitly not recommended (or "should not"), but not explicitly excluded or forbidden. What I had in mind was to marginally reduce this huge list of not recommended characters by explicitly disallowing the few characters that we already now know will create problems.

So far I am aware of the following, all within the US-ASCII character set, control characters (decimal 0 ... 31, 127), (space), /, \, : (I do not remember in what context the : surfaced, so maybe I am mistaken).

Based on this, my simplistic suggestion is to immediately after the text cited above add a sentence, something like

... ... ... ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only. The following ASCII characters must not be used: control characters (decimal 0 -31, 127), (space), /, \ and :.

In this minimal way we avoid all complications in relation to Unicode, and focus on those few we all agree, I think, cannot be used. All other punctuation (whether ASCII or Unicode), Unicode control and what not, remains as is, which basically means to be sorted out in the future.

0 replies

larsbarring · 2024-10-04T15:23:32Z

larsbarring
Oct 4, 2024
Collaborator Author

I have now explored this in some more detail using a python script to insert various unicode characters into the variable name in a small .cdl file and then use ncgen to generate a .nc file. In the same script I used NCO/ncrename trying to change the same character of a variable name in a working nc-file to all other characters in the list, and then use ncdump to create a cdl file. Thus it is not a full round-trip because the NCO step. I focussed on ASCII (decimal 0 - 127), ISO/IEC 8859-1 (decimal 0 - 255) and control (C1), as well as Unicode whitespace (WS) groups (all according to Wikipedia). Here is the result:

Code point	Decimal	Character "group"	ncgen / nco+ncdump
U+0000	0	ASCII "`nul`" Unicode control C0	NOT/NOT
U+0001 - U+0008	1 - 8	ASCII/Unicode control C0	NOT/OK
U+0009 - U+0010	9 - 10	ASCII/Unicode control C0	NOT/NOT
U+0011 - U+001F	11 - 31	ASCII/Unicode control C0	NOT/OK
U+0020	32	ASCII/ISO/IEC 8859-1 (space)	NOT/NOT
U+0021	33	ASCII/ISO/IEC 8859-1 `!`	NOT/OK
U+0022	34	ASCII/ISO/IEC 8859-1 `"`	NOT/NOT
U+0023 - U+0025	35 -37	ASCII/ISO/IEC 8859-1 `#` `$` `%`	NOT/OK
U+0026 - U+0029	38 - 41	ASCII/ISO/IEC 8859-1 `&` `'` `(` `)`	NOT/NOT
U+002A	42	ASCII/ISO/IEC 8859-1 *``**	NOT/OK
U+002B	43	ASCII/ISO/IEC 8859-1 `+`	OK/OK
U+002C	44	ASCII/ISO/IEC 8859-1 `,`	NOT/OK
U+002D - U+002E	45 - 46	ASCII/ISO/IEC 8859-1 `-` `.`	OK/OK
U+002F	47	ASCII/ISO/IEC 8859-1 `/`	NOT/OK
U+0030 - U+0039	48 - 57	ASCII/ISO/IEC 8859-1 digits	OK/OK
U+003A	58	ASCII/ISO/IEC 8859-1 `:`	NOT/OK
U+003B	59	ASCII/ISO/IEC 8859-1 `;`	NOT/NOT
U+003C - U+003F	60 - 63	ASCII/ISO/IEC 8859-1 `<` `=` `>` `?`	NOT/OK
U+0040	64	ASCII/ISO/IEC 8859-1 `@`	OK/OK
U+0041 - U+005A	65 - 90	ASCII/ISO/IEC 8859-1 `A` - `Z`	OK/OK
U+005B - U+005E	91 - 94	ASCII/ISO/IEC 8859-1 `[` `\` `]` `^`	NOT/OK
U+005F	95	ASCII/ISO/IEC 8859-1 `_`	OK/OK
U+0060	96	ASCII/ISO/IEC 8859-1 `	NOT/NOT
U+0061 - U+007A	97 - 122	ASCII/ISO/IEC 8859-1 `a` - `z`	OK/OK
U+007B	123	ASCII/ISO/IEC 8859-1 `{`	NOT/OK
U+007C	124	ASCII/ISO/IEC 8859-1 `\|`	NOT/NOT
u+007D - U+007E	125 - 126	ASCII/ISO/IEC 8859-1 `}` `~`	NOT/OK
U+007F	127	ASCII "`del`" Unicode control C0	NOT/OK
U+0080 - U+009F	128 - 159	Unicode control C1	OK/OK but screen printouts misbehave, some pretty badly (!)
U+00A0	160	ISO/IEC 8859-1, Unicode WS	OK/OK
U+00A1 - U+00FF	161 - 255	ISO/IEC 8859-1	OK/OK
U+1680 U+2000 - U+202A U+2028, U+2029 U+202F, U+205F U+3000	5760 8192 - 8202 8232, 8233 8239, 8287 12288	Unicode WS whitespace	OK/OK but screen printouts look strange

In doing this I used the most recent released version of the netCDF library tools (netcdf library version 4.9.2 of Jun 6 2024 10:57:38).

With respect to ASCII, I think that this is a pretty strong indication of which characters (groups) should not be not accepted in variable and attribute names.

And, yes, I do think that it better to be explicit about this and expressly rule out those characters we know are likely to cause problems because the CF conventions are all about data exchange and interoperability.

I think that it would be good to get such a statement into CF-1.12, what do you think ?

ping @sethmcg @ChrisBarker-NOAA @JonathanGregory @ethanrd @Dave-Allured @DocOtak @davidhassell

0 replies

JonathanGregory · 2024-10-04T17:43:14Z

JonathanGregory
Oct 4, 2024
Maintainer

Dear @larsbarring et al.

Thanks for your thorough investigation, Lars, and thanks everyone for the discussion. The text which Lars quoted above is not the working version. Following conventions issue #237, section 2.3 now reads

It is recommended that variable, dimension, attribute and group names begin with a letter and be composed of letters, digits, and underscores. By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z. By the word digits we mean the standard ASCII digits 0 to 9, and similarly underscores means the standard ASCII underscore _. Note that this is in conformance with the COARDS conventions, but is more restrictive than the netCDF interface which allows almost all Unicode characters encoded as multibyte UTF-8 characters (NUG Appendix B). The netCDF interface also allows leading underscores in names, but the NUG states that this is reserved for system use. ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only.

which is consistent with the conformance document. That is, as Lars says, we recommend against a lot of characters. All characters except letters, digits, underscores and (for attributes only) ASCII 2D . and 2E - are recommended not to be used.

In the discussion of conventions #237 we agreed that all characters are allowed, despite the recommendation (which is not a requirement) not to use the majority of them. Lars commented that the CF conventions "essentially provide a whitelist of explicitly allowed characters. All other characters are not recommended (or recommended against) but not explicitly disallowed. But throughout this conversation there have been several remarks that some characters should indeed be explicitly disallowed. This could easily be done by ... creating a blacklist." That's what this discussion is about, if I understand correctly.

The last sentence of the working text as above is unsatisfactory, despite #237, because it says . and - are "allowed". Those two characters are certainly "allowed", because all characters are allowed. What it means is that we aren't recommending against them. We should fix that.

The working text is also unsatisfactory because it implies that the NUG prohibits some characters ("it allows almost all Unicode characters ...") but it doesn't say which ones are not allowed. NUG Appendix B says that names should match the regular expression

([a-zA-Z0-9_]|{MUTF8})([^\\x00-\\x1F/\\x7F-\\xFF]|{MUTF8})*
MUTF8        = <multibyte UTF-8 encoded, NFC-normalized Unicode character>

I suppose we should understand the regular expression to begin with ^ and end with $ i.e. it's the complete name. Do you agree?

Since ASCII is a subset of UTF-8, I think that by "multibyte UTF-8 encoded", the NUG must mean a Unicode character which is encoded in more than one byte by UTF-8. That is, MUTF8 doesn't include one-byte characters, among them the ASCII characters 00-7F. Do you agree?

If that's correct, the NUG does not allow / (which is in the middle of the second [...] expression), or the one-byte characters 00-1F (ASCII control characters) and 7F-FF. Lars's experiments agree that ncgen does not work with the control characters, / and 7F, but apparently it does work with 80-FF.

I think we should explicitly state that we prohibit 00-1F, / and 7F-FF, if I'm correct that NUG doesn't allow them anyway. The CF text is currently vague, but it says CF is "more restrictive" than the netCDF interface, for which it cites the NUG (although Lars's experiment shows that the netCDF interface is more forgiving than the NUG). This implies that CF upholds NUG restrictions, as you'd expect.

Also, the the CF working text is inconsistent with the NUG in saying "It is recommended that variable, dimension, attribute and group names begin with a letter". This is not merely a recommendation, because the NUG says that names must begin with a letter, digit, underscore or multi-byte UTF-8 character. We should fix this. Our text currently implies it's OK to start a name with a punctuation mark, for instance, which the NUG prohibits.

Lars's experiment shows that ncgen doesn't allow space (20) or any of the puncuation marks 21-2F except + - and ., nor any of the symbols 3A-3F, 5B-5E, 60, 7B-7F. To put it positively, ncgen allows only letters, digits, _ + - . and @. Thus, ncgen is more restrictive than the NUG.

I think it would be reasonable for CF to prohibit all those characters which ncgen doesn't support, and which therefore could not be used in CDL. That would be a backward-incompatible change, which we don't normally make, if in fact any existing data uses any of those characters in netCDF names. Given Lars's experience, however, it seems unlikely anyone would have used them, despite NUG allowing them.

We've decided to allow . and - in attribute names, but not other names. What about + and @, which are the only two one-byte characters so far not considered. NUG and ncgen allow them, so I think CF should continue to allow them. At the moment, we recommend that they should not be used, since they aren't in our current whitelist.

Best wishes

Jonathan

0 replies

DocOtak · 2024-10-05T00:49:33Z

DocOtak
Oct 5, 2024
Maintainer

Looking over this and the long original question. Is it worth separating variables into two categories: variables meant to be interpreted in a CF way, and variables that are not? I'm of the opinion that variable names basically don't matter and that all of the actual information is going to be inside the attribute values.

I would propose that for variables that are intended to be interpreted as CF variables, we are very restrictive: ASCII letters a-zA-Z, numbers 0-9, and the low line _. But other variables that are not meant to be interpreted by CF readers be unrestricted by CF. Perhaps noting that the use of chars outside the range allowed for CF variables can still cause problems in some software even if you don't intend those variable to be read.

I think that adding . and - was a little quick/premature/misstep as it originated in a proposal that is in progress.

8 replies

sethmcg Oct 8, 2024
Collaborator

I'm a little confused. I'm not saying anything about what you should or shouldn't do, I'm saying that, by definition, if there's file content that doesn't follow the CF standards, that content is not CF-compliant. And if someone is generating non-compliant data, again, by definition, they're not following the standard. Why would we put something in the standard about what people should do when they're not doing what we say they should do?

If you have a reason to put non-compliant data into an otherwise compliant file, that's fine. It's perfectly valid to choose to deviate from the standard because you have some other concern that trumps it. But I figure if somebody has an overriding reason to ignore a section of the standard, they're going to ignore it regardless of what it says. So I don't get why we'd want to say, basically, "even if you're not following the standard, we recommend that you follow the standard."

If you're suggesting that we add some commentary about why this particular part of the standard is the way it is, I'm all for it. But saying "we recommend adhering to these restrictions" just seems tautologically redundant to me. In my mind, that's what the CF standard is: a set of restrictions that we recommend adhering to because it maximizes data reusability and makes it less likely that software will fail to read the data.

ChrisBarker-NOAA Oct 8, 2024
Collaborator

@DocOtak: I'm curious as to why you tin kit important to allow a wider range of characters?

If a user is paying attention to conventions when preparing the file, it doesn't seem like that big a deal to just make sure all your names conform.

I know what you mean by " not meant to be interpreted by generic software" -- but even if a variable is a special use-specific variable, many (most?) folks are still going to want to read it with "generic" software like xarray, or plotting software, or ... and having who knows what in a variable names can get tricky.

It sure seems safer to hvae one set of rules.

DocOtak Oct 8, 2024
Maintainer

I should have included advice for what generic software should do, that is, ignore the variables and attributes it doesn't know how to deal with but interpret the ones it can handle. I remember Matlab being pretty good at reading as much as it can. We should strive a bit to meet data producers where they are. I'd like to be able to welcome them to CF with minimal friction and, once they are in, apply the pressure to make their data more interoperable. In other words, incremental adoption of CF within a single data file should be supported by CF.

Perhaps this is my age, but UTF-8 has been a thing for basically my entire computer life. UTF 8 has been required to be supported by internet protocols since 1998 (26 years). So I often have a feeling of "wow that's antiquated" whenever ASCII gets mentioned. But, people have done foolish things with variable names that we get to deal with: they eval them a symbols in their favorite programing language, we (CF) use spaces to separate variable names (aside, I'd like to see a comeback of the File, Group, Record, and Unit Separators in "CSV" files), popular data servers like ERDDAP have their own restrictions on charsets for format translation reasons. Perhaps the most foolish thing is to assume that plain text exists on a computer and they aren't just byte sequences.

When it comes to input validation, having a blacklist of things tends to be the wrong way to go. Lars even notes that this might be an effort that would be ongoing and we may never identify all the things that need to be on the blacklist. So I'd propose the rules be as follows:

CF Variables MUST be named with a-zA-Z0-9 or "_" (whitelist) and MUST NOT start with "_" and SHOULD NOT start with 0-9.
- Starting with "_" is reserved by NUG
- Starting with 0-9 prevents the variable from being used as a symbol in most programing languages
Variables with any other char are not valid CF variables and SHOULD be ignored by generic software to allow incremental adoption of CF
Only valid CF Variables SHOULD be considered when evaluating the validity of the entire dataset/file

sethmcg Oct 15, 2024
Collaborator

@DocOtak - I appreciate the spirit of these rules, but I don't think the wording works. As it stands, it can be read as doing away with validity, since anything invalid should be ignored when evaluating validity. I think they could work if you talked about the validity of the name rather than the validity of the variable, though.

But I'm still a bit confused about the issue of incremental adoption. I think there's some part of this that I'm just not getting, and that I'm not fully understanding your concern. Do you have an example of one of these files that has parts that can't be made compliant?

DocOtak Oct 16, 2024
Maintainer

@sethmcg I think some context that might be missing is that my office manages a historic observational dataset that has had its own bespoke data formats since the early 90s so we struggled a bit with an all at once adoption. Of the concerns with CF specifically, was if extra attributes were allowed at all. We couldn't, at the time at least, find anything that said extra attributes of any kind were ok.

So I'm coming from a position of wanting an explicit answer to the question "can I have extra variables in my file?" I want the answer to be "yes, and you can do whatever you need to do with those extra variables without any restriction". Technically, that means I rules such as the ones I've proposed above need to exist.

larsbarring · 2024-10-08T15:03:57Z

larsbarring
Oct 8, 2024
Collaborator Author

A couple of further comments to my analysis and to the subsequent comments/responses:

Now looking more carefully at NUG Appendix B that @JonathanGregory refers to, I realize that I did not escape the "special2" characters (in that sense my test was naive)

// special2 chars are recently permitted in
// names (and require escaping in CDL).
// Note: '/' is not permitted.
special2 = ' ' | '!' | '"' | '#' | '$' | '%' | '&' | ''' |
'(' | ')' | '*' | ',' | ':' | ';' | '<' | '=' |
'>' | '?' | '[' | '\' | ']' | '^' | '`' | '{' |
'|' | '}' | '~'

Consequently, they should be regarded as "OK/OK" (though I have not tested).
ASCII Control (x00 - x1F) characters are already excluded by NUG, as is the forward slash /
Of the "special2" characters the space have specific meaning in CF, namely as a delimiter. This has been noted in several previous comments and strongly argued against allowing it (here, here, here). And @Dave-Allured suggested a "limited restriction":

Variable names included in blank-separated lists such as ancillary_variables or coordinates must not include the ASCII space character.

But given the other comments I would argue that it is safer to blacklist the space character altogether. I can imagine that someone think that it is a good idea to disseminate files having a variable (or global attribute) named GCM name and then someone else at a later stage wants to do some ensemble analysis thus trying to put this variable or attribute in a coordinate attribute to create an ensemble dimension.
Regarding the backslash \, NUG allows it (needs escaping), but there were comments against (here and here here, but without specific reason given. But as the backslash is commonly used as escape character (as per NUG) it would be easy to end up in complicated and error-prone combinations.
Moreover, @sethmcg commented

In addition to the aforementioned problem it causes with ancillary variables, it's not uncommon in my experience for a lot of netcdf processing to happen on the command line (rather than neatly encapsulated within the confines of a general-purpose library) by piping the output of ncdump -h through various commands, in particular cut and grep. Whitespace would play havoc with those kinds of workflows, which in my opinion makes it an absolute showstopper.

This use case also makes me very leery about the prospect of allowing any character that is a special character in the shell or a regular expression. Let's not set up a situation that demands lots of quoting.

Further, this suggests to me that if the list of allowed characters is to be expanded, it should be via a whitelist approach rather than a blacklist approach; i.e., the default should be that characters are disallowed unless they have been carefully vetted.

which attracted thumbs-up from @jesusff, myself, @lesserwhirls, @taylor13, @czender, and explicit support from @JonathanGregory. Disallowing (i.e. blacklisting) all characters not explicitly in the whitelist, as Seth suggests, goes against the conclusion when that issue cf-convention#237 was closed. Instead they belong to what might be called a "greylist", i.e. "allowed but not recommended" (or recommended against).

However, allowing but not recommending all characters not explicitly disallowed by NUG is problematic for the following reasons:

space MUST be explicitly disallowed to not break CF as explained above.
In the ASCII range (20 - 7F (decimal 32 - 127)) all characters belonging to the "special2" (listed above) category of NUG should be explicitly disallowed. This follows the suggestion by Seth and supported by many (as cited above). What then remains is (as per NUG):

// special1 chars have traditionally been
// permitted in netCDF names.
special1 = '_'|'.'|'@'|'+'|'-'
Moreover, I strongly argue that ASCII del (7F (decimal 127)) and all Unicode control C1 characters (80 - 9F (decimal 128 - 159)) should be explicitly disallowed. Typically, these are non-printing characters that are not visible in a ncdump printout. Exceptions are 84 (decimal 132) that creates a line-feed, 9A- 9B (decimal 154 -155) that deletes the subsequent character, and particularly problematic, 9D - 9F (decimal 157 - 159) that make the python script silently stop (i.e. it did not even crash) when trying to print, and the same goes for ncdump -h.
Finally, I would equally strongly argue that all Unicode WS whitespace characters should be explicitly disallowed. They are printed as one or more space that are not easy or even possible to distinguish from the ASCII space without special measures. I do not think that we want to end up in a situation where users receiving a file may have to resort to octal/hex dumps of the output from ncdump to actually understand the content (this comment is equally valid for the previous point!).

I suggest that these four points should form the basis for creating a "blacklist" of characters that CF explicitly disallows despite that they are allowed by NUG. In principle this is a breaking change of what we previously agreed on in cf-conventions/#237, which still belong to the current draft version, and in practice the suggested list of characters to blacklist are typically not the ones one would expect to prime targets for users to include in new files.

3 replies

larsbarring Oct 8, 2024
Collaborator Author

PS, I have no intention to check the remaining almost 137 thousand Unicode characters, graphemes, codepoints or whatever they are called :-)

sethmcg Oct 8, 2024
Collaborator

So Lars, to sum up, you are recommending that we explicitly ban the following characters from use in variable and attribute names because they are known to cause technical problems:

all Unicode whitespace characters (including ordinary space)
all of the NUG "special2" characters ( !"#$%&'()*,:;<=>?[]^`{|}~)
ASCII del and all Unicode control C1 characters

Yes?

I am 100% in favor, and support this recommendation wholeheartedly.

larsbarring Oct 8, 2024
Collaborator Author

Thanks Seth for the excellent and succinct summary! Yes, this is exactly what I suggest, and in the longish comment provide concrete reasons for why these character (groups) are singled out.

larsbarring · 2024-10-08T15:38:01Z

larsbarring
Oct 8, 2024
Collaborator Author

On a partly different aspect, @JonathanGregory commented

The last sentence of the working text as above is unsatisfactory, despite #237, because it says . and - are "allowed". Those two characters are certainly "allowed", because all characters are allowed. What it means is that we aren't recommending against them. We should fix that.

I fully agree. The question is how to fix it. @DocOtak noted

I think that adding . and - was a little quick/premature/misstep as it originated in a proposal that is in progress.

which refers to this issue, and in particular this comment. Before we fix this particular sentence I think we should get some input regardingtheir views. I will shortly make comment over there .

0 replies

ChrisBarker-NOAA · 2024-10-08T19:37:40Z

ChrisBarker-NOAA
Oct 8, 2024
Collaborator

all Unicode whitespace characters (including ordinary space)

all of the NUG "special2" characters ( !"#$%&'()*,:;<=>?[]^`{|}~)

ASCII del and all Unicode control C1 characters

I think the NUG special characters are the "blacklist" -- the rest could (should) be defined in terms of the Unicode categories:

https://www.compart.com/en/unicode/category

e.g. No control characters. (ASCII DEL is a control character)

And maybe it's time now to specify which categories are allowed / not allowed?

1 reply

larsbarring Oct 8, 2024
Collaborator Author

Yes, I do agree that we should be more specific compared to the outcome of cf-convention/#237.

JonathanGregory · 2024-10-08T21:59:07Z

JonathanGregory
Oct 8, 2024
Maintainer

Dear all

My understanding of NUG Appendix B (as in my previous post) is that NUG prohibits ASCII control characters (00-1F) and slash / (as @larsbarring says), ASCII DEL (7F), and all one-byte codes 80-FF, not just the Unicode C1 control characters. If NUG prohibits them, so should we. Have I misunderstood it?
On the other hand, we now understand that ncgen as well as NUG permit the "special2" characters !"#$%&'()*,:;<=>?[\]^`{|}~ (including backslash \) and ASCII space (20). Lars and Seth are in favour of prohibiting these in CF (blacklist), whereas currently we allow but recommend against them (greylist). That would be a backward-incompatible change, in our usual generous sense that data written up to CF 1.11 could use them in names, but they would give an error from a checker following CF 1.12. I certainly agree it would be a bad idea to use any of these characters in a name, since it is very likely to cause problems. Is that a sufficient reason for prohibiting them in CF? If the majority think so, I'd agree to it.
Lars proposes that other whitespace Unicode characters should be prohibited, for the same good reasons. Again, I would agree if the majority are happy. Such characters are Unicode character class "Space Separator", key Zs. They include the "Ogham space mark"  , which isn't invisible!
NUG permits the "special1" characters _.@+-. _ is already on our whitelist. We have recently put .- on the whitelist for attribute names. (We could reverse that decision easily, since it's not in a released version of CF yet.) .- for variable and dimension names are on the greylist. @+ are on the greylist for all names.
All other ASCII characters are on the whitelist (letters and digits), and would remain so.
All Unicode code points requiring more than one byte in UTF-8 are on the greylist, and we would leave them there.

Best wishes

Jonathan

2 replies

larsbarring Oct 8, 2024
Collaborator Author

Regarding your first point, I am not expert enough to fully parse the regular expression you refer to. While a regex is appropriate in the that section of the NUG, I think (drawing on myself) that it is relevant for CF to express the character blacklist in plain language.

JonathanGregory Oct 10, 2024
Maintainer

I certainly agree we should describe the characters in plain text! I'm not sure either what the NUG means, especially because regular expressions have several dialects, and different languages protect special characters in various ways. The RE is [^\\x00-\\x1F/\\x7F-\\xFF]. Within [ ] you can specify individual characters or ranges of characters. Inside this [ ] I see \\x00-\\x1F, / and \\x7F-\\xFF. / is a single character, and I suppose the ranges mean 00-1F in hexadecimal, and 7F-FF in hexadecimal. The ^ at the start means the [ ] does not match any of the things inside it.

JonathanGregory · 2024-10-14T17:06:18Z

JonathanGregory
Oct 14, 2024
Maintainer

Please note that @Dave-Allured has opened conventions issue 548 to delete the sentence, "ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only." in Sect 2.3. This sentence was inserted into the working version by conventions issue 477 for various reasons, including to support IETF BCP 47 language tags, discussed in conventions issue 528, which is still ongoing.

If Dave's proposal is accepted, the characters allowed for attribute names will be the same as for variable names in CF 1.12, which is the same as in CF 1.11, the most recently reduced version.

0 replies

ethanrd · 2024-10-15T00:59:04Z

ethanrd
Oct 15, 2024
Maintainer

Hi all - Sorry I'm late to this discussion. A few thoughts as I'm starting to catch up:

Please DO NOT consider the NUG a reliable source for Unicode information. The sections that mention Unicode were written some time ago (2008) and without an in-depth understanding of Unicode. I do feel confident saying the intent at the time was that the names of all netCDF objects (dimension, variable, attribute, group, etc.) should be valid UTF-8 strings that are NFC normalized and do not contain any control characters.

I believe the netCDF-C library validates that names are NFC normalized UTF-8 strings and without control characters (in the ASCII range) when creating a new netCDF dataset but not when reading (and maybe not when renaming). I believe the netCDF-Java library behaves in a simi8lar manner though I haven't tested as much.

I agree with the comment above from Chris @ChrisBarker-NOAA about using Unicode categories (list) to specify allowed and/or not allowed characters. Also an earlier comment about reviewing other documents on Unicode for identifiers/names, e.g., how the Python Language defines the syntax for Identifiers.

0 replies

larsbarring · 2024-10-16T10:49:44Z

larsbarring
Oct 16, 2024
Collaborator Author

There seems to be reasonable convergence towards creating a list of characters that CF explicitly disallows, and in doing so also updates the text regarding which characters CF explicitly recommends. If there is agreement, I can move on to create an issue over in the cf-conventions repo.

2 replies

Dave-Allured Oct 16, 2024

I do not agree. CF should honor the full allowed character set of the underlying file format. See previous discussion in cf-convention/cf-conventions#237.

A possible compromise would be to include a greylist, not a blacklist. However I also recommend against that, because it would be an unnecessary extension of an already complicated CF document. Section 2.3 is sufficient as it is now written on the allowed character set.

sethmcg Oct 16, 2024
Collaborator

CF should honor the full allowed character set of the underlying file format.

My takeaway from that discussion is that there are a number of good reasons NOT to add the entire set of unicode characters wholesale, and that the agreed-upon outcome was to stick with the status quo of 63-ish recommended/whitelisted characters for variable names (possibly plus a couple more for attribute names). Everything else is tacitly graylisted.

The proposal here is to create a relatively short blacklist explicitly banning characters that are already effectively unusable because they cause problems with existing software, and the point of doing that is so that the list of known trouble-makers is well-documented in a place where people can easily find it for future reference.

I agree that we don't want to overcomplicate the main document, so I would be in favor of putting it in an appendix of some kind. But I don't see any reason not to have the list of characters that are known to break things if you try to use them written down somewhere official and easily findable.

JonathanGregory · 2024-10-16T13:12:35Z

JonathanGregory
Oct 16, 2024
Maintainer

Dear @larsbarring et al.

@Dave-Allured mentions "the full allowed character set of the underlying file format". We were taking NUG as defining the character set, but Ethan's comment suggests that the NUG shouldn't be relied on concerning the character set. Therefore it seems even more important that CF should be crystal clear, both about what netCDF allows and disallows, and about any further restrictions of CF (grey or black).

I have been searching the ncgen code to find the actual definition of characters allowed in new netCDF names (of dimensions, variables and attributes), and I think that it may be the one in https://github.com/Unidata/netcdf-c/blob/main/ncgen/ncgen.l, at lines 183 and 198. Of course, I won't be surprised if I've got the wrong lines! The ones in question are

/*The partially relaxed version of UTF8, and the one used here */
UTF8 ([\xC0-\xD6][\x80-\xBF])|([\xE0-\xEF][\x80-\xBF][\x80-\xBF])|([\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF])
idescaped \\[ !"#$%&'()*,:;<=>?\[\\\]^`{|}~]
numescaped \\[0-9]

/* New definition to conform to a subset of string.c */
ID ([a-zA-Z_]|{UTF8}|{numescaped})([a-zA-Z0-9_.@+-]|{UTF8}|{idescaped})*

Here, UTF8 is a definition of the legal 2-, 3- and 4-byte UTF-8 sequences for Unicde code points. UTF8 does not include the legal UTF-8 single-byte codes, which are the ASCII characters \x00-\x7F. There are a lot of backslashes in the middle of idescaped because [ \ and ] all need to be escaped. The result is that a new netCDF name must

Begin with an ASCII letter (a-z or A-Z), an ASCII digit (0-9), ASCII underscore _ or a multi-byte UTF8 sequence.
Continue with any of the permissible first characters (ASCII letters, digits, underscore and multi-byte UTF8) and also any of .@+- or !"#$%&'()*,:;<=>?[\]^`{|}~. I believe that means all ASCII characters \x00-\x7F are allowed except for the control characters \x00-\x1F, the space \x20, slash / and DEL \x7F. [Edited: I had forgotten to include / originally.]

Digression: to produce a backtick within backticks e.g. back`tick in GitHub markdown, you have to surround it all with two backticks instead of the usual single backtick, thus:

e.g. ``back`tick`` in GitHub markdown

For reading existing netCDF, I think the relevant file is https://github.com/Unidata/netcdf-c/blob/main/libdispatch/dstring.c, which perhaps was formerly called string.c, as mentioned in the above comment. This contains a function NC_check_name which verifies that a name string is valid. The allowed syntax, as a regular expression, is ([a-zA-Z0-9_]|{UTF8})([^\x00-\x1F\x7F/]|{UTF8})*, which agrees with the NUG.

The rule for the first character is the same as for creation of new names. The rule for the following characters is the same as well except that space is allowed. For characters after the first, the code has

	    /* handle simple 0x00-0x7f characters here */
	    if(ch <= 0x7f) {
                if( ch < ' ' || ch > 0x7E) /* control char or DEL */
		  goto fail;

and its else-block goes on to permit the start of a multi-byte UTF8. Slash / is separately excluded in any position by an earlier line in the code.

I hope that helps clarify the netCDF rules, which are our basis.

Best wishes

Jonathan

0 replies

JonathanGregory · 2024-10-16T14:09:23Z

JonathanGregory
Oct 16, 2024
Maintainer

The netCDF rules are black and white; either a character is allowed in a certain position in a name or it's not, except for the slight difference between reading and writing. I suggest that the CF text should state the following as the netCDF rules. Assuming I've got those rules right (above), the intersection of what it allows for reading and writing is:

Begin with an ASCII letter (a-z or A-Z), an ASCII digit (0-9), ASCII underscore _ or a multi-byte UTF8 sequence.
Continue with any of the permissible first characters (ASCII letters, digits, underscore and multi-byte UTF8) and also any of .@+- or !"#$%&'()*,:;<=>?[\]^`{|}~ i.e. all ASCII characters \x00-\x7F except for the control characters \x00-\x1F, the space \x20, slash / and DEL \x7F.
ASCII control characters \x00-\x1F, space \x20, slash / and DEL \x7F are not permitted in any position.
Single-byte character codes \x80-\xFF are not permitted in any position. This excludes "Latin 1" and "extended ASCII" ISO/IEC 8859-1 and ISO/IEC 8859. These aren't allowed by dstring.c. Some codes in the range \x80-\xFF are allowed, but only as the first byte of a multi-byte UTF-8 sequence.

Having stated the netCDF rules we can compare them with what CF says. The first paragraph of Sect 2.3 is currently

It is recommended that variable, dimension, attribute and group names begin with a letter and be composed of letters, digits, and underscores. By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z. By the word digits we mean the standard ASCII digits 0 to 9, and similarly underscores means the standard ASCII underscore _. Note that this is in conformance with the COARDS conventions, but is more restrictive than the netCDF interface which allows almost all Unicode characters encoded as multibyte UTF-8 characters (NUG Appendix B). The netCDF interface also allows leading underscores in names, but the NUG states that this is reserved for system use. ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only.

The last sentence would be deleted by conventions issue 548. Please comment in that issue if you agree or disagree with removing it. Let's assume in the following that we delete it, so that the CF rules for attribute names are the same as for other names. As we've discussed already, CF has a greylist of characters which are permitted but not recommended. By comparison of the above with the netCDF rules, the greylist is:

ASCII digits and ASCII underscore _ are not recommended as the first character. (We recommend only ASCII letters.)
ASCII .@+-!"#$%&'()*,:;<=>?[\]^`{|}~ and ASCII space are not recommended as subsequent characters. (NetCDF does not allow space when creating new names, but tolerates it in existing names.)
Multi-byte UTF-8 sequences are not recommended in any position.

This issue is considering whether we want to move any characters among the CF whitelist (NetCDF permitted minus CF greylist), greylist and blacklist (NetCDF prohibited).

Lars has proposed or suggested that

Unicode characters of class Zs "space separator" should be moved from the greylist to the blacklist in any position, including the ASCII space (which netCDF does not allow in new names anyway).
Unicode characters of class Co "control characters" should be blacklisted in any position. I don't think we need to do that, because NetCDF doesn't allow them i.e. they're already on the blacklist.
Unicode characters !"#$%&'()*,:;<=>?[\]^`{|}~ (the NUG special2 characters) should be moved from the greylist to the blacklist in any position. (NetCDF doesn't allow them as first character.) I agree with @sethmcg that we needn't give a name to this character class. We can just list them and their character codes.

I'm essentially restating what Lars has said, but perhaps it's clearer in the context of netCDF.

@Dave-Allured disagrees with blacklisting any character allowed by netCDF.

Best wishes

Jonathan

0 replies

JonathanGregory · 2024-10-16T19:43:22Z

JonathanGregory
Oct 16, 2024
Maintainer

Seth writes

I agree that we don't want to overcomplicate the main document, so I would be in favor of putting it in an appendix of some kind. But I don't see any reason not to have the list of characters that are known to break things if you try to use them written down somewhere official and easily findable.

Although this discussion is quite long, I think that stating the rules would not take much more space than it already does - a short extra paragraph at most, even if we decide to add to the blacklist as proposed by Lars. So I think it's worth keeping in sect 2.3, where it is now.

Personally, I think it would be sensible to blacklist the "special2" and space-like characters, because it would support this one of the CF principles in 1.2:

The conventions should minimise the possibility for mistakes by data-writers and data-readers.

Cheers, Jonathan

0 replies

larsbarring · 2024-10-22T21:03:59Z

larsbarring
Oct 22, 2024
Collaborator Author

To be transparent about what I did in the tests, below is the python code (beware I am neither a developer, nor a pythonista...) and necessary files in one .zip file, and the resulting .cdl and .nc files in another .zip file. If you run the python code you will get a screen printout giving brief indication of the problems.

I ran this in bash on a linux box, and I am not sure which of the problems encountered are only from with the ncgen/ncdump/ncrename, and what might be only to bash/linux, and what might be due to some interaction between them. Nevertheless, I do think that when recommending ("whitelisting") a set of Unicode characters we need to take into account problems related to any of these sources (irrespective of which operating system is used (some limits may apply...)) and not only what the netCDF-C library accepts.

While I think we have not reached consensus it seems a majority is leaning towards adding an explicit blacklist. I am not sure if/how to move this discussion forward.

test_cf_unicode_files.zip (~1 Mb)
test_cf_unicode_prog.zip (~6 kb)

1 reply

Dave-Allured Oct 22, 2024

Personally, I think it would be sensible to blacklist the "special2" and space-like characters, because it would support this one of the CF principles in 1.2:

The conventions should minimise the possibility for mistakes by data-writers and data-readers.

I understand the sentiment. However, character restrictions have unintended consequences, and CF aspires for wider application. Also if taken seriously, this becomes a rathole of endless discussion and intricacy.

Let me restate. I see the CF primary role as specifying descriptive metadata and structure. CF should stay out of the character set business. That is an area that does not really need our help.

(edit: posted out of order, sorry.)

Dave-Allured · 2024-10-22T21:57:55Z

Dave-Allured
Oct 22, 2024

@larsbarring how to move forward ... Some have suggested a condensed graylist version. Would you be able to post a condensed version, or is that too far away from your goal?

4 replies

larsbarring Oct 22, 2024
Collaborator Author

To my mind we have three categories:

a limited set of characters that CF explicitly recommends (RECOMMENDED to be used [to maximize interoperability]) -- a whitelist
a limited set of characters that CF explicitly disallows (MUST NOT be used [due to known technical problems, or otherwise substantial risk for identified problems at the data readers' end] -- a blacklist
all other characters are allowed by CF (MAY be used by data writers' at their discretion) -- a graylist

@Dave-Allured I am not sure I understand what you mean by a graylist: Do you suggest that what I here call the blacklist becomes the graylist? If so, what happens to all the remaining characters that belongs to what I now call the graylist? I am thinking of this in terms of BCP-14, which I take only have three categories (I don't think the REQUIRED category is applicable here). Are you suggesting another category? In that case, how to describe it (using -- or not using -- BCP-14)?

Or I must have missed the comment suggesting a condensed graylist version -- in that case my apologies.

Dave-Allured Oct 22, 2024

Graylist: a limited set of characters that CF recommends against, but does not explicitly disallow. This would be a mixture of explicit single characters and categories. Hopefully, a short list.

Dave-Allured Oct 23, 2024

"condensed graylist version" -- my bad. I meant the combination of a short paragraph, suggested several times; with greylist rather than blacklist.

Let me rephrase. Black or grey, either way, can you rewrite your current proposal in a brief paragraph or two? Or do you see it necessary to construct a significant table?

larsbarring Oct 23, 2024
Collaborator Author

OK, now I better understand. I have not thought too much in detail about how to write the text, because I thougth that would happen when we move the discussion over to a cf-conventions issue and associated PR. I do not want to see a significant table (absolutely not something like one in my earlier comment!)

I like the idea of referring to Unicode classes (like Chris mentions below), but I also think that it is useful to what this means in Unicode codepoint ranges. Whether this can most clearly and efficiently be done within a few paragraphs, or summarized into a short (!) table I do not know and have no strong opinion about. Again I think this is for the issue and PR to tease out.

Moreover, I have just revisited the BCP-14 page and was reminded that there actually is "a fourth category", namely NOT RECOMMENDED (or SHOULD NOT). While I still prefer MUST NOT, I am open to discuss relaxing this in favour of SHOUD NOT. But, again, I think this belongs in the issue/PR.

ChrisBarker-NOAA · 2024-10-22T22:56:14Z

ChrisBarker-NOAA
Oct 22, 2024
Collaborator

r.e.:
(lost track of who said what, sorry)

While I think we have not reached consensus it seems a majority is leaning towards adding an explicit blacklist.

This conversation has drifted a bit to the larger issues (whitelist, greyilst, etc...) -- but I htink it is a good idea to tighten it for now, and yes, do a blacklist.

and

Let me restate. I see the CF primary role as specifying descriptive metadata and structure. CF should stay out of the character set business. That is an area that does not really need our help.

Well, maybe, maybe not -- "the character set business" is a mess, and it does interact with many things of CF concern -- essentially its reason for being -- we want data sets to be easily human and machine readable, and unambiguous in their meaning -- and dealing, at some level, with "the character set business" furthers that goal.

Related -- HDF5 and netcdf have defined that UTF-8 shall be the text encoding. That is a HUGE win over the any old encoding is OK world. I'm not sure what we are talking about is that different in concept.

And it's easier to relax than tighten restrictions later.

Back to the blacklist:

In addition to the specific characters identified, specifically blacklisting a couple Unicode classes might make sense, e.g. what Jonathan wrote out a couple posts back, which I think is:

Separator (Z): line (Zl), paragraph (Zp), space (Zs)

Other (C): control (Cc), format (Cf), not assigned (Cn), private use (Co), surrogate (Cs)

We may also want to limit some (or all) of:

Mark (M): spacing combining (Mc), enclosing (Me), non-spacing (Mn)

(Though I'm a touch confused by what those actually are)

We should probably also mention in the same place that names should be NFC normalized (which is specified in the NUG)

0 replies

Dave-Allured · 2024-10-23T14:49:40Z

Dave-Allured
Oct 23, 2024

@larsbarring:

test_cf_unicode_files.zip
test_cf_unicode_prog.zip

Please remove U+0000 through U+001F from the Unicode test table, to reduce clutter. This range is already excluded from the netcdf base character set, and not controversial.
Line 72 of your test program inserts an unquoted arbitrary string onto the linux command line. Of course this will cause problems for known active shell characters. Please add proper quoting for "v1". Curly brackets alone are not sufficient.

ncrename -v {v2},{v1}

0 replies

Dave-Allured · 2024-10-23T14:53:40Z

Dave-Allured
Oct 23, 2024

Please reduce the prototype CDL file to a single data variable with no attributes, to reduce clutter. This will have no effect on the validity of your test.

0 replies

JonathanGregory · 2024-10-23T17:56:48Z

JonathanGregory
Oct 23, 2024
Maintainer

Dear Lars

To summarise my earlier posting, I think we should replace the first paragraph of 2.3 along these lines:

The NetCDF interface requires the following for the name of any variable, dimension, attribute and group:

It must begin with an ASCII letter (a-z or A-Z), an ASCII digit (0-9), ASCII underscore _ or an NFC-normalised Unicode codepoint encoded in UTF-8 and requiring more than one byte.
The rest of the name can include any ASCII character or multi-byte UTF-8 Unicode codepoint except that it must not include any ASCII control character (decimal 0-31, hexadecimal 00-1F), space (decimal 32, hexadecimal 20), slash / (decimal 47, hexadecimal 2F) or DEL (decimal 127, hexadecimal 7F).

In addition to the NetCDF requirements, in CF

The name is recommended to begin with a letter, not a digit or _.
The name is recommended not to include any multi-byte UTF-8 Unicode codepoint.

and either

The name is required not to include any of the ASCII characters !"#$%&'()*,:;<=>?[\]^`{|}~ or any Unicode codepoint in class Zs "space separator".

or

The name is recommended not to include any of the ASCII characters !"#$%&'()*,:;<=>?[\]^`{|}~.

The or version is the status quo. Whether to adopt the either alternative is the main point at issue, I believe.

Best wishes

Jonathan

0 replies

ChrisBarker-NOAA · 2024-10-23T18:34:39Z

ChrisBarker-NOAA
Oct 23, 2024
Collaborator

It must begin with an ASCII letter (a-z or A-Z), an ASCII digit (0-9), ASCII underscore _ or an NFC-normalised Unicode codepoint encoded in UTF-8 and requiring more than one byte.

wait, really! WTF? that makes absolutely no sense, as I read it "an NFC-normalised Unicode codepoint encoded in UTF-8 and requiring more than one byte" -- that is EVERY code point in Unicode -- including punctuation, control characters, various whitespace, -- this list goes on. Very, very odd that they could disallow all the non letter and digit ASCII codepoints, but allow all the non-ascii ones -- Huh?

I think it was Ethan that said that the netcdf handling of Unicode should not be considered thoughtful.

Anyway -- we probably should bring this up with the netcdf folks, but in the meantime, CF can be more restrictive, and it absolutely should be.

Perhaps we can re-define all this with a more appropriate extension from ASCII to Unicode -- e.g. "control code points are disallowed", "Letters are allowed" -- obviously spelled out in the proper language of Unicode.

NOTE: this is distinct from the blacklist issue -- which I do support.

0 replies

ChrisBarker-NOAA · 2024-10-23T18:36:52Z

ChrisBarker-NOAA
Oct 23, 2024
Collaborator

Side note:

The search on the NUG here:

https://docs.unidata.ucar.edu/nug/current/index.html

is broken (I get a 404) for me.

how do I report that?

4 replies

Dave-Allured Oct 23, 2024

Works for me. Double checked, not cached. Off topic, let's take this off line. Reply to me privately if needed. Go through my icon.

Dave-Allured Oct 23, 2024

Oh sorry, now I see that you mean the "search box". They fixed that in the new NUG version 2, currently in draft.
https://docs.unidata.ucar.edu/nug/2.0-draft/

ChrisBarker-NOAA Oct 23, 2024
Collaborator

OT -- but thanks!

unfortunately, while that search box doesn't 404, it seems to only be searching the headings -- so I"m available to find any discussion of Unicode :-( -- but yes, OT.

Dave-Allured Oct 23, 2024

No longer OT. Try "site:docs.unidata.ucar.edu/nug unicode" in generic search engine such as google.

This did not work as expected for me, but it got me to the file format appendices in both version 1 and version 2 docs. They both address Unicode, and say about the same things.

ChrisBarker-NOAA · 2024-10-23T23:37:46Z

ChrisBarker-NOAA
Oct 23, 2024
Collaborator

Back to On Topic: with Google's help I found the relevant text in the NUG:
(https://docs.unidata.ucar.edu/nug/current/file_format_specifications.html)
I couldn't find it in the draft version 2...

Note on names: Earlier versions of the netCDF C-library reference implementation enforced a more restricted set of characters in creating new names, but permitted reading names containing arbitrary bytes.

This specification extends the permitted characters in names to include multi-byte UTF-8 encoded Unicode and additional printing characters from the US-ASCII alphabet. The first character of a name must be alphanumeric, a multi-byte UTF-8 character, or '_' (reserved for special names with meaning to implementations, such as the “_FillValue” attribute). Subsequent characters may also include printing special characters, except for '/' which is not allowed in names. Names that have trailing space characters are also not permitted.

So that's the NUG text -- and its handling of the Unicode addition is odd (or poorly written, or ...)

Perhaps what they mean by: "a multi-byte UTF-8 character" is actually:

"A Unicode "Letter" character", i.e. (Lu | Ll | Lt)

or maybe all L* code points? or ???

In any case, we certainly don't want control code points in there, and having no ASCII punctuation, but allow other punctuation as the first character makes no sense. and can a name start with a "combining lowline"? (https://unicode-explorer.com/c/0332) [1]

Where would one go to suggest an update to the NUG?

But in the meantime, CF can specify this all more clearly and precisely.

Should we start a new discussion for that, and keep this one (re)focused on the Blacklist?

[1] Just for fun -- here's an experiment:

In [19]: name_with_leading_combining_lowline
Out[19]: '̲aname'

In [20]: print([ord(c) for c in name_with_leading_combining_lowline])
[818, 97, 110, 97, 109, 101]

Notice how when I print the name, it ends up combining leading lowline with the quote character -- fun!

1 reply

Dave-Allured Oct 24, 2024

Despite some recent claims to the contrary, I have found the character set descriptions in the NUG to be perfectly accurate, precise, and up to date. The NUG descriptions take a bit of study and getting used to, but they are right on the mark, including that regex-y thing in the BNF.

Jonathan's recent interpretations are also accurate and precise. I did not check every last character in his lists, but I have no reason for doubt.

The one thing that I have not checked is the NUG versus the actual character guards in the most recent netcdf-C code version. I recall no recent changes, therefore no reason to believe there is any discrepancy. That part of the code is quite readable. So if anyone cares that much, go for it.

Because of the accuracy of current understandings -- ouch -- for those who care to study very carefully -- I suggest do not start a new issue at this time. We already have too much chaos in the character and string arena going on right now. Stay focused.

Dave-Allured · 2024-10-24T00:10:29Z

Dave-Allured
Oct 24, 2024

Here is how to understand "multi-byte UTF-8 character" as used in the NUG. Their abbreviation is MUTF8.

Today's UTF-8 includes byte sequences of 1, 2, 3, and 4 bytes. MUTF8 is ALL legal sequences, except for the 1-byte encodings. If you combine the single-byte sequences with MUTF8, you get the complete UTF-8 set.

0 replies

ChrisBarker-NOAA · 2024-10-24T00:51:11Z

ChrisBarker-NOAA
Oct 24, 2024
Collaborator

"ALL legal sequences" -- fair enough, that's how I interpreted it too -- but allowing ALL of these, including as the leading character of a name, makes no sense at all.

So this:

"The first character of a name must be alphanumeric, a multi-byte UTF-8 character, or '_'"

And as I parse it, you can't use any of the ASCII punctuation marks as a leading charactor, but you can use any non-ascii punctuation charactor, for instance. huh?

In fact, you can use ANY non-ascii "character" -- including combining ones, whitespace, line feeds, other control characters, etc, etc. Really? If you are going to do that, why have any rules at all?

This is reminiscent of the kerfuffle over Unicode as the core string type in Python3 -- the only really challenging problem was file names. (sure there were issues with existing mojibake's data, etc, but those were mostly surmountable).

[I'll bring this around to the topic at hand, I promise]

The big issue was that apparently on nix systems, filenames (paths, etc) are simply stored as a char, and the only special values are null and 47 (/ the ASCII forward slash). This all worked great in the ASCII days, and not too badly in the extended ANSI days (e.g. latin-1, etc, etc...). However, the result was that folks could use pretty much any encoding, all on the same file system, and there was no way to know what the encoding was for any given path.

And all that is totally fine if all you need to do is pass a char* around, split on the slash, and maybe compare to other filenames.

And that all worked fine in Python2, where a string was simply a null-terminated string of bytes (i.e. a char*).

Enter Python3 and Unicode -- now you had to decode what's in the char* in order for Python to. be able to store it in a string. and that's not possible if you don't know the encoding.

This was a very long kerfuffle -- with folks writing, e.g. unix utilities, saying, 'why can't I just pass around the pile of bytes? I don't care what characters they actually mean -- within the code, it's just a pile of bytes.

And within the code, sure -- who cares? But what happens when you want to read that filename from a file? (or a web service) or write it to a text file, or show it to a person on the screen, or ?? The fact is, that outside of a computer program, filenames are text, and it's really helpful to have them be well described, human readable, etc...

Back to the topic at hand:

The NUG has selected utf-8 (and NFC normalization) so at least that's not a problem. And I can easily write code that can work with variable names, attribute. names, etc with any old code points in them -- (I use Python, so if it is valid utf-8, it can be decoded into a Python string, and I can do all. sorts of stuff with it -- no problem) -- other systems could work directly with the utf-8 encoded bytes.

But for CF -- we want files to be both computer and human readable -- an ncdump of the file should be comprehendible (and not trash your terminal settings).

And for THAT, it's a good idea to put some restrictions on allowable code points.

BTW -- my idea to start another discussion was so that we could focus this one on only the Blacklist idea.

0 replies

Creating a blacklisting certain characters from variable and attribute names #323

larsbarring May 31, 2024 Collaborator

Topic for discussion

Replies: 34 comments · 36 replies

davidhassell May 31, 2024 Maintainer

larsbarring May 31, 2024 Collaborator Author

efisher008 Jun 13, 2024 Maintainer

sethmcg Jun 13, 2024 Collaborator

ChrisBarker-NOAA Jun 13, 2024 Collaborator

sethmcg Jun 14, 2024 Collaborator

ChrisBarker-NOAA Jun 14, 2024 Collaborator

sethmcg Jun 14, 2024 Collaborator

larsbarring Jun 17, 2024 Collaborator Author

larsbarring Oct 4, 2024 Collaborator Author

JonathanGregory Oct 4, 2024 Maintainer

DocOtak Oct 5, 2024 Maintainer

sethmcg Oct 8, 2024 Collaborator

ChrisBarker-NOAA Oct 8, 2024 Collaborator

DocOtak Oct 8, 2024 Maintainer

sethmcg Oct 15, 2024 Collaborator

DocOtak Oct 16, 2024 Maintainer

larsbarring Oct 8, 2024 Collaborator Author

larsbarring Oct 8, 2024 Collaborator Author

sethmcg Oct 8, 2024 Collaborator

larsbarring Oct 8, 2024 Collaborator Author

larsbarring Oct 8, 2024 Collaborator Author

ChrisBarker-NOAA Oct 8, 2024 Collaborator

larsbarring Oct 8, 2024 Collaborator Author

JonathanGregory Oct 8, 2024 Maintainer

larsbarring Oct 8, 2024 Collaborator Author

JonathanGregory Oct 10, 2024 Maintainer

JonathanGregory Oct 14, 2024 Maintainer

ethanrd Oct 15, 2024 Maintainer

larsbarring Oct 16, 2024 Collaborator Author

sethmcg Oct 16, 2024 Collaborator

JonathanGregory Oct 16, 2024 Maintainer

JonathanGregory Oct 16, 2024 Maintainer

JonathanGregory Oct 16, 2024 Maintainer

larsbarring Oct 22, 2024 Collaborator Author

larsbarring Oct 22, 2024 Collaborator Author

larsbarring Oct 23, 2024 Collaborator Author

ChrisBarker-NOAA Oct 22, 2024 Collaborator

JonathanGregory Oct 23, 2024 Maintainer

ChrisBarker-NOAA Oct 23, 2024 Collaborator

ChrisBarker-NOAA Oct 23, 2024 Collaborator

ChrisBarker-NOAA Oct 23, 2024 Collaborator

ChrisBarker-NOAA Oct 23, 2024 Collaborator

ChrisBarker-NOAA Oct 24, 2024 Collaborator

larsbarring
May 31, 2024
Collaborator

Replies: 34 comments 36 replies

davidhassell
May 31, 2024
Maintainer

larsbarring May 31, 2024
Collaborator Author

efisher008 Jun 13, 2024
Maintainer

sethmcg Jun 13, 2024
Collaborator

ChrisBarker-NOAA
Jun 13, 2024
Collaborator

sethmcg
Jun 14, 2024
Collaborator

ChrisBarker-NOAA
Jun 14, 2024
Collaborator

sethmcg
Jun 14, 2024
Collaborator

larsbarring
Jun 17, 2024
Collaborator Author

larsbarring
Oct 4, 2024
Collaborator Author

JonathanGregory
Oct 4, 2024
Maintainer

DocOtak
Oct 5, 2024
Maintainer

sethmcg Oct 8, 2024
Collaborator

ChrisBarker-NOAA Oct 8, 2024
Collaborator

DocOtak Oct 8, 2024
Maintainer

sethmcg Oct 15, 2024
Collaborator

DocOtak Oct 16, 2024
Maintainer

larsbarring
Oct 8, 2024
Collaborator Author

larsbarring Oct 8, 2024
Collaborator Author

sethmcg Oct 8, 2024
Collaborator

larsbarring Oct 8, 2024
Collaborator Author

larsbarring
Oct 8, 2024
Collaborator Author

ChrisBarker-NOAA
Oct 8, 2024
Collaborator

larsbarring Oct 8, 2024
Collaborator Author

JonathanGregory
Oct 8, 2024
Maintainer

larsbarring Oct 8, 2024
Collaborator Author

JonathanGregory Oct 10, 2024
Maintainer

JonathanGregory
Oct 14, 2024
Maintainer

ethanrd
Oct 15, 2024
Maintainer

larsbarring
Oct 16, 2024
Collaborator Author

sethmcg Oct 16, 2024
Collaborator

JonathanGregory
Oct 16, 2024
Maintainer

JonathanGregory
Oct 16, 2024
Maintainer

JonathanGregory
Oct 16, 2024
Maintainer

larsbarring
Oct 22, 2024
Collaborator Author

larsbarring Oct 22, 2024
Collaborator Author

larsbarring Oct 23, 2024
Collaborator Author

ChrisBarker-NOAA
Oct 22, 2024
Collaborator

JonathanGregory
Oct 23, 2024
Maintainer

ChrisBarker-NOAA
Oct 23, 2024
Collaborator

ChrisBarker-NOAA
Oct 23, 2024
Collaborator

ChrisBarker-NOAA Oct 23, 2024
Collaborator

ChrisBarker-NOAA
Oct 23, 2024
Collaborator

ChrisBarker-NOAA
Oct 24, 2024
Collaborator