Blacklisting certain characters from variable and attribute names #323
Replies: 34 comments 36 replies
-
Surely we don't want to disallow underscores!
-
Hmm -- I like this idea. But first I think we should make clear what the (long-term) goal is: Unicode is very complex, with a lot of subtleties. There are efforts to manage that with normalization (https://www.unicode.org/reports/tr15/), and categorization of code points ("General Category. Partition of the characters into major classes such as letters, punctuation, and symbols, and further subclasses for each of the major classes."). Etc. So I think we have essentially three options:
I think the whole point of this discussion is that we don't want to do (1) anymore. For (2) -- it seems appealing, but there's a lot of complexity, e.g. (from the Unicode spec)
So it can get messy. Nevertheless, there is precedent -- for instance, Python has the following rules: https://docs.python.org/3/reference/lexical_analysis.html#identifiers A bit messy, but do-able. However, there are still a number of complications -- one is NFKC normalization, and another is that Python treats some different Unicode characters as equivalent (e.g. Blackboard Bold "B" U+1D539 is the same as capital B U+0042) -- but only in contexts where the normalization is done (e.g. when processing source code, but not when meta-programming). Frankly, it's a bit of a mess if people really do use the broad range of allowable characters.

That being said, I think that the CF problem is easier than Python's, as CF isn't providing normalization -- only enforcement. I'm inclined (at the moment -- I haven't thought it through too carefully) to go with (3) -- allow any Unicode code point except a given blacklist. Note that I say code point, not character, as some characters can be represented by different code points (e.g. accented characters). If we simply do "code point", then there is no issue of normalization, or anything else. (Hmm, option 3(b) -- any code point, but in a particular normalization?) Though maybe that's too much of a wild west?
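To make the normalization subtlety concrete, here is a small illustration (standard library only, my own example) of how NFKC folds the Blackboard Bold "B" into the plain capital B, while NFC leaves it alone:

```python
import unicodedata

bb_b = "\U0001D539"  # MATHEMATICAL DOUBLE-STRUCK CAPITAL B

# NFKC applies compatibility mappings, so the styled letter becomes ASCII 'B'
print(unicodedata.normalize("NFKC", bb_b))            # 'B'

# NFC only composes canonically; the styled letter is left unchanged
print(unicodedata.normalize("NFC", bb_b) == bb_b)     # True

# Accented characters are where NFC matters: 'e' + combining acute
# composes to the single precomposed code point U+00E9
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")  # True
```

This is why "same code point" and "same character" can diverge, and why Python only sees the two B's as the same identifier after its own NFKC step.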
-
I'd like to course-correct the discussion a bit, if I may. This is not a proposal to expand the list of allowed characters in a wide-reaching way. That's what #237 is about, and a number of folks (including me and Lars) concluded that it would be unwise; there are a lot of security and interoperability concerns that make it important to consider any expansions of the list carefully and cautiously before adding them. I believe what Lars is proposing is that we add an explicit, stand-alone listing of the sets of banned and allowed characters, rather than only having them defined implicitly in the text of section 3.2. I can see the value in that, but I think we shouldn't frame it as a list of banned characters, because that implies that anything not on the list is allowed, and as discussed in #237, there are important reasons that the default answer for whether a character is allowed should be "no". I think we should have an explicit list of allowed characters, with an accompanying list (maybe an extra column) of clarifications to cover the known disallowed characters that Lars suggests. So maybe something like:
-
@sethmcg: sorry about that -- I think it was me that expanded the conversation. However, the reason I did that is that I don't see how we can talk about a blacklist without the context of what's allowed, so I was trying to get at that. However, I think maybe I get it now -- this proposal for a "blacklist" is more internal, to clearly define the rules now and to guide any potential expansion in the future -- e.g.: whatever we do, we won't allow THESE characters :-) I see the point of that, so carry on :-) To that point:
I find this odd to say -- are ANY other non-ASCII characters -- any number of other symbols, punctuation, etc. -- allowed? I think I get the point here, but it's an odd phrasing. I think the point is that folks may be tempted to (or accidentally) use another symbol that "looks like" a dash. In fact, I've had that issue in a totally different context, where something was copied and pasted from an application that had (helpfully) auto-changed an ASCII dash to an en dash.

So I don't see this as a blacklist so much as a "be cautious of these" list -- at least in that example. Which I do think is good to document. The real blacklist is the characters that will break other aspects of CF / netCDF (e.g. have special meaning in CDL).

-CHB
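As an aside, the "looks like a dash" problem can be checked mechanically: Unicode gives dash-like punctuation the general category Pd, which is one way a tool could flag lookalikes. A small sketch (my own, not part of any CF tooling):

```python
import unicodedata

def dash_like(ch):
    # True for any code point in the "Pd" (dash punctuation) category
    return unicodedata.category(ch) == "Pd"

# hyphen-minus, en dash, em dash, minus sign
for ch in "-\u2013\u2014\u2212":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: dash_like={dash_like(ch)}")
```

Note that U+2212 MINUS SIGN is category Sm (math symbol), not Pd, so a real lookalike check would probably want to cover a few categories, not just Pd.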
-
I hadn't thought about compiling the list of characters that we definitely don't want to add for various technical reasons, just to have a consolidated reference for what they are and why they're banned. I agree that that would be a very useful thing to have, but I'm not sure about adding it to CF proper. I worry that people would see it and think of it as the complete list of all disallowed characters, and that everything else is allowed. Maybe we want to have that list, but make it an adjunct document of some kind, like the Guidelines for Constructing Standard Names? Or put it in an appendix?
-
Wow, I was away from this issue for a few days, during which there has been a lot of activity and good points. What I had in mind when opening this discussion was a rather modest extension to section 2.3, where the relevant part reads
Essentially this allows, as a recommendation, the US-ASCII (or their Unicode counterpart) letters, digits and underscore, as well as period and hyphen for attribute names. All other characters are implicitly not recommended (or "should not"), but not explicitly excluded or forbidden.

What I had in mind was to marginally reduce this huge list of not-recommended characters by explicitly disallowing the few characters that we already now know will create problems. So far I am aware of the following, all within the US-ASCII character set: control characters (decimal 0 ... 31, 127). Based on this, my simplistic suggestion is to add a sentence immediately after the text cited above, something like
In this minimal way we avoid all complications in relation to Unicode, and focus on those few characters we all agree, I think, cannot be used. All other punctuation (whether ASCII or Unicode), Unicode control characters and whatnot remain as is, which basically means to be sorted out in the future.
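For illustration, a check for this minimal blacklist could look like the sketch below. The exact character set is my assumption drawn from this thread (ASCII controls plus the space and CDL/path delimiters mentioned elsewhere), not agreed CF text:

```python
# Hypothetical minimal blacklist: ASCII control characters (0-31, 127)
# plus the delimiters identified in this discussion
BLACKLIST = {chr(c) for c in range(0x00, 0x20)} | {chr(0x7F), " ", "/", ":", "\\"}

def name_ok(name):
    # A name passes if it contains no blacklisted character
    return not any(ch in BLACKLIST for ch in name)

print(name_ok("air_temperature"))   # True
print(name_ok("air temperature"))   # False: contains a space
print(name_ok("air/temp"))          # False: '/' is the netCDF group separator
```

Everything not on the list (Unicode punctuation, symbols, etc.) would remain merely "not recommended", matching the minimal intent above.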
-
I have now explored this in some more detail using a Python script to insert various Unicode characters into the variable name in a small .cdl file and then use ncgen to generate a .nc file. In the same script I used NCO/ncrename to try to change the same character of a variable name in a working .nc file to all other characters in the list, and then used ncdump to create a .cdl file. Thus it is not a full round trip because of the NCO step. I focussed on ASCII (decimal 0 - 127), ISO/IEC 8859-1 (decimal 0 - 255) and control (C1), as well as Unicode whitespace (WS) groups (all according to Wikipedia). Here is the result:
In doing this I used the most recent released version of the netCDF library tools (netCDF library version 4.9.2 of Jun 6 2024 10:57:38). With respect to ASCII, I think that this is a pretty strong indication of which characters (groups) should not be accepted in variable and attribute names. And, yes, I do think that it is better to be explicit about this and expressly rule out those characters we know are likely to cause problems, because the CF conventions are all about data exchange and interoperability. I think that it would be good to get such a statement into CF-1.12, what do you think? ping @sethmcg @ChrisBarker-NOAA @JonathanGregory @ethanrd @Dave-Allured @DocOtak @davidhassell
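For reference, the shape of such a test can be sketched as below. The file layout and helper names are my own invention, not Lars's actual script (which is attached later in the thread), and ncgen is only invoked if it is actually installed:

```python
import os
import shutil
import subprocess
import tempfile

def make_cdl(varname):
    # Minimal CDL file with the candidate character(s) embedded in the
    # variable name (hypothetical layout, not the script's actual one)
    return (
        "netcdf test {\n"
        "dimensions:\n  x = 1 ;\n"
        "variables:\n  float %s(x) ;\n"
        "data:\n  %s = 1 ;\n"
        "}\n" % (varname, varname)
    )

def ncgen_accepts(varname):
    # Write the CDL to a temp file and see whether ncgen can compile it
    with tempfile.NamedTemporaryFile("w", suffix=".cdl", delete=False) as f:
        f.write(make_cdl(varname))
        path = f.name
    try:
        r = subprocess.run(["ncgen", "-o", path + ".nc", path],
                           capture_output=True)
        return r.returncode == 0
    finally:
        os.remove(path)

if shutil.which("ncgen"):
    print("plain name accepted:", ncgen_accepts("temp"))
    print("name with colon accepted:", ncgen_accepts("te:mp"))
else:
    print(make_cdl("temp"))
```

Looping `ncgen_accepts` over a range of code points gives an acceptance table like the one Lars reports, modulo the ncrename leg of the round trip.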
-
Dear @larsbarring et al. Thanks for your thorough investigation, Lars, and thanks everyone for the discussion. The text which Lars quoted above is not the working version. Following conventions issue #237, section 2.3 now reads
which is consistent with the conformance document. That is, as Lars says, we recommend against a lot of characters: all characters except letters, digits, underscores and (for attributes only) ASCII 2D (hyphen).

In the discussion of conventions #237 we agreed that all characters are allowed, despite the recommendation (which is not a requirement) not to use the majority of them. Lars commented that the CF conventions "essentially provide a whitelist of explicitly allowed characters. All other characters are not recommended (or recommended against) but not explicitly disallowed. But throughout this conversation there have been several remarks that some characters should indeed be explicitly disallowed. This could easily be done by ... creating a blacklist." That's what this discussion is about, if I understand correctly.

The last sentence of the working text as above is unsatisfactory, despite #237, because it says

The working text is also unsatisfactory because it implies that the NUG prohibits some characters ("it allows almost all Unicode characters ...") but it doesn't say which ones are not allowed. NUG Appendix B says that names should match the regular expression
I suppose we should understand the regular expression to begin with

Since ASCII is a subset of UTF-8, I think that by "multibyte UTF-8 encoded", the NUG must mean a Unicode character which is encoded in more than one byte by UTF-8. That is, MUTF8 doesn't include one-byte characters, among them the ASCII characters 00-7F. Do you agree? If that's correct, the NUG does not allow

I think we should explicitly state that we prohibit 00-1F,

Also, the CF working text is inconsistent with the NUG in saying "It is recommended that variable, dimension, attribute and group names begin with a letter". This is not merely a recommendation, because the NUG says that names must begin with a letter, digit, underscore or multi-byte UTF-8 character. We should fix this. Our text currently implies it's OK to start a name with a punctuation mark, for instance, which the NUG prohibits.

Lars's experiment shows that

I think it would be reasonable for CF to prohibit all those characters which

We've decided to allow

Best wishes

Jonathan
-
Looking over this and the long original question: is it worth separating variables into two categories, variables meant to be interpreted in a CF way, and variables that are not? I'm of the opinion that variable names basically don't matter and that all of the actual information is going to be inside the attribute values. I would propose that for variables that are intended to be interpreted as CF variables, we are very restrictive: ASCII letters

I think that adding
-
A couple of further comments to my analysis and to the subsequent comments/responses:
However, allowing but not recommending all characters not explicitly disallowed by NUG is problematic for the following reasons:
I suggest that these four points should form the basis for creating a "blacklist" of characters that CF explicitly disallows despite their being allowed by the NUG. In principle this is a breaking change of what we previously agreed on in cf-conventions/#237, which still belongs to the current draft version; in practice, the suggested characters to blacklist are typically not the ones one would expect to be prime targets for users to include in new files.
-
On a partly different aspect, @JonathanGregory commented
I fully agree. The question is how to fix it. @DocOtak noted
which refers to this issue, and in particular this comment. Before we fix this particular sentence I think we should get some input regarding their views. I will shortly make a comment over there.
-
I think the NUG special characters are the "blacklist" -- the rest could (should) be defined in terms of the Unicode categories: https://www.compart.com/en/unicode/category e.g. no control characters. (ASCII DEL is a control character.) And maybe it's time now to specify which categories are allowed / not allowed?
-
Dear all
Best wishes Jonathan
-
Please note that @Dave-Allured has opened conventions issue 548 to delete the sentence, "ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only." in Sect 2.3. This sentence was inserted into the working version by conventions issue 477 for various reasons, including to support IETF BCP 47 language tags, discussed in conventions issue 528, which is still ongoing. If Dave's proposal is accepted, the characters allowed for attribute names will be the same as for variable names in CF 1.12, which is the same as in CF 1.11, the most recently released version.
-
Hi all - Sorry I'm late to this discussion. A few thoughts as I'm starting to catch up:

Please DO NOT consider the NUG a reliable source for Unicode information. The sections that mention Unicode were written some time ago (2008) and without an in-depth understanding of Unicode. I do feel confident saying the intent at the time was that the names of all netCDF objects (dimension, variable, attribute, group, etc.) should be valid UTF-8 strings that are NFC normalized and do not contain any control characters.

I believe the netCDF-C library validates that names are NFC-normalized UTF-8 strings without control characters (in the ASCII range) when creating a new netCDF dataset, but not when reading (and maybe not when renaming). I believe the netCDF-Java library behaves in a similar manner, though I haven't tested it as much.

I agree with the comment above from Chris @ChrisBarker-NOAA about using Unicode categories (list) to specify allowed and/or not allowed characters. Also an earlier comment about reviewing other documents on Unicode for identifiers/names, e.g., how the Python Language defines the syntax for Identifiers.
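If it helps, the NFC requirement described here can be checked directly with the Python standard library (`unicodedata.is_normalized` exists from Python 3.8), e.g.:

```python
import unicodedata

name_nfc = "caf\u00e9"    # 'é' as the single precomposed code point: NFC form
name_nfd = "cafe\u0301"   # 'e' + COMBINING ACUTE ACCENT: not NFC

# is_normalized checks the form without building the normalized copy
print(unicodedata.is_normalized("NFC", name_nfc))   # True
print(unicodedata.is_normalized("NFC", name_nfd))   # False

# The two spellings normalize to the same NFC string
print(unicodedata.normalize("NFC", name_nfd) == name_nfc)  # True
```

A validator with the intent described above would accept the first name and reject (or normalize) the second.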
-
There seems to be reasonable convergence towards creating a list of characters that CF explicitly disallows, and in doing so also updating the text regarding which characters CF explicitly recommends. If there is agreement, I can move on to create an issue over in the
-
Dear @larsbarring et al. @Dave-Allured mentions "the full allowed character set of the underlying file format". We were taking NUG as defining the character set, but Ethan's comment suggests that the NUG shouldn't be relied on concerning the character set. Therefore it seems even more important that CF should be crystal clear, both about what netCDF allows and disallows, and about any further restrictions of CF (grey or black). I have been searching the ncgen code to find the actual definition of characters allowed in new netCDF names (of dimensions, variables and attributes), and I think that it may be the one in https://github.com/Unidata/netcdf-c/blob/main/ncgen/ncgen.l, at lines 183 and 198. Of course, I won't be surprised if I've got the wrong lines! The ones in question are
Here,
Digression: to produce a backtick within backticks e.g.
For reading existing netCDF, I think the relevant file is https://github.com/Unidata/netcdf-c/blob/main/libdispatch/dstring.c, which perhaps was formerly called The rule for the first character is the same as for creation of new names. The rule for the following characters is the same as well except that space is allowed. For characters after the first, the code has
and its

I hope that helps clarify the netCDF rules, which are our basis. Best wishes Jonathan
-
The netCDF rules are black and white; either a character is allowed in a certain position in a name or it's not, except for the slight difference between reading and writing. I suggest that the CF text should state the following as the netCDF rules. Assuming I've got those rules right (above), the intersection of what it allows for reading and writing is:
Having stated the netCDF rules we can compare them with what CF says. The first paragraph of Sect 2.3 is currently
The last sentence would be deleted by conventions issue 548. Please comment in that issue if you agree or disagree with removing it. Let's assume in the following that we delete it, so that the CF rules for attribute names are the same as for other names. As we've discussed already, CF has a greylist of characters which are permitted but not recommended. By comparison of the above with the netCDF rules, the greylist is:
This issue is considering whether we want to move any characters among the CF whitelist (NetCDF permitted minus CF greylist), greylist and blacklist (NetCDF prohibited). Lars has proposed or suggested that
I'm essentially restating what Lars has said, but perhaps it's clearer in the context of netCDF. @Dave-Allured disagrees with blacklisting any character allowed by netCDF. Best wishes Jonathan
-
Seth writes
Although this discussion is quite long, I think that stating the rules would not take much more space than it already does - a short extra paragraph at most, even if we decide to add to the blacklist as proposed by Lars. So I think it's worth keeping in sect 2.3, where it is now. Personally, I think it would be sensible to blacklist the "special2" and space-like characters, because it would support this one of the CF principles in 1.2:
Cheers, Jonathan
-
To be transparent about what I did in the tests, below is the Python code (beware: I am neither a developer nor a pythonista...) and the necessary files in one .zip file, and the resulting .cdl and .nc files in another .zip file. If you run the Python code you will get a screen printout giving a brief indication of the problems. I ran this in bash on a Linux box, and I am not sure which of the problems encountered arise only from ncgen/ncdump/ncrename, which might be due only to bash/Linux, and which might be due to some interaction between them. Nevertheless, I do think that when recommending ("whitelisting") a set of Unicode characters we need to take into account problems related to any of these sources (irrespective of which operating system is used (some limits may apply...)) and not only what the netCDF-C library accepts.

While I think we have not reached consensus, it seems a majority is leaning towards adding an explicit blacklist. I am not sure if/how to move this discussion forward.

test_cf_unicode_files.zip (~1 Mb)
-
@larsbarring how to move forward ... Some have suggested a condensed graylist version. Would you be able to post a condensed version, or is that too far away from your goal?
-
Re:

This conversation has drifted a bit to the larger issues (whitelist, greylist, etc...) -- but I think it is a good idea to tighten it for now, and yes, do a blacklist.

and

Well, maybe, maybe not -- "the character set business" is a mess, and it does interact with many things of CF concern -- essentially its reason for being -- we want data sets to be easily human- and machine-readable, and unambiguous in their meaning -- and dealing, at some level, with "the character set business" furthers that goal. Related: HDF5 and netCDF have defined that UTF-8 shall be the text encoding. That is a HUGE win over the "any old encoding is OK" world. I'm not sure what we are talking about is that different in concept. And it's easier to relax than to tighten restrictions later.

Back to the blacklist: in addition to the specific characters identified, specifically blacklisting a couple of Unicode classes might make sense, e.g. what Jonathan wrote out a couple of posts back, which I think is:

Separator (Z): line (Zl), paragraph (Zp), space (Zs)

Other (C): control (Cc), format (Cf), not assigned (Cn), private use (Co), surrogate (Cs)

We may also want to limit some (or all) of:

Mark (M): spacing combining (Mc), enclosing (Me), non-spacing (Mn)

(Though I'm a touch confused by what those actually are.) We should probably also mention in the same place that names should be NFC normalized (which is specified in the NUG).
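A category-based blacklist along these lines could look like the sketch below; the banned set (all of Z* and C*) is the one floated in this thread, not settled CF policy:

```python
import unicodedata

# Hypothetical policy: ban all Separators (Zl, Zp, Zs) and all Others
# (Cc, Cf, Cn, Co, Cs), by major-class letter
BANNED_MAJOR = {"Z", "C"}

def allowed(ch):
    # unicodedata.category returns e.g. 'Ll', 'Zs', 'Cc'; the first
    # letter is the major class
    return unicodedata.category(ch)[0] not in BANNED_MAJOR

print(allowed("a"))         # True  (Ll)
print(allowed("\u00e9"))    # True  (Ll: é)
print(allowed(" "))         # False (Zs)
print(allowed("\x7f"))      # False (Cc: ASCII DEL is a control character)
print(allowed("\u200b"))    # False (Cf: ZERO WIDTH SPACE)
```

Whether to also restrict the Mark (M*) categories would be a further decision on top of this.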
-
Dear Lars To summarise my earlier posting, I think we should replace the first paragraph of 2.3 along these lines: The NetCDF interface requires the following for the name of any variable, dimension, attribute and group:
In addition to the NetCDF requirements, in CF
and either
or
The "or" version is the status quo. Whether to adopt the "either" alternative is the main point at issue, I believe. Best wishes Jonathan
-
Wait, really? WTF? That makes absolutely no sense. As I read it, "an NFC-normalised Unicode codepoint encoded in UTF-8 and requiring more than one byte" -- that is EVERY non-ASCII code point in Unicode -- including punctuation, control characters, various whitespace; the list goes on. Very, very odd that they could disallow all the non-letter and non-digit ASCII codepoints, but allow all the non-ASCII ones -- huh? I think it was Ethan that said that the netCDF handling of Unicode should not be considered thoughtful.

Anyway -- we probably should bring this up with the netCDF folks, but in the meantime, CF can be more restrictive, and it absolutely should be. Perhaps we can re-define all this with a more appropriate extension from ASCII to Unicode -- e.g. "control code points are disallowed", "letters are allowed" -- obviously spelled out in the proper language of Unicode.

NOTE: this is distinct from the blacklist issue -- which I do support.
-
Side note: the search on the NUG here: https://docs.unidata.ucar.edu/nug/current/index.html is broken (I get a 404) for me. How do I report that?
-
Back to On Topic: with Google's help I found the relevant text in the NUG:
So that's the NUG text -- and its handling of the Unicode addition is odd (or poorly written, or ...). Perhaps what they mean by "a multi-byte UTF-8 character" is actually "a Unicode "Letter" character", i.e. (Lu | Ll | Lt), or maybe all L* code points? Or ??? In any case, we certainly don't want control code points in there, and having no ASCII punctuation but allowing other punctuation as the first character makes no sense. And can a name start with a "combining lowline"? (https://unicode-explorer.com/c/0332) [1]

Where would one go to suggest an update to the NUG? But in the meantime, CF can specify this all more clearly and precisely. Should we start a new discussion for that, and keep this one (re)focused on the blacklist?

[1] Just for fun -- here's an experiment:
Notice how, when I print the name, it ends up combining the leading lowline with the quote character -- fun!
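The gist of that experiment can be reproduced minimally like this (my own version, not the code from the comment above):

```python
import unicodedata

# COMBINING LOW LINE attaches itself to whatever character precedes it,
# which is why a leading one grabs the quote mark when the string is printed
cll = "\u0332"

print(unicodedata.name(cll))           # COMBINING LOW LINE
print(unicodedata.category(cll))       # Mn: non-spacing mark
print(unicodedata.combining(cll) > 0)  # True: it has a nonzero combining class
print(repr(cll + "name"))              # the mark precedes 'n' in the string
```

Since its category is Mn, a category-based rule (no leading Mark characters, say) would catch this case cleanly.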
-
Here is how to understand "multi-byte UTF-8 character" as used in the NUG. Their abbreviation is MUTF8. Today's UTF-8 includes byte sequences of 1, 2, 3, and 4 bytes. MUTF8 is ALL legal sequences, except for the 1-byte encodings. If you combine the single-byte sequences with MUTF8, you get the complete UTF-8 set.
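Expressed directly in code, that definition is simply "the UTF-8 encoding is longer than one byte":

```python
def is_mutf8(ch):
    # MUTF8 per the definition above: any code point whose UTF-8
    # encoding takes more than one byte
    return len(ch.encode("utf-8")) > 1

print(is_mutf8("A"))            # False: ASCII, one byte
print(is_mutf8("\x7f"))         # False: DEL is still one byte
print(is_mutf8("\u00e9"))       # True: é is two bytes (0xC3 0xA9)
print(is_mutf8("\u20ac"))       # True: € is three bytes
print(is_mutf8("\U0001D539"))   # True: 𝔹 is four bytes
```

Equivalently, MUTF8 is exactly the code points above U+007F, which is why it includes non-ASCII punctuation, whitespace and control characters alike.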
-
"ALL legal sequences" -- fair enough, that's how I interpreted it too -- but allowing ALL of these, including as the leading character of a name, makes no sense at all. So this: "The first character of a name must be alphanumeric, a multi-byte UTF-8 character, or '_'". As I parse it, you can't use any of the ASCII punctuation marks as a leading character, but you can use any non-ASCII punctuation character, for instance. Huh? In fact, you can use ANY non-ASCII "character" -- including combining ones, whitespace, line feeds, other control characters, etc, etc. Really? If you are going to do that, why have any rules at all?

This is reminiscent of the kerfuffle over Unicode as the core string type in Python 3 -- the only really challenging problem was file names. (Sure, there were issues with existing mojibake data, etc, but those were mostly surmountable.) [I'll bring this around to the topic at hand, I promise.] The big issue was that apparently on *nix systems, filenames (paths, etc) are simply stored as a char*, and the only special values are null and 47 (the ASCII forward slash '/'). This all worked great in the ASCII days, and not too badly in the extended ANSI days (e.g. latin-1, etc, etc...). However, the result was that folks could use pretty much any encoding, all on the same file system, and there was no way to know what the encoding was for any given path. And all that is totally fine if all you need to do is pass a char* around, split on the slash, and maybe compare to other filenames. And that all worked fine in Python 2, where a string was simply a null-terminated string of bytes (i.e. a char*).

Enter Python 3 and Unicode -- now you had to decode what's in the char* in order for Python to be able to store it in a string, and that's not possible if you don't know the encoding. This was a very long kerfuffle -- with folks writing, e.g., unix utilities, saying, "why can't I just pass around the pile of bytes? I don't care what characters they actually mean -- within the code, it's just a pile of bytes." And within the code, sure -- who cares? But what happens when you want to read that filename from a file? (Or a web service?) Or write it to a text file, or show it to a person on the screen, or ...? The fact is that outside of a computer program, filenames are text, and it's really helpful to have them be well described, human readable, etc...

Back to the topic at hand: the NUG has selected UTF-8 (and NFC normalization), so at least that's not a problem. And I can easily write code that can work with variable names, attribute names, etc. with any old code points in them (I use Python, so if it is valid UTF-8, it can be decoded into a Python string, and I can do all sorts of stuff with it -- no problem) -- other systems could work directly with the UTF-8 encoded bytes. But for CF -- we want files to be both computer- and human-readable -- an ncdump of the file should be comprehensible (and not trash your terminal settings). And for THAT, it's a good idea to put some restrictions on allowable code points.

BTW -- my idea to start another discussion was so that we could focus this one on only the blacklist idea.
-
Topic for discussion
In #237 it was suggested to substantially relax restrictions on which characters are allowed in variable and attribute names. The conversation is still ongoing and sprinkled in various comments there are examples of characters that should not be allowed, either because they have special meaning in the context of CF or netCDF as such, or otherwise identified as causing problems.
I suggest that we amend the text in section 2.3 to list which character and character ranges CF explicitly disallows, i.e. creating a blacklist. While it may not be possible to identify all characters that should be in such a list (it may even evolve over time) I think that it is helpful to identify those characters that we now know belong to such a list.
So far I believe the following have been identified from the standard ASCII character set: `space`, control characters (decimal 0 ... 31, 127), `/`, `:`, `\`. This blacklist should probably be expanded to also include Unicode control and whitespace ~~and underscore~~ characters.

In addition, double underscores `__` have special meaning in relation to OGC netCDF-LD, specifically for prefixes, and should be mentioned as reserved for that purpose to not create interoperability clashes.
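As a rough illustration only, the proposal above (known-bad ASCII characters, Unicode control and whitespace, and a reserved `__` prefix) could be sketched as a validator. The rules encoded here are the proposal under discussion, not adopted CF text, and the helper name is hypothetical:

```python
import unicodedata

# Known-bad ASCII: space, control characters (0-31, 127), '/', ':', '\'
ASCII_BAD = {chr(c) for c in range(0x20)} | {chr(0x7F), " ", "/", ":", "\\"}

def check_name(name):
    # Return a list of problems; an empty list means the name passes
    problems = []
    if name.startswith("__"):
        problems.append("leading '__' is reserved for netCDF-LD prefixes")
    for ch in name:
        # Also reject all Unicode Separator (Z*) and Other (C*) code points
        if ch in ASCII_BAD or unicodedata.category(ch)[0] in "ZC":
            problems.append("disallowed character U+%04X" % ord(ch))
    return problems

print(check_name("air_temperature"))   # []
print(check_name("__pfx_temp"))        # flags the reserved prefix
print(check_name("air temp"))          # flags U+0020
```

Whatever final form the blacklist takes, a small reference check like this could accompany the convention text so that the rules and the tooling cannot drift apart.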