
Do NFKC normalization in lexer #2253

Closed
catamorphism opened this issue Apr 19, 2012 · 15 comments
Labels
A-frontend (Area: frontend: errors, parsing and HIR), A-unicode (Area: Unicode), C-enhancement (Category: an issue proposing an enhancement or a PR with one), I-crash (Issue: the compiler crashes), P-low (low priority)

Comments

@catamorphism
Contributor

next_token_inner in the lexer has a FIXME comment saying to do NFKC normalization. I have no idea what that is, but I guess we should do it.

Reference: NFKC is one of the four Unicode Normalization Forms.

Reference: UAX #31 supplies guidelines for the use of normalization with identifiers.
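
For concreteness, here is a minimal sketch of what NFKC-normalizing an identifier during lexing could look like. It uses the third-party unicode-normalization crate and is purely illustrative; it is not what the lexer does today.

```rust
// A sketch, not rustc internals: NFKC-normalize an identifier token
// using the third-party `unicode-normalization` crate.
use unicode_normalization::UnicodeNormalization;

fn normalize_ident(ident: &str) -> String {
    // `nfkc()` iterates the chars in Normalization Form KC.
    ident.nfkc().collect()
}

fn main() {
    // U+FB01 LATIN SMALL LIGATURE FI folds to the two letters "fi".
    assert_eq!(normalize_ident("\u{FB01}le"), "file");
    // U+00B5 MICRO SIGN folds to U+03BC GREEK SMALL LETTER MU.
    assert_eq!(normalize_ident("\u{00B5}s"), "\u{03BC}s");
}
```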

@graydon
Contributor

graydon commented Apr 20, 2012

Actually, I've changed my mind on this and no longer think we should NFKC-normalize here. But check the Python Unicode-identifiers PEP (PEP 3131) to confirm they don't either.

@marijnh
Contributor

marijnh commented Apr 24, 2012

Python does normalize. See http://bugs.python.org/issue10952 for some relevant discussion.

@graydon
Contributor

graydon commented Apr 24, 2012

Sadness!

And double-sadness on the filesystem thing. Though I don't really understand what the commenter is getting at in http://bugs.python.org/issue10952#msg126592 ... wouldn't the filename in the filesystem (however it's encoded) NFKC-normalize to the NFKC-normalized string the user asked for?

In any case, I think we can probably dodge whatever bullet is there (if there is one) due to us doing our linkage by relatively open-ended scanning plus matching metadata tags. I don't think it should ever be a bug to consider "too many crates" in the search path; including a directory in the search path should open up the possibility of the compiler looking at all crates in that dir as part of its scan.

@bblum
Contributor

bblum commented Jul 3, 2013

Revisiting for triage; all I have to add is the sparkling new A-unicode tag.

@pnkfelix
Member

Visiting for triage (email from 2013-08-26).

(Added link to definition of NFKC in unicode spec.)

@pnkfelix
Member

Regarding Graydon's question about the filesystem problem: My interpretation of the problem is that you have this scenario where:

  1. You have some string of characters, e.g. F, which NFKC-normalizes to F', where F != F'. (In the example, F = \xB5Torrent and F' = \u03BCTorrent.)
  2. The compiler sees an occurrence of F in the source code and normalizes it to F'
  3. But the filesystem is actually incapable of representing F' as a filename. In its internal representation of the directory structure, it has F.
  4. So when the compiler goes to look for the post-normalized F', it "cannot find it", since the filesystem does not (and will never) have an entry for F'.

Hypothetically, I guess the compiler could attempt to invert the normalization process during the filesystem lookup and query the filesystem for every filename E such that E normalizes to F'? (Maybe that's what Graydon meant when he said he did not understand the problem; maybe he was assuming that whatever component was doing the filesystem query could walk over a relatively small set of candidate files and NFKC-normalize their names during the lookup?)
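
To make the scenario concrete, here is a small sketch (again using the third-party unicode-normalization crate, for illustration only) showing that F and F' really do differ byte-for-byte even though they render identically:

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    let f = "\u{00B5}Torrent";                // F: starts with MICRO SIGN
    let f_prime: String = f.nfkc().collect(); // F': starts with GREEK SMALL LETTER MU
    assert_eq!(f_prime, "\u{03BC}Torrent");
    // A filesystem that stores F verbatim has no entry named F'.
    assert_ne!(f, f_prime);
}
```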

In any case, don't we always have the option of putting explicit names for the modules we import into attribute strings, which (AFAIK) should not be NFKC-normalized? And therefore someone should always be able to work around this problem if it arises for their particular filesystem? (Isn't that right? We're only talking here about NFKC normalization of identifier tokens, not of string literals, right? That's certainly the only place the FIXME occurs in lexer.rs.)

(Also, the above scenario may just be an artificial problem. That argument was certainly made on that thread on Python Issue 10952.)


There is also discussion of NFKC/NFC normalization for identifiers, in the context of Scheme, here:

http://srfi.schemers.org/srfi-52/mail-archive/msg00019.html

@pnkfelix
Member

My personal instinct has been to do NFC normalization, not NFKC. Yet it seems that experts in both the Python community and the Scheme community have argued for NFKC normalization of identifiers. (As did @nikomatsakis during the triage meeting, I think.)

Well, maybe for Scheme that makes more sense, at least on implementations that also case-fold; that was what came to mind when I read the statement here:

http://bugs.python.org/issue10952#msg182911

According to that note, fullwidth and halfwidth variants are commonly used and recognized as different characters in Japanese environments, and the commenter was thinking it odd to distinguish upper- and lower-case while conflating full- and half-width. (Anyone here familiar with Japanese environments? I'd love to get better understanding of this.)
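
For reference, here is a small demonstration of the width-folding in question, using the unicode-normalization crate (illustrative only): NFKC conflates width variants while leaving case distinctions alone, which is exactly the asymmetry the commenter found odd.

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // NFKC folds width variants...
    let full: String = "\u{FF21}".nfkc().collect(); // FULLWIDTH LATIN CAPITAL LETTER A
    assert_eq!(full, "A");
    let half: String = "\u{FF76}".nfkc().collect(); // HALFWIDTH KATAKANA LETTER KA
    assert_eq!(half, "\u{30AB}");                   // KATAKANA LETTER KA
    // ...but it is not a case-folding: 'A' and 'a' stay distinct.
    assert_ne!("A".nfkc().collect::<String>(), "a");
}
```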

@pnkfelix
Member

(Also, added a reference to UAX #31 to the description.)

@nikomatsakis
Contributor

My personal instinct has been to do NFC normalization, not NFKC. Yet it seems that experts in both the Python and Scheme communities have argued for NFKC normalization of identifiers. (As did @nikomatsakis during the triage meeting, I think.)

I actually don't claim to argue either way. I feel like this is not something I can have an opinion on, since all my identifiers are ASCII and will almost certainly always be ASCII. But I can certainly see an argument for identical glyphs being equivalent (I can also see an argument for NFC).

@sanxiyn
Member

sanxiyn commented Nov 22, 2013

I am not familiar with Japanese environments (tagging @gifnksm), but for Korean, my understanding is that it is desirable to have U+1100 HANGUL CHOSEONG KIYEOK and U+3131 HANGUL LETTER KIYEOK be equivalent, which means compatibility normalization. The reason is that the same keypress results in U+3131 on Windows but U+1100 on a Mac.
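
A quick sketch confirming this with the unicode-normalization crate (illustrative only): NFKC conflates the two code points, while NFC keeps them distinct.

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // U+3131 HANGUL LETTER KIYEOK has a compatibility decomposition to
    // U+1100 HANGUL CHOSEONG KIYEOK, so NFKC conflates the two keypresses...
    let nfkc: String = "\u{3131}".nfkc().collect();
    assert_eq!(nfkc, "\u{1100}");
    // ...whereas NFC (canonical decomposition only) keeps them distinct.
    let nfc: String = "\u{3131}".nfc().collect();
    assert_eq!(nfc, "\u{3131}");
}
```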

@lilyball
Contributor

lilyball commented Apr 8, 2014

My initial feeling was that we should do NFC and not NFKC, but the fullwidth/halfwidth issue and @sanxiyn's Korean example have changed my mind: I now think NFKC is the appropriate choice.

Regarding the filesystem issue, the only filesystem I know of offhand that normalizes is HFS+, which is fine because it normalizes to NFD, and AFAIK Unicode isn't adding any new composed characters (which means HFS+ won't encounter a composed character that it doesn't know how to decompose). Strictly speaking, it normalizes to a variant of NFD that doesn't match current Unicode (I don't know whether it's pegged to an older Unicode standard or merely has modifications), for backwards-compatibility reasons.

Regardless, on HFS+ we shouldn't have an issue. Are there in fact filesystems out there that impose other character sets on filenames? If so, trying to reverse the normalization seems implausible: besides there being no pre-existing mapping from NFC/NFD output back to all the composed characters that could produce it (note: there are composed characters that NFC will actually decompose), the way normalization reorders combining marks means something like Zalgo text would produce a combinatorial explosion of potential un-normalized names. Instead, if this is actually found to be an issue in practice, I would recommend simply reading the directory, passing all encountered filenames through NFKC normalization, and picking the file that matches. If multiple files match, consider that an error. Naturally, this fallback would only run if no file already matches the normalized identifier.
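
A rough sketch of that fallback, under stated assumptions: it uses the third-party unicode-normalization crate, and find_crate is a hypothetical helper name, not a rustc API. It scans the directory, NFKC-normalizes each file name, and accepts only a unique match.

```rust
// Hypothetical fallback lookup: NFKC-match a wanted name against
// directory entries, treating multiple matches as ambiguous.
use std::fs;
use std::io;
use std::path::PathBuf;
use unicode_normalization::UnicodeNormalization;

fn find_crate(dir: &str, wanted: &str) -> io::Result<Option<PathBuf>> {
    let wanted_nfkc: String = wanted.nfkc().collect();
    let mut matches = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        if name.nfkc().collect::<String>() == wanted_nfkc {
            matches.push(entry.path());
        }
    }
    match matches.len() {
        0 => Ok(None),
        1 => Ok(Some(matches.remove(0))),
        // Distinct names that normalize identically: refuse to guess.
        _ => Err(io::Error::new(io::ErrorKind::Other, "ambiguous crate name")),
    }
}
```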

In any case, I think the right approach is to NFKC-normalize identifiers during lexing; then, if we don't want to solve the filesystem issue, we can introduce either a lint or a feature gate for non-ASCII module names and extern crate names (as no other identifier will map to a filename).


Another point to consider is future compatibility for XID_Start/XID_Continue. There are three approaches we can take:

  1. Use XID_Start/XID_Continue from whatever the latest Unicode is. This will always be backwards-compatible, but it may result in newer versions of rustc accepting identifiers that older versions didn't (as new characters get added to these classes).
  2. Define our identifiers as XID_Start/XID_Continue from a specific version of Unicode. This will be future-compatible, but it means that as Unicode changes to add new characters to these classes, these characters will never be valid Rust identifiers.
    2a. Approach #2, but update the Unicode version whenever we make a backwards-incompatible language change that invalidates most programs (a minor language change that affects few programs is probably not sufficient, but something that requires modifying all but the most trivial of Rust files would be appropriate).
  3. Similar to approach #2. Define our identifiers as:
    • Any character in XID_Start/XID_Continue as of a chosen Unicode version, OR
    • Any character unassigned as of that Unicode version that does not have any of the following properties:
      • Pattern_White_Space=True
      • Pattern_Syntax=True
      • General_Category=Private_Use, Surrogate, or Control
      • Noncharacter_Code_Point=True

These 3 approaches are all suggested by UAX #31 (Unicode Identifier and Pattern Syntax). Approach #3 is the most forward-compatible, but it has the downside of allowing identifiers that would otherwise be invalid.
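
As a point of reference, approach #1 is roughly what the third-party unicode-xid crate enables: validating identifiers against the XID tables of whatever Unicode version the crate currently ships. A sketch (the leading-underscore allowance mirrors Rust's own rule and is not part of UAX #31):

```rust
// Sketch of approach #1: accept identifiers per the XID tables shipped
// by the `unicode-xid` crate (i.e. one specific Unicode version).
use unicode_xid::UnicodeXID;

fn is_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        // '_' is XID_Continue but not XID_Start; Rust allows it to lead.
        Some(c) if c.is_xid_start() || c == '_' => chars.all(|c| c.is_xid_continue()),
        _ => false,
    }
}

fn main() {
    assert!(is_identifier("naïve"));
    assert!(!is_identifier("1bad")); // digits are XID_Continue, not XID_Start
}
```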

#1 is obviously the easiest. And personally, I think #1 is probably OK. If you want to use identifier characters added in future versions of Unicode in your project, you just have to be OK with requiring a minimum version of rustc to compile it. And if this is a big worry, we can always add a lint to catch such newly added characters.

@emberian
Member

emberian commented Jul 3, 2014

Non-ASCII identifiers are feature-gated; I believe this is not 1.0. Nominating.

@emberian
Member

emberian commented Jul 3, 2014

And by not 1.0, I mean P-backcompat-lang.

@pnkfelix
Member

pnkfelix commented Jul 3, 2014

Assigning P-low, not 1.0.

bors added a commit that referenced this issue Aug 5, 2014
The reference manual said that code is interpreted as UTF-8 text and an implementation will normalize it to NFKC. However, rustc doesn't do any normalization right now.

We may want to do some normalization for symbols, but normalizing the whole text seems harmful, because doing so loses some information even if we choose a non-K variant of normalization.

I'd suggest removing the "normalized to Unicode normalization form NFKC" phrase for the present, so that the manual represents the current state properly. When we address the problem (with an RFC?), the manual should be updated.

Closes #12388.

Reference: #2253
@steveklabnik
Member

I'm pulling together a massive triage effort to get us ready for 1.0. As part of this, I'm moving wishlist-like items to the RFCs repo, as that's where major new things should get discussed/prioritized.

This issue has been moved to the RFCs repo: rust-lang/rfcs#802

ilammy added a commit to ilammy/sabre that referenced this issue Nov 19, 2016
This is a somewhat controversial thing, but I have made a decision:
identifiers should be normalized to fold visual ambiguity, and the
normalization form should be NFKC.

Rationale:

1. Compatibility decomposition is favored over the canonical one because
   it provides useful folding for letter ligatures, fullwidth forms,
   certain CJK ideographs, etc.

2. Compatibility decomposition is favored over the canonical one because
   it provides more protection from visual spoofing.

3. A standard Unicode transformation should be favored over anything
   ad hoc because it's predictable and more mature.

4. Normalization is a compromise between freedom of expression and
   ease of implementation. Source code is not prose; there are rules.

Here are some references to other languages:

    SRFI 52: http://srfi-email.schemers.org/srfi-52/
    Julia:   JuliaLang/julia#5434
    Python:  http://bugs.python.org/issue10952
    Rust:    rust-lang/rust#2253

Unfortunately, there aren't very many precedents and open discussions
about Unicode usage in programming languages, especially in languages
with very permissive identifier syntax (like Scheme).

Aside from identifiers there are more places where Unicode can be used:

* Characters are not normalized, not even to NFC. This might have been
  useful, for example, to recompose combining marks, but unfortunately
  NFC may do more transformations than that, so it's a no-go. We preserve
  the exact Unicode character.

* Character names, on the other hand, are case-sensitive identifiers,
  so they are normalized as such.

* Strings and escaped identifiers are left untouched in order to preserve
  the exact spelling as in the source code.

* Directives are case-insensitive identifiers and are normalized as such.

* Numbers should be composed from ASCII only, so they are not normalized.
  Sometimes this produces weird parses because characters that look like
  signs are not treated as such. However, these characters are invalid in
  numbers, so it's somewhat justified.

* Peculiar identifiers are shit. I'm sorry. Because of NFKC it is possible
  to write a plain, unescaped identifier that will parse as a number after
  going through NFKC. It may even look exactly like a number without being
  one. There is not much we can do about this, so we produce a warning
  just in case. (A sketch of this hazard follows after this message.)

* Datum labels are mostly numbers, so they are not normalized either.
  Note that sometimes they can be treated as numbers with an invalid prefix.

* Comments are ignored.

* Delimiters should be ASCII-only. No discussion on this. Unicode has
  various fancy whitespaces and line separators, but this is source
  code, not a rich text document in a word processor.

Also, case-folding is currently performed only for the ASCII range.
Identifiers should use the NFKC_Casefold transformation. It will be
implemented later.
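
A sketch of the peculiar-identifier hazard mentioned in the commit message above, using the unicode-normalization crate (illustrative only): fullwidth digits are not ASCII digits, so a permissive lexer can accept them as a plain identifier, yet NFKC turns them into an ordinary number.

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // U+FF11..U+FF13 are FULLWIDTH DIGITs: not ASCII digits, so a
    // permissive reader can lex "１２３" as a plain identifier...
    let peculiar = "\u{FF11}\u{FF12}\u{FF13}";
    let normalized: String = peculiar.nfkc().collect();
    // ...but after NFKC it is indistinguishable from the number 123.
    assert_eq!(normalized, "123");
}
```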