
Do NFKC normalization in lexer #2253

Closed
catamorphism opened this issue Apr 19, 2012 · 15 comments
Labels
A-frontend (Area: frontend: errors, parsing and HIR), A-unicode (Area: Unicode), C-enhancement (Category: an issue proposing an enhancement or a PR with one), I-crash (Issue: the compiler crashes), P-low (low priority)

Comments

@catamorphism
Contributor

next_token_inner in the lexer has a FIXME comment saying to do NFKC normalization. I have no idea what that is, but I guess we should do it.

Reference: NFKC is one of the four Unicode Normalization Forms.

Reference: UAX #31 supplies guidelines for the use of normalization with identifiers.
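
For concreteness, here is a minimal sketch of what NFKC-normalizing an identifier during lexing could look like. It uses the third-party unicode-normalization crate and is purely illustrative; it is not what the lexer does today.

```rust
// A sketch, not rustc internals: NFKC-normalize an identifier token
// using the third-party `unicode-normalization` crate.
use unicode_normalization::UnicodeNormalization;

fn normalize_ident(ident: &str) -> String {
    // `nfkc()` iterates the chars in Normalization Form KC.
    ident.nfkc().collect()
}

fn main() {
    // U+FB01 LATIN SMALL LIGATURE FI folds to the two letters "fi".
    assert_eq!(normalize_ident("\u{FB01}le"), "file");
    // U+00B5 MICRO SIGN folds to U+03BC GREEK SMALL LETTER MU.
    assert_eq!(normalize_ident("\u{00B5}s"), "\u{03BC}s");
}
```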

@graydon
Contributor

graydon commented Apr 20, 2012

Actually, I've changed my mind on this and no longer think we should NFKC-normalize here. But check the Python Unicode-identifiers PEP (PEP 3131) to confirm they don't either.

@marijnh
Contributor

marijnh commented Apr 24, 2012

Python does normalize. See http://bugs.python.org/issue10952 for some relevant discussion.

@graydon
Contributor

graydon commented Apr 24, 2012

Sadness!

And double-sadness on the filesystem thing. Though I don't really understand what the commenter is getting at in http://bugs.python.org/issue10952#msg126592 ... wouldn't the filename in the filesystem (however it's encoded) NFKC-normalize to the NFKC-normalized string the user asked for?

In any case, I think we can probably dodge whatever bullet is there (if there is one) due to us doing our linkage by relatively open-ended scanning plus matching metadata tags. I don't think it should ever be a bug to consider "too many crates" in the search path; including a directory in the search path should open up the possibility of the compiler looking at all crates in that dir as part of its scan.

@bblum
Contributor

bblum commented Jul 3, 2013

Revisiting for triage; all I have to add is the sparkling new A-unicode tag.

@pnkfelix
Member

Visiting for triage (email from 2013-08-26).

(Added link to definition of NFKC in unicode spec.)

@pnkfelix
Member

Regarding Graydon's question about the filesystem problem: My interpretation of the problem is that you have this scenario where:

  1. You have some string of characters, e.g. F, which NFKC-normalizes to F', where F != F'. (In the example, F = \xB5Torrent and F' = \u03BCTorrent.)
  2. The compiler sees an occurrence of F in the source code and normalizes it to F'
  3. But the filesystem is actually incapable of representing F' as a filename. In its internal representation of the directory structure, it has F.
  4. So when the compiler goes to look for the post-normalized F', it "cannot find it", since the filesystem does not (and will never) have an entry for F'.

Hypothetically, I guess the compiler could attempt to invert the normalization process during the filesystem lookup and query the filesystem for every filename E such that E normalizes to F'? (Maybe that's what Graydon meant when he said he did not understand the problem; maybe he was assuming that whatever component was doing the filesystem query could walk over a relatively small set of candidate files and NFKC-normalize their names during the lookup?)
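
To make the scenario concrete, here is a small sketch (again using the third-party unicode-normalization crate, for illustration only) showing that F and F' really do differ byte-for-byte even though they render identically:

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    let f = "\u{00B5}Torrent";                // F: starts with MICRO SIGN
    let f_prime: String = f.nfkc().collect(); // F': starts with GREEK SMALL LETTER MU
    assert_eq!(f_prime, "\u{03BC}Torrent");
    // A filesystem that stores F verbatim has no entry named F'.
    assert_ne!(f, f_prime);
}
```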

In any case, don't we always have the option of putting explicit names for the modules we import into attribute strings, which (AFAIK) should not be NFKC-normalized? And therefore someone should always be able to work around this problem if it arises for their particular filesystem? (Isn't that right? We're only talking here about NFKC normalization of identifier tokens, not of string literals, right? That's certainly the only place the FIXME occurs in lexer.rs.)

(Also, the above scenario may just be an artificial problem. That argument was certainly made on that thread on Python Issue 10952.)


There is also discussion of NFKC/NFC normalization for identifiers, in the context of Scheme, here:

http://srfi.schemers.org/srfi-52/mail-archive/msg00019.html

@pnkfelix
Member

My personal instinct has been to do NFC normalization, not NFKC. Yet it seems that experts in both the Python community and the Scheme community have argued for NFKC normalization of identifiers. (As did @nikomatsakis during the triage meeting, I think.)

Well, maybe for Scheme that makes more sense, at least on implementations that also case-fold; that was what came to mind when I read the statement here:

http://bugs.python.org/issue10952#msg182911

According to that note, fullwidth and halfwidth variants are commonly used and recognized as different characters in Japanese environments, and the commenter was thinking it odd to distinguish upper- and lower-case while conflating full- and half-width. (Anyone here familiar with Japanese environments? I'd love to get better understanding of this.)
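
For reference, here is a small demonstration of the width-folding in question, using the unicode-normalization crate (illustrative only): NFKC conflates width variants while leaving case distinctions alone, which is exactly the asymmetry the commenter found odd.

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // NFKC folds width variants...
    let full: String = "\u{FF21}".nfkc().collect(); // FULLWIDTH LATIN CAPITAL LETTER A
    assert_eq!(full, "A");
    let half: String = "\u{FF76}".nfkc().collect(); // HALFWIDTH KATAKANA LETTER KA
    assert_eq!(half, "\u{30AB}");                   // KATAKANA LETTER KA
    // ...but it is not a case-folding: 'A' and 'a' stay distinct.
    assert_ne!("A".nfkc().collect::<String>(), "a");
}
```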

@pnkfelix
Member

(Also, added a reference to UAX #31 to the description.)

@nikomatsakis
Contributor

My personal instinct has been to do NFC normalization, not NFKC. Yet it seems that experts in both the Python and Scheme communities have argued for NFKC normalization of identifiers. (As did @nikomatsakis during the triage meeting, I think.)

I actually don't claim to argue either way. I feel like this is not something I can have an opinion on, since all my identifiers are ASCII and will almost certainly always be ASCII. But I can certainly see an argument for identical glyphs being equivalent (I can also see an argument for NFC).

@sanxiyn
Member

sanxiyn commented Nov 22, 2013

I am not familiar with Japanese environments (tagging @gifnksm), but for Korean, my understanding is that it is desirable to have U+1100 HANGUL CHOSEONG KIYEOK and U+3131 HANGUL LETTER KIYEOK be equivalent, which means compatibility normalization. The reason is that the same keypress results in U+3131 on Windows but U+1100 on a Mac.
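
A quick sketch confirming this with the unicode-normalization crate (illustrative only): NFKC conflates the two code points, while NFC keeps them distinct.

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // U+3131 HANGUL LETTER KIYEOK has a compatibility decomposition to
    // U+1100 HANGUL CHOSEONG KIYEOK, so NFKC conflates the two keypresses...
    let nfkc: String = "\u{3131}".nfkc().collect();
    assert_eq!(nfkc, "\u{1100}");
    // ...whereas NFC (canonical decomposition only) keeps them distinct.
    let nfc: String = "\u{3131}".nfc().collect();
    assert_eq!(nfc, "\u{3131}");
}
```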

@lilyball
Contributor

lilyball commented Apr 8, 2014

My initial feeling was that we should do NFC and not NFKC, but the fullwidth/halfwidth issue and @sanxiyn's Korean example have changed my mind: I now think NFKC is the appropriate choice.

Regarding the filesystem issue, the only filesystem I know of offhand that normalizes is HFS+, which is fine because it normalizes to NFD, and AFAIK Unicode isn't adding any new composed characters (which means HFS+ won't encounter a composed character that it doesn't know how to decompose). Strictly speaking, it normalizes to a variant of NFD that doesn't match current Unicode (I don't know whether it's pegged to an older Unicode standard or merely has modifications), for backwards-compatibility reasons.

Regardless, on HFS+ we shouldn't have an issue. Are there in fact filesystems out there that impose other character sets on filenames? If so, trying to reverse the normalization seems implausible: besides there being no pre-existing mapping from NFC/NFD output back to all the composed characters that could produce it (note: there are composed characters that NFC will actually decompose), the way normalization reorders combining marks means something like Zalgo text would produce a combinatorial explosion of potential un-normalized names. Instead, if this is actually found to be an issue in practice, I would recommend simply reading the directory, passing all encountered filenames through NFKC normalization, and picking the file that matches. If multiple files match, consider that an error. Naturally, this fallback would only run if no file already matches the normalized identifier.
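
A rough sketch of that fallback, under stated assumptions: it uses the third-party unicode-normalization crate, and find_crate is a hypothetical helper name, not a rustc API. It scans the directory, NFKC-normalizes each file name, and accepts only a unique match.

```rust
// Hypothetical fallback lookup: NFKC-match a wanted name against
// directory entries, treating multiple matches as ambiguous.
use std::fs;
use std::io;
use std::path::PathBuf;
use unicode_normalization::UnicodeNormalization;

fn find_crate(dir: &str, wanted: &str) -> io::Result<Option<PathBuf>> {
    let wanted_nfkc: String = wanted.nfkc().collect();
    let mut matches = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        if name.nfkc().collect::<String>() == wanted_nfkc {
            matches.push(entry.path());
        }
    }
    match matches.len() {
        0 => Ok(None),
        1 => Ok(Some(matches.remove(0))),
        // Distinct names that normalize identically: refuse to guess.
        _ => Err(io::Error::new(io::ErrorKind::Other, "ambiguous crate name")),
    }
}
```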

In any case, I think the right approach is to NFKC-normalize identifiers during lexing; then, if we don't want to solve the filesystem issue, we can introduce either a lint or a feature gate for non-ASCII module names and extern crate names (as no other identifier will map to a filename).


Another point to consider is future compatibility for XID_Start/XID_Continue. There are three approaches we can take:

  1. Use XID_Start/XID_Continue from whatever the latest Unicode is. This will always be backwards-compatible, but it may result in newer versions of rustc accepting identifiers that older versions didn't (as new characters get added to these classes).
  2. Define our identifiers as XID_Start/XID_Continue from a specific version of Unicode. This will be future-compatible, but it means that as Unicode changes to add new characters to these classes, these characters will never be valid Rust identifiers.
    2a. Approach #2, but update the Unicode version whenever we make a backwards-incompatible language change that invalidates most programs (a minor language change that affects few programs is probably not sufficient, but something that requires modifying all but the most trivial of Rust files would be appropriate).
  3. Similar to approach #2. Define our identifiers as:
    • Any character in XID_Start/XID_Continue as of a chosen Unicode version, OR
    • Any character unassigned as of that Unicode version that does not have any of the following properties:
      • Pattern_White_Space=True
      • Pattern_Syntax=True
      • General_Category=Private_Use, Surrogate, or Control
      • Noncharacter_Code_Point=True

These 3 approaches are all suggested by UAX #31 (Unicode Identifier and Pattern Syntax). Approach #3 is the most forward-compatible, but it has the downside of allowing identifiers that would otherwise be invalid.
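
As a point of reference, approach #1 is roughly what the third-party unicode-xid crate enables: validating identifiers against the XID tables of whatever Unicode version the crate currently ships. A sketch (the leading-underscore allowance mirrors Rust's own rule and is not part of UAX #31):

```rust
// Sketch of approach #1: accept identifiers per the XID tables shipped
// by the `unicode-xid` crate (i.e. one specific Unicode version).
use unicode_xid::UnicodeXID;

fn is_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        // '_' is XID_Continue but not XID_Start; Rust allows it to lead.
        Some(c) if c.is_xid_start() || c == '_' => chars.all(|c| c.is_xid_continue()),
        _ => false,
    }
}

fn main() {
    assert!(is_identifier("naïve"));
    assert!(!is_identifier("1bad")); // digits are XID_Continue, not XID_Start
}
```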

#1 is obviously the easiest. And personally, I think #1 is probably OK. If you want to use identifier characters added in future versions of Unicode in your project, you just have to be OK with requiring a minimum version of rustc to compile it. And if this is a big worry, we can always add a lint to catch such newly added characters.

@emberian
Member

emberian commented Jul 3, 2014

Non-ASCII identifiers are feature-gated; I believe this is not 1.0. Nominating.

@emberian
Member

emberian commented Jul 3, 2014

And by not 1.0, I mean P-backcompat-lang.

@pnkfelix
Member

pnkfelix commented Jul 3, 2014

Assigning P-low, not 1.0.

bors added a commit that referenced this issue Aug 5, 2014
The reference manual said that code is interpreted as UTF-8 text and an implementation will normalize it to NFKC. However, rustc doesn't do any normalization right now.

We may want to do some normalization for symbols, but normalizing the whole text seems harmful, because doing so loses some information even if we choose a non-K variant of normalization.

I'd suggest removing the "normalized to Unicode normalization form NFKC" phrase for the present, so that the manual represents the current state properly. When we address the problem (with an RFC?), the manual should be updated.

Closes #12388.

Reference: #2253
@steveklabnik
Member

I'm pulling together a massive triage effort to get us ready for 1.0. As part of this, I'm moving wishlist-like items to the RFCs repo, as that's where major new things should get discussed/prioritized.

This issue has been moved to the RFCs repo: rust-lang/rfcs#802

ilammy added a commit to ilammy/sabre that referenced this issue Nov 19, 2016
This is a somewhat controversial thing, but I have made a decision:
identifiers should be normalized to fold visual ambiguity, and the
normalization form should be NFKC.

Rationale:

1. Compatibility decomposition is favored over the canonical one because
   it provides useful folding for letter ligatures, fullwidth forms,
   certain CJK ideographs, etc.

2. Compatibility decomposition is favored over the canonical one because
   it provides more protection from visual spoofing.

3. A standard Unicode transformation should be favored over anything
   ad hoc because it's predictable and more mature.

4. Normalization is a compromise between freedom of expression and
   ease of implementation. Source code is not prose; there are rules.

Here are some references to other languages:

    SRFI 52: http://srfi-email.schemers.org/srfi-52/
    Julia:   JuliaLang/julia#5434
    Python:  http://bugs.python.org/issue10952
    Rust:    rust-lang/rust#2253

Unfortunately, there aren't very many precedents and open discussions
about Unicode usage in programming languages, especially in languages
with very permissive identifier syntax (like Scheme).

Aside from identifiers there are more places where Unicode can be used:

* Characters are not normalized, not even to NFC. This might have been
  useful, for example, to recompose combining marks, but unfortunately
  NFC may do more transformations than that, so it's a no-go. We preserve
  the exact Unicode character.

* Character names, on the other hand, are case-sensitive identifiers,
  so they are normalized as such.

* Strings and escaped identifiers are left untouched in order to preserve
  the exact spelling as in the source code.

* Directives are case-insensitive identifiers and are normalized as such.

* Numbers should be composed from ASCII only, so they are not normalized.
  Sometimes this produces weird parses because characters that look like
  signs are not treated as such. However, these characters are invalid in
  numbers, so it's somewhat justified.

* Peculiar identifiers are shit. I'm sorry. Because of NFKC it is possible
  to write a plain, unescaped identifier that will parse as a number after
  going through NFKC. It may even look exactly like a number without being
  one. There is not much we can do about this, so we produce a warning
  just in case. (A sketch of this hazard follows after this message.)

* Datum labels are mostly numbers, so they are not normalized either.
  Note that sometimes they can be treated as numbers with an invalid prefix.

* Comments are ignored.

* Delimiters should be ASCII-only. No discussion on this. Unicode has
  various fancy whitespaces and line separators, but this is source
  code, not a rich text document in a word processor.

Also, case-folding is currently performed only for the ASCII range.
Identifiers should use the NFKC_Casefold transformation. It will be
implemented later.
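
A sketch of the peculiar-identifier hazard mentioned in the commit message above, using the unicode-normalization crate (illustrative only): fullwidth digits are not ASCII digits, so a permissive lexer can accept them as a plain identifier, yet NFKC turns them into an ordinary number.

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // U+FF11..U+FF13 are FULLWIDTH DIGITs: not ASCII digits, so a
    // permissive reader can lex "１２３" as a plain identifier...
    let peculiar = "\u{FF11}\u{FF12}\u{FF13}";
    let normalized: String = peculiar.nfkc().collect();
    // ...but after NFKC it is indistinguishable from the number 123.
    assert_eq!(normalized, "123");
}
```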