-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
This is somewhat controversial thing, but I have made a decision: identifiers should be normalized to fold visual ambiguity, and the normalization form should be NFKC. Rationale: 1. Compatibility decomposition is favored over canonical one because it provides useful folding for letter ligatures, fullwidth forms, certain CJK ideographs, etc. 2. Compatibility decomposition is favored over canonical one because it provides more protection from visual spoofing. 3. Standard Unicode transformation should be favored over anything ad-hoc because it's predictable and more mature. 4. Normalization is a compromise between freedom of expression and ease of implementation. Source code is not prose, there are rules. Here are some references to other languages: SRFI 52: http://srfi-email.schemers.org/srfi-52/ Julia: JuliaLang/julia#5434 Python: http://bugs.python.org/issue10952 Rust: rust-lang/rust#2253 Unfortunately, there aren't very many precedents and open discussions about Unicode usage in programming languages, especially in languages with very permissive identifier syntax (like Scheme). Aside from identifiers there are more places where Unicode can be used: * Characters are not normalized, not even to NFC. This may have been useful, for example, to recompose combining marks, but unfortunately NFC may do more transformations than that, so it is no go. We preserve the exact Unicode character. * Character names, on the other hand, are case-sensitive identifiers, so they are normalized as such. * Strings and escaped identifiers are left untouched in order to preserve the exact spelling as in the source code. * Directives are case-insensitive identifiers and are normalized as such. * Numbers should be composed from ASCII only so they are not normalized. Sometimes this produces weird parses because characters that look like signs are not treated as such. However, these characters are invalid in numbers, so it's somewhat justified. * Peculiar identifiers are shit. I'm sorry. Because of NFKC is is possible to write a plain, unescaped identifier that will parse as a number after going through NFKC. It may even look exactly like a number without being one. There is not much we can do about this, so we produce a warning just in case. * Datum labels are mostly numbers, so they are not normalized as well. Note that sometimes they can be treated as numbers with invalid prefix. * Comments are ignored. * Delimiters should be ASCII-only. No discussion on this. Unicode has various fancy whitespaces and line separators, but this is source code, not a rich text document in a word processor. Also, currently case-folding is performed only for ASCII range. Identifiers should use NFKC_casefold transformation. It will be implemented later.
- Loading branch information
Showing
3 changed files
with
456 additions
and
123 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.