Skip to content

Commit

Permalink
Normalize identifiers to NFKC
Browse files Browse the repository at this point in the history
This is somewhat controversial thing, but I have made a decision:
identifiers should be normalized to fold visual ambiguity, and the
normalization form should be NFKC.

Rationale:

1. Compatibility decomposition is favored over canonical one because
   it provides useful folding for letter ligatures, fullwidth forms,
   certain CJK ideographs, etc.

2. Compatibility decomposition is favored over canonical one because
   it provides more protection from visual spoofing.

3. Standard Unicode transformation should be favored over anything
   ad-hoc because it's predictable and more mature.

4. Normalization is a compromise between freedom of expression and
   ease of implementation. Source code is not prose, there are rules.

Here are some references to other languages:

    SRFI 52: http://srfi-email.schemers.org/srfi-52/
    Julia:   JuliaLang/julia#5434
    Python:  http://bugs.python.org/issue10952
    Rust:    rust-lang/rust#2253

Unfortunately, there aren't very many precedents and open discussions
about Unicode usage in programming languages, especially in languages
with very permissive identifier syntax (like Scheme).

Aside from identifiers there are more places where Unicode can be used:

* Characters are not normalized, not even to NFC. This may have been
  useful, for example, to recompose combining marks, but unfortunately
  NFC may do more transformations than that, so it is no go. We preserve
  the exact Unicode character.

* Character names, on the other hand, are case-sensitive identifiers,
  so they are normalized as such.

* Strings and escaped identifiers are left untouched in order to preserve
  the exact spelling as in the source code.

* Directives are case-insensitive identifiers and are normalized as such.

* Numbers should be composed from ASCII only so they are not normalized.
  Sometimes this produces weird parses because characters that look like
  signs are not treated as such. However, these characters are invalid in
  numbers, so it's somewhat justified.

* Peculiar identifiers are shit. I'm sorry. Because of NFKC is is possible
  to write a plain, unescaped identifier that will parse as a number after
  going through NFKC. It may even look exactly like a number without being
  one. There is not much we can do about this, so we produce a warning
  just in case.

* Datum labels are mostly numbers, so they are not normalized as well.
  Note that sometimes they can be treated as numbers with invalid prefix.

* Comments are ignored.

* Delimiters should be ASCII-only. No discussion on this. Unicode has
  various fancy whitespaces and line separators, but this is source
  code, not a rich text document in a word processor.

Also, currently case-folding is performed only for ASCII range.
Identifiers should use NFKC_casefold transformation. It will be
implemented later.
  • Loading branch information
ilammy committed Nov 19, 2016
1 parent a78b136 commit 505faba
Show file tree
Hide file tree
Showing 3 changed files with 456 additions and 123 deletions.
3 changes: 3 additions & 0 deletions src/libreader/diagnostics.rs
Original file line number Diff line number Diff line change
Expand Up @@ -146,6 +146,9 @@ pub enum DiagnosticKind {
/// Lexer has scanned over a character that is not allowed in identifers.
err_lexer_invalid_identifier_character,

/// Lexer has scanned over a plain identifier which looks like a number after normalization.
warn_lexer_identifier_looks_like_number,

/// Lexer has scanned over an unrecognized directive.
err_lexer_unknown_directive,

Expand Down
269 changes: 170 additions & 99 deletions src/libreader/lexer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -566,8 +566,9 @@ impl<'a> StringScanner<'a> {
return Token::Character(first_character);
}

// Match character names case-insensitively if the user has requested #!fold-case before.
let name = if self.folding_case { fold_identifier_case(name) } else { name.to_owned() };
// Normalize character names before matching (this is important mostly for the case when
// #!fold-case is in effect).
let name = self.normalize_identifer(name);

// Finally, handle named character literals.
match &name[..] {
Expand Down Expand Up @@ -870,8 +871,8 @@ impl<'a> StringScanner<'a> {
}
let name_end = self.prev_pos;

// Directives are case-insensitive.
let value = fold_identifier_case(&self.buf[name_start..name_end]);
// Directives are matched as case-insensitive identifiers.
let value = normalize_case_insensitive_identifier(&self.buf[name_start..name_end]);

match &value[..] {
// Handle case-folding directives.
Expand Down Expand Up @@ -922,8 +923,15 @@ impl<'a> StringScanner<'a> {

let value = &self.buf[start..end];

// Fold case in identifiers if the user has requested #!fold-case before.
let value = if self.folding_case { fold_identifier_case(value) } else { value.to_owned() };
// Normalize identifier spelling and handle `#!fold-case` mode.
let value = self.normalize_identifer(value);

// Warn the user if the identifier can be parsed as a number after normalization, maybe
// they intended to write a number, but, for example, their editor used fullwidth digits.
if looks_like_number_prefix(&value, 10) {
self.diagnostic.report(DiagnosticKind::warn_lexer_identifier_looks_like_number,
Span::new(start, end));
}

return Token::Identifier(self.pool.intern_string(value));
}
Expand Down Expand Up @@ -969,7 +977,8 @@ impl<'a> StringScanner<'a> {
}
}

// Note that escaped identifiers are always case-sensitive.
// Note that escaped identifiers are always verbatim. They are case-sensitive and they are
// not normalized in any way.
return Token::Identifier(self.pool.intern_string(value));
}

Expand Down Expand Up @@ -1212,82 +1221,7 @@ impl<'a> StringScanner<'a> {

/// Check whether we were fooled and there is a peculiar identifier ahead instead of a number.
fn peculiar_identifier_ahead(&self, radix: u8) -> bool {
// Now, the whole notion of peculiar identifiers is abominable. That's what you get when
// you make identifier syntax as permissive as it is in Scheme. Well, whatever. This is
// just another hysterical raisin to deal with.
//
// Here is the formal grammar of peculiar identifiers from R7RS:
//
// <peculiar-identifier> ::= <explicit-sign> (1)
// | <explicit-sign> <sign-subsequent> <subsequent>* (2)
// | <explicit-sign> . <dot-subsequent> <subsequent>* (3)
// | . <dot-subsequent> <subsequent>* (4)
//
// <explicit-sign> ::= + | -
//
// <sign-subsequent> ::= <initial> | <explicit-sign> | @
//
// <dot-subsequent> ::= <sign-subsequent> | .
//
// where <subsequent> can be thought of as effectively anything except for <delimiter>,
// and <initial> is <subsequent> minus decimal digits and [+-.@]
//
// And there is an exception: +i, -i, +inf.0, -inf.0, +nan.0, -nan.0 (case-insensitive)
// are parsed as numbers, not peculiar identifiers. Note that they must match exactly.
//
// One wacky point about these exceptions, which is not immediately obvious from the
// formal grammar and R7RS commentary, is whether a string like "+inf.0+40000monkeys"
// is a valid peculiar identifer, or an invalid number literal, or something else.
//
// This could be a peculiar identifier as it matches its grammar and does not match
// the grammar of numbers. However, there is another rule that says that an identifier
// cannot have a valid number as its prefix, and in our case we have this "+inf.0".
// So we should treat this string as ungrammatical and we can report anything we want.
// But that rule also invalidates all identifiers starting with "+i", just this one
// special letter. Which is obviously dumb.
//
// Thus we take the following stance on this issue: if we see /[+-](inf|nan).0/i
// as a prefix then this is *not* a peculiar identifier. Otherwise it is, except
// for the exact case /[+-]i[:delimiter:]/i which is a number.

let first = self.cur;
let second = self.peek();
let third = self.peekpeek();

// Cases (1), (2), (3)
if first == Some('+') || first == Some('-') {
// Case (1)
if second.map_or(true, is_delimiter) {
return true;
}

// Case (3)
if second == Some('.') {
return third.map_or(true, |c| is_delimiter(c) || !is_digit(radix, c));
}

// Case (2) with exceptions
if second.map_or(true, |c| !is_digit(radix, c)) {
if self.ahead_is("inf.0") || self.ahead_is("nan.0") {
return false;
}
if self.ahead_is("i") && third.map_or(true, is_delimiter) {
return false;
}
return true;
}

// Anything starting with a sign followed by a digit must be a number.
return false;
}

// Case (4)
if first == Some('.') {
return second.map_or(true, |c| !is_delimiter(c) && !is_digit(radix, c));
}

// Anything starting with a digit must be a number.
return first.map_or(true, |c| !is_digit(radix, c));
!looks_like_number_prefix(&self.buf[self.prev_pos..], radix)
}

/// Scan a real part of a number (an integer, a fractional, or a rational number).
Expand Down Expand Up @@ -1616,27 +1550,13 @@ impl<'a> StringScanner<'a> {

/// Check whether the lookahead has a given ASCII prefix ignoring case.
fn ahead_is(&self, prefix: &str) -> bool {
use std::ascii::AsciiExt;

assert!(prefix.chars().all(|c| c.is_ascii() && c == c.to_ascii_lowercase()));

self.buf[self.pos..]
.chars()
.map(|c| c.to_ascii_lowercase())
.take(prefix.len())
.eq(prefix.chars())
has_ascii_prefix_ci(&self.buf[self.pos..], prefix)
}

/// Check whether the lookahead has a given ASCII prefix ignoring case,
/// and a delimiter goes right after that.
fn ahead_is_exactly(&self, prefix: &str) -> bool {
if !self.ahead_is(prefix) {
return false;
}
return self.buf[self.pos..]
.chars()
.nth(prefix.len())
.map_or(true, is_delimiter);
has_ascii_prefix_ci_exact(&self.buf[self.pos..], prefix)
}

/// Scan over an (invalid) suffix of an `+inf.0` or `-nan.0` value.
Expand Down Expand Up @@ -1690,6 +1610,25 @@ impl<'a> StringScanner<'a> {
Span::new(suffix_start, suffix_end));
}
}

/// Normalize an identifier name.
///
/// We normalize textual identifier names to NFKC (compatibility composition) in order to
/// fold any visual ambiguities resulting from very permissive identifier syntax. We also
/// remove any Default_Ignorable_Code_Points.
///
/// When `#!fold-case` is in effect we apply NFKC_Casefold transformation which simultaneously
/// folds case, normalizes text to NFKC, and removes Default_Ignorable_Code_Points.
///
/// As noted in the discussion of SRFI 52, NFKC is not an ideal transformation, but it is
/// a standardized best practice, so we go with it.
fn normalize_identifer(&self, s: &str) -> String {
if self.folding_case {
normalize_case_insensitive_identifier(s)
} else {
normalize_case_sensitive_identifier(s)
}
}
}

// - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Expand Down Expand Up @@ -1783,6 +1722,12 @@ fn is_identifier_subsequent(c: char) -> bool {
/// A replacement character used when we need to return a character, but don't have one.
const REPLACEMENT_CHARACTER: char = '\u{FFFD}';

/// U+200C ZERO WIDTH NON-JOINER
const ZERO_WIDTH_NON_JOINER: char = '\u{200C}';

/// U+200D ZERO WIDTH JOINER
const ZERO_WIDTH_JOINER: char = '\u{200D}';

/// Check whether `c` is a valid digit of base `base`.
fn is_digit(base: u8, c: char) -> bool {
match base {
Expand Down Expand Up @@ -1810,6 +1755,25 @@ fn hex_value(c: char) -> u8 {
return H[c as usize];
}

/// Normalize a string as a case-sensitive identifier.
fn normalize_case_sensitive_identifier(s: &str) -> String {
use unicode::normalization;

let normalized = normalization::nfkc(s);

// Scheme identifier syntax permits only ZWNJ and ZWJ to occur in valid identifiers,
// so we remove only these (and do not need a table of Default_Ignorable_Code_Points).
normalized.chars()
.filter(|&c| !(c == ZERO_WIDTH_NON_JOINER || c == ZERO_WIDTH_JOINER))
.collect()
}

/// Normalize a string as a case-insensitive identifier.
fn normalize_case_insensitive_identifier(s: &str) -> String {
// TODO: use NFKC_casefold instead of this
fold_identifier_case(&normalize_case_sensitive_identifier(s))
}

/// Apply case-folding to a directive or an identifier.
///
/// Note that this is _not_ a general case-folding procedure.
Expand All @@ -1819,3 +1783,110 @@ fn fold_identifier_case(s: &str) -> String {
// We support only ASCII now, so this will do:
return s.to_ascii_lowercase();
}

/// Check whether a string should be parsed as a number or it can be glanced off as an identifier
/// (possibly peculiar one). Returns true if it is a number.
fn looks_like_number_prefix(s: &str, radix: u8) -> bool {
// Now, the whole notion of peculiar identifiers is abominable. That's what you get when
// you make identifier syntax as permissive as it is in Scheme. Well, whatever. This is
// just another hysterical raisin to deal with.
//
// Here is the formal grammar of peculiar identifiers from R7RS:
//
// <peculiar-identifier> ::= <explicit-sign> (1)
// | <explicit-sign> <sign-subsequent> <subsequent>* (2)
// | <explicit-sign> . <dot-subsequent> <subsequent>* (3)
// | . <dot-subsequent> <subsequent>* (4)
//
// <explicit-sign> ::= + | -
//
// <sign-subsequent> ::= <initial> | <explicit-sign> | @
//
// <dot-subsequent> ::= <sign-subsequent> | .
//
// where <subsequent> can be thought of as effectively anything except for <delimiter>,
// and <initial> is <subsequent> minus decimal digits and [+-.@]
//
// And there is an exception: +i, -i, +inf.0, -inf.0, +nan.0, -nan.0 (case-insensitive)
// are parsed as numbers, not peculiar identifiers. Note that they must match exactly.
//
// One wacky point about these exceptions, which is not immediately obvious from the
// formal grammar and R7RS commentary, is whether a string like "+inf.0+40000monkeys"
// is a valid peculiar identifer, or an invalid number literal, or something else.
//
// This could be a peculiar identifier as it matches its grammar and does not match
// the grammar of numbers. However, there is another rule that says that an identifier
// cannot have a valid number as its prefix, and in our case we have this "+inf.0".
// So we should treat this string as ungrammatical and we can report anything we want.
// But that rule also invalidates all identifiers starting with "+i", just this one
// special letter. Which is obviously dumb.
//
// Thus we take the following stance on this issue: if we see /[+-](inf|nan).0/i
// as a prefix then this is *not* a peculiar identifier. Otherwise it is, except
// for the exact case /[+-]i[:delimiter:]/i which is a number.

let first = s.chars().nth(0);
let second = s.chars().nth(1);
let third = s.chars().nth(2);

// Cases (1), (2), (3)
if first == Some('+') || first == Some('-') {
// Case (1)
if second.map_or(true, is_delimiter) {
return false;
}

// Case (3)
if second == Some('.') {
return third.map_or(false, |c| !is_delimiter(c) && is_digit(radix, c));
}

// Case (2) with exceptions
if second.map_or(true, |c| !is_digit(radix, c)) {
let after_sign = &s[1..];
if has_ascii_prefix_ci(after_sign, "inf.0") {
return true;
}
if has_ascii_prefix_ci(after_sign, "nan.0") {
return true;
}
if has_ascii_prefix_ci_exact(after_sign, "i") {
return true;
}
return false;
}

// Anything starting with a sign followed by a digit must be a number.
return true;
}

// Case (4)
if first == Some('.') {
return second.map_or(false, |c| is_delimiter(c) || is_digit(radix, c));
}

// Anything starting with a digit must be a number.
return first.map_or(false, |c| is_digit(radix, c));
}

/// Return true if `s` has a given ASCII prefix ignoring case.
fn has_ascii_prefix_ci(s: &str, prefix: &str) -> bool {
use std::ascii::AsciiExt;

assert!(prefix.chars().all(|c| c.is_ascii() && c == c.to_ascii_lowercase()));

s.chars()
.map(|c| c.to_ascii_lowercase())
.take(prefix.len())
.eq(prefix.chars())
}

/// Return true if `s` has a given ASCII prefix ignoring case, and a delimiter after it.
fn has_ascii_prefix_ci_exact(s: &str, prefix: &str) -> bool {
if !has_ascii_prefix_ci(s, prefix) {
return false;
}
return s.chars()
.nth(prefix.len())
.map_or(true, is_delimiter);
}
Loading

0 comments on commit 505faba

Please sign in to comment.