From a73cd6fef02d16483cb5f9c1f7f6d81db3c46584 Mon Sep 17 00:00:00 2001
From: Lee Byron <lee@leebyron.com>
Date: Fri, 10 Jan 2020 12:50:12 -0800
Subject: [PATCH] Clarify lexing is greedy with lookahead restrictions. (#599)

GraphQL syntactical grammars intend to be unambiguous. While lexical grammars should also be - there has historically been an assumption that lexical parsing is greedy. This is obvious for numbers and words, but less obvious for empty block strings.

This also removes regular expression representation from the lexical grammar notation, since it wasn't always clear.

Either way, the additional clarity removes ambiguity from the spec

Partial fix for #564

Specifically addresses https://github.com/graphql/graphql-spec/pull/564#issuecomment-508714529
---
 spec/Appendix A -- Notation Conventions.md |  46 ++++--
 spec/Appendix B -- Grammar Summary.md      |  48 ++++--
 spec/Section 2 -- Language.md              | 165 ++++++++++++++++-----
 3 files changed, 192 insertions(+), 67 deletions(-)

diff --git a/spec/Appendix A -- Notation Conventions.md b/spec/Appendix A -- Notation Conventions.md
index cbb8e8a3a..14d55fc70 100644
--- a/spec/Appendix A -- Notation Conventions.md	
+++ b/spec/Appendix A -- Notation Conventions.md	
@@ -22,8 +22,10 @@ of the sequences it is defined by, until all non-terminal symbols have been
 replaced by terminal characters.
 
 Terminals are represented in this document in a monospace font in two forms: a
-specific Unicode character or sequence of Unicode characters (ex. {`=`} or {`terminal`}), and a pattern of Unicode characters defined by a regular expression
-(ex {/[0-9]+/}).
+specific Unicode character or sequence of Unicode characters (ie. {`=`} or
+{`terminal`}), and prose typically describing a specific Unicode code-point
+{"Space (U+0020)"}. Sequences of Unicode characters only appear in syntactic
+grammars and represent a {Name} token of that specific sequence.
 
 Non-terminal production rules are represented in this document using the
 following notation for a non-terminal with a single definition:
@@ -48,23 +50,25 @@ ListOfLetterA :
 
 The GraphQL language is defined in a syntactic grammar where terminal symbols
 are tokens. Tokens are defined in a lexical grammar which matches patterns of
-source characters. The result of parsing a sequence of source Unicode characters
-produces a GraphQL AST.
+source characters. The result of parsing a source text sequence of Unicode
+characters first produces a sequence of lexical tokens according to the lexical
+grammar which then produces abstract syntax tree (AST) according to the
+syntactical grammar.
 
-A Lexical grammar production describes non-terminal "tokens" by
+A lexical grammar production describes non-terminal "tokens" by
 patterns of terminal Unicode characters. No "whitespace" or other ignored
 characters may appear between any terminal Unicode characters in the lexical
 grammar production. A lexical grammar production is distinguished by a two colon
 `::` definition.
 
-Word :: /[A-Za-z]+/
+Word :: Letter+
 
 A Syntactical grammar production describes non-terminal "rules" by patterns of
-terminal Tokens. Whitespace and other ignored characters may appear before or
-after any terminal Token. A syntactical grammar production is distinguished by a
-one colon `:` definition.
+terminal Tokens. {WhiteSpace} and other {Ignored} sequences may appear before or
+after any terminal {Token}. A syntactical grammar production is distinguished by
+a one colon `:` definition.
 
-Sentence : Noun Verb
+Sentence : Word+ `.`
 
 
 ## Grammar Notation
@@ -80,13 +84,11 @@ and their expanded definitions in the context-free grammar.
 A grammar production may specify that certain expansions are not permitted by
 using the phrase "but not" and then indicating the expansions to be excluded.
 
-For example, the production:
+For example, the following production means that the nonterminal {SafeWord} may
+be replaced by any sequence of characters that could replace {Word} provided
+that the same sequence of characters could not replace {SevenCarlinWords}.
 
-SafeName : Name but not SevenCarlinWords
-
-means that the nonterminal {SafeName} may be replaced by any sequence of
-characters that could replace {Name} provided that the same sequence of
-characters could not replace {SevenCarlinWords}.
+SafeWord : Word but not SevenCarlinWords
 
 A grammar may also list a number of restrictions after "but not" separated
 by "or".
@@ -96,6 +98,18 @@ For example:
 NonBooleanName : Name but not `true` or `false`
 
 
+**Lookahead Restrictions**
+
+A grammar production may specify that certain characters or tokens are not
+permitted to follow it by using the pattern {[lookahead != NotAllowed]}.
+Lookahead restrictions are often used to remove ambiguity from the grammar.
+
+The following example makes it clear that {Letter+} must be greedy, since {Word}
+cannot be followed by yet another {Letter}.
+
+Word :: Letter+ [lookahead != Letter]
+
+
 **Optionality and Lists**
 
 A subscript suffix "{Symbol?}" is shorthand for two possible sequences, one
diff --git a/spec/Appendix B -- Grammar Summary.md b/spec/Appendix B -- Grammar Summary.md
index efdcae8f8..a0308e79c 100644
--- a/spec/Appendix B -- Grammar Summary.md	
+++ b/spec/Appendix B -- Grammar Summary.md	
@@ -1,6 +1,12 @@
 # B. Appendix: Grammar Summary
 
-SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
+## Source Text
+
+SourceCharacter ::
+  - "U+0009"
+  - "U+000A"
+  - "U+000D"
+  - "U+0020–U+FFFF"
 
 
 ## Ignored Tokens
@@ -20,10 +26,10 @@ WhiteSpace ::
 
 LineTerminator ::
   - "New Line (U+000A)"
-  - "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
+  - "Carriage Return (U+000D)" [lookahead != "New Line (U+000A)"]
   - "Carriage Return (U+000D)" "New Line (U+000A)"
 
-Comment :: `#` CommentChar*
+Comment :: `#` CommentChar* [lookahead != CommentChar]
 
 CommentChar :: SourceCharacter but not LineTerminator
 
@@ -41,9 +47,28 @@ Token ::
 
 Punctuator :: one of ! $ & ( ) ... : = @ [ ] { | }
 
-Name :: /[_A-Za-z][_0-9A-Za-z]*/
+Name ::
+  - NameStart NameContinue* [lookahead != NameContinue]
+
+NameStart ::
+  - Letter
+  - `_`
+
+NameContinue ::
+  - Letter
+  - Digit
+  - `_`
 
-IntValue :: IntegerPart
+Letter :: one of
+  `A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M`
+  `N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z`
+  `a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m`
+  `n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z`
+
+Digit :: one of
+  `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+
+IntValue :: IntegerPart [lookahead != {Digit, `.`, ExponentPart}]
 
 IntegerPart ::
   - NegativeSign? 0
@@ -51,14 +76,12 @@ IntegerPart ::
 
 NegativeSign :: -
 
-Digit :: one of 0 1 2 3 4 5 6 7 8 9
-
 NonZeroDigit :: Digit but not `0`
 
 FloatValue ::
-  - IntegerPart FractionalPart
-  - IntegerPart ExponentPart
-  - IntegerPart FractionalPart ExponentPart
+  - IntegerPart FractionalPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}]
+  - IntegerPart FractionalPart [lookahead != {Digit, `.`, ExponentIndicator}]
+  - IntegerPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}]
 
 FractionalPart :: . Digit+
 
@@ -69,7 +92,8 @@ ExponentIndicator :: one of `e` `E`
 Sign :: one of + -
 
 StringValue ::
-  - `"` StringCharacter* `"`
+  - `""` [lookahead != `"`]
+  - `"` StringCharacter+ `"`
   - `"""` BlockStringCharacter* `"""`
 
 StringCharacter ::
@@ -89,7 +113,7 @@ Note: Block string values are interpreted to exclude blank initial and trailing
 lines and uniform indentation with {BlockStringValue()}.
 
 
-## Document
+## Document Syntax
 
 Document : Definition+
 
diff --git a/spec/Section 2 -- Language.md b/spec/Section 2 -- Language.md
index ba8123cb1..69b07f0c4 100644
--- a/spec/Section 2 -- Language.md	
+++ b/spec/Section 2 -- Language.md	
@@ -7,16 +7,50 @@ common unit of composition allowing for query reuse.
 
 A GraphQL document is defined as a syntactic grammar where terminal symbols are
 tokens (indivisible lexical units). These tokens are defined in a lexical
-grammar which matches patterns of source characters (defined by a
-double-colon `::`).
+grammar which matches patterns of source characters. In this document, syntactic
+grammar productions are distinguished with a colon `:` while lexical grammar
+productions are distinguished with a double-colon `::`.
 
-Note: See [Appendix A](#sec-Appendix-Notation-Conventions) for more details about the definition of lexical and syntactic grammar and other notational conventions
-used in this document.
+The source text of a GraphQL document must be a sequence of {SourceCharacter}.
+The character sequence must be described by a sequence of {Token} and {Ignored}
+lexical grammars. The lexical token sequence, omitting {Ignored}, must be
+described by a single {Document} syntactic grammar.
+
+Note: See [Appendix A](#sec-Appendix-Notation-Conventions) for more information
+about the lexical and syntactic grammar and other notational conventions used
+throughout this document.
+
+**Lexical Analysis & Syntactic Parse**
+
+The source text of a GraphQL document is first converted into a sequence of
+lexical tokens, {Token}, and ignored tokens, {Ignored}. The source text is
+scanned from left to right, repeatedly taking the next possible sequence of
+code-points allowed by the lexical grammar productions as the next token. This
+sequence of lexical tokens are then scanned from left to right to produce an
+abstract syntax tree (AST) according to the {Document} syntactical grammar.
+
+Lexical grammar productions in this document use *lookahead restrictions* to
+remove ambiguity and ensure a single valid lexical analysis. A lexical token is
+only valid if not followed by a character in its lookahead restriction.
+
+For example, an {IntValue} has the restriction {[lookahead != Digit]}, so cannot
+be followed by a {Digit}. Because of this, the sequence {`123`} cannot represent
+as the tokens ({`12`}, {`3`}) since {`12`} is followed by the {Digit} {`3`} and
+so must only represent a single token. Use {WhiteSpace} or other {Ignored}
+between characters to represent multiple tokens.
+
+Note: This typically has the same behavior as a
+"[maximal munch](https://en.wikipedia.org/wiki/Maximal_munch)" longest possible
+match, however some lookahead restrictions include additional constraints.
 
 
 ## Source Text
 
-SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
+SourceCharacter ::
+  - "U+0009"
+  - "U+000A"
+  - "U+000D"
+  - "U+0020–U+FFFF"
 
 GraphQL documents are expressed as a sequence of
 [Unicode](https://unicode.org/standard/standard.html) characters. However, with
@@ -60,7 +94,7 @@ control tools.
 
 LineTerminator ::
   - "New Line (U+000A)"
-  - "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
+  - "Carriage Return (U+000D)" [lookahead != "New Line (U+000A)"]
   - "Carriage Return (U+000D)" "New Line (U+000A)"
 
 Like white space, line terminators are used to improve the legibility of source
@@ -75,19 +109,20 @@ the line number.
 
 ### Comments
 
-Comment :: `#` CommentChar*
+Comment :: `#` CommentChar* [lookahead != CommentChar]
 
 CommentChar :: SourceCharacter but not LineTerminator
 
 GraphQL source documents may contain single-line comments, starting with the
 {`#`} marker.
 
-A comment can contain any Unicode code point except {LineTerminator} so a
-comment always consists of all code points starting with the {`#`} character up
-to but not including the line terminator.
+A comment can contain any Unicode code point in {SourceCharacter} except
+{LineTerminator} so a comment always consists of all code points starting with
+the {`#`} character up to but not including the {LineTerminator} (or end of
+the source).
 
-Comments behave like white space and may appear after any token, or before a
-line terminator, and have no significance to the semantic meaning of a
+Comments are {Ignored} like white space and may appear after any token, or
+before a {LineTerminator}, and have no significance to the semantic meaning of a
 GraphQL Document.
 
 
@@ -118,8 +153,7 @@ Token ::
 A GraphQL document is comprised of several kinds of indivisible lexical tokens
 defined here in a lexical grammar by patterns of source Unicode characters.
 
-Tokens are later used as terminal symbols in a GraphQL Document
-syntactic grammars.
+Tokens are later used as terminal symbols in GraphQL syntactic grammar rules.
 
 
 ### Ignored Tokens
@@ -131,15 +165,16 @@ Ignored ::
   - Comment
   - Comma
 
-Before and after every lexical token may be any amount of ignored tokens
-including {WhiteSpace} and {Comment}. No ignored regions of a source
-document are significant, however otherwise ignored source characters may appear
-within a lexical token in a significant way, for example a {StringValue} may
-contain white space characters and commas.
+{Ignored} tokens are used to improve readability and provide separation between
+{Token}, but are otherwise insignificant and not referenced in syntactical
+grammar productions.
 
-No characters are ignored while parsing a given token, as an example no
-white space characters are permitted between the characters defining a
-{FloatValue}.
+Any amount of {Ignored} may appear before and after every lexical token. No
+ignored regions of a source document are significant, however {SourceCharacter}
+which appear in {Ignored} may also appear within a lexical {Token} in a
+significant way, for example a {StringValue} may contain white space characters.
+No {Ignored} may appear *within* a {Token}, for example no white space
+characters are permitted between the characters defining a {FloatValue}.
 
 
 ### Punctuators
@@ -153,7 +188,26 @@ lacks the punctuation often used to describe mathematical expressions.
 
 ### Names
 
-Name :: /[_A-Za-z][_0-9A-Za-z]*/
+Name ::
+  - NameStart NameContinue* [lookahead != NameContinue]
+
+NameStart ::
+  - Letter
+  - `_`
+
+NameContinue ::
+  - Letter
+  - Digit
+  - `_`
+
+Letter :: one of
+  `A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M`
+  `N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z`
+  `a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m`
+  `n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z`
+
+Digit :: one of
+  `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
 
 GraphQL Documents are full of named things: operations, fields, arguments,
 types, directives, fragments, and variables. All names must follow the same
@@ -163,8 +217,13 @@ Names in GraphQL are case-sensitive. That is to say `name`, `Name`, and `NAME`
 all refer to different names. Underscores are significant, which means
 `other_name` and `othername` are two different names.
 
-Names in GraphQL are limited to this <acronym>ASCII</acronym> subset of possible
-characters to support interoperation with as many other systems as possible.
+A {Name} must not be followed by a {NameContinue}. In other words, a {Name}
+token is always the longest possible valid sequence. The source characters
+{`a1`} cannot be interpreted as two tokens since {`a`} is followed by the {NameContinue} {`1`}.
+
+Note: Names in GraphQL are limited to the Latin <acronym>ASCII</acronym> subset
+of {SourceCharacter} in order to support interoperation with as many other
+systems as possible.
 
 
 ## Document
@@ -666,7 +725,7 @@ specified as a variable. List and inputs objects may also contain variables (unl
 
 ### Int Value
 
-IntValue :: IntegerPart
+IntValue :: IntegerPart [lookahead != {Digit, `.`, ExponentIndicator}]
 
 IntegerPart ::
   - NegativeSign? 0
@@ -674,19 +733,27 @@ IntegerPart ::
 
 NegativeSign :: -
 
-Digit :: one of 0 1 2 3 4 5 6 7 8 9
-
 NonZeroDigit :: Digit but not `0`
 
-An Int number is specified without a decimal point or exponent (ex. `1`).
+An {IntValue} is specified without a decimal point or exponent but may be
+negative (ex. {-123}). It must not have any leading {0}.
+
+An {IntValue} must not be followed by a {Digit}. In other words, an {IntValue}
+token is always the longest possible valid sequence. The source characters
+{12} cannot be interpreted as two tokens since {1} is followed by the {Digit}
+{2}. This also means the source {00} is invalid since it can neither be
+interpreted as a single token nor two {0} tokens.
+
+An {IntValue} must not be followed by a {.} or {ExponentIndicator}. If either
+follows then the token must only be interpreted as a possible {FloatValue}.
 
 
 ### Float Value
 
 FloatValue ::
-  - IntegerPart FractionalPart
-  - IntegerPart ExponentPart
-  - IntegerPart FractionalPart ExponentPart
+  - IntegerPart FractionalPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}]
+  - IntegerPart FractionalPart [lookahead != {Digit, `.`, ExponentIndicator}]
+  - IntegerPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}]
 
 FractionalPart :: . Digit+
 
@@ -696,8 +763,18 @@ ExponentIndicator :: one of `e` `E`
 
 Sign :: one of + -
 
-A Float number includes either a decimal point (ex. `1.0`) or an exponent
-(ex. `1e50`) or both (ex. `6.0221413e23`).
+A {FloatValue} includes either a decimal point (ex. {1.0}) or an exponent
+(ex. {1e50}) or both (ex. {6.0221413e23}) and may be negative. Like {IntValue},
+it also must not have any leading {0}.
+
+A {FloatValue} must not be followed by a {Digit}. In other words, a {FloatValue}
+token is always the longest possible valid sequence. The source characters
+{1.23} cannot be interpreted as two tokens since {1.2} is followed by the
+{Digit} {3}.
+
+A {FloatValue} must not be followed by a {.} or {ExponentIndicator}. If either
+follows then a parse error occurs. For example, the sequence {1.23.4} cannot be
+interpreted as two tokens ({1.2}, {3.4}).
 
 
 ### Boolean Value
@@ -710,7 +787,8 @@ The two keywords `true` and `false` represent the two boolean values.
 ### String Value
 
 StringValue ::
-  - `"` StringCharacter* `"`
+  - `""` [lookahead != `"`]
+  - `"` StringCharacter+ `"`
   - `"""` BlockStringCharacter* `"""`
 
 StringCharacter ::
@@ -726,10 +804,15 @@ BlockStringCharacter ::
   - SourceCharacter but not `"""` or `\"""`
   - `\"""`
 
-Strings are sequences of characters wrapped in double-quotes (`"`). (ex.
-`"Hello World"`). White space and other otherwise-ignored characters are
+Strings are sequences of characters wrapped in quotation marks (U+0022).
+(ex. {`"Hello World"`}). White space and other otherwise-ignored characters are
 significant within a string value.
 
+The empty string {`""`} must not be followed by another {`"`} otherwise it would
+be interpreted as the beginning of a block string. As an example, the source
+{`""""""`} can only be interpreted as a single empty block string and not three
+empty strings.
+
 Note: Unicode characters are allowed within String value literals, however
 {SourceCharacter} must not contain some ASCII control characters so escape
 sequences must be used to represent these characters.
@@ -790,10 +873,14 @@ block string.
 
 **Semantics**
 
-StringValue :: `"` StringCharacter* `"`
+StringValue :: `""`
+
+  * Return an empty sequence.
+
+StringValue :: `"` StringCharacter+ `"`
 
   * Return the Unicode character sequence of all {StringCharacter}
-    Unicode character values (which may be an empty sequence).
+    Unicode character values.
 
 StringCharacter :: SourceCharacter but not `"` or \ or LineTerminator