Clarify lexing is greedy with lookahead restrictions. (#599)

GraphQL syntactical grammars are intended to be unambiguous. Lexical grammars should be as well, but historically there has been an unstated assumption that lexical parsing is greedy. This is obvious for numbers and words, but less obvious for empty block strings.

This also removes the regular expression representation from the lexical grammar notation, since it wasn't always clear.

Either way, the additional clarity removes ambiguity from the spec.

Partial fix for #564

Specifically addresses #564 (comment)
graphql · Jan 10, 2020 · a73cd6f
1 parent e491220
Showing 3 changed files with 192 additions and 67 deletions.
diff --git a/spec/Appendix A -- Notation Conventions.md b/spec/Appendix A -- Notation Conventions.md
@@ -22,8 +22,10 @@ of the sequences it is defined by, until all non-terminal symbols have been
 replaced by terminal characters.
 
 Terminals are represented in this document in a monospace font in two forms: a
-specific Unicode character or sequence of Unicode characters (ex. {`=`} or {`terminal`}), and a pattern of Unicode characters defined by a regular expression
-(ex {/[0-9]+/}).
+specific Unicode character or sequence of Unicode characters (ie. {`=`} or
+{`terminal`}), and prose typically describing a specific Unicode code-point
+{"Space (U+0020)"}. Sequences of Unicode characters only appear in syntactic
+grammars and represent a {Name} token of that specific sequence.
 
 Non-terminal production rules are represented in this document using the
 following notation for a non-terminal with a single definition:
@@ -48,23 +50,25 @@ ListOfLetterA :
 
 The GraphQL language is defined in a syntactic grammar where terminal symbols
 are tokens. Tokens are defined in a lexical grammar which matches patterns of
-source characters. The result of parsing a sequence of source Unicode characters
-produces a GraphQL AST.
+source characters. The result of parsing a source text sequence of Unicode
+characters first produces a sequence of lexical tokens according to the lexical
+grammar which then produces abstract syntax tree (AST) according to the
+syntactical grammar.
 
-A Lexical grammar production describes non-terminal "tokens" by
+A lexical grammar production describes non-terminal "tokens" by
 patterns of terminal Unicode characters. No "whitespace" or other ignored
 characters may appear between any terminal Unicode characters in the lexical
 grammar production. A lexical grammar production is distinguished by a two colon
 `::` definition.
 
-Word :: /[A-Za-z]+/
+Word :: Letter+
 
 A Syntactical grammar production describes non-terminal "rules" by patterns of
-terminal Tokens. Whitespace and other ignored characters may appear before or
-after any terminal Token. A syntactical grammar production is distinguished by a
-one colon `:` definition.
+terminal Tokens. {WhiteSpace} and other {Ignored} sequences may appear before or
+after any terminal {Token}. A syntactical grammar production is distinguished by
+a one colon `:` definition.
 
-Sentence : Noun Verb
+Sentence : Word+ `.`
 
 
 ## Grammar Notation
@@ -80,13 +84,11 @@ and their expanded definitions in the context-free grammar.
 A grammar production may specify that certain expansions are not permitted by
 using the phrase "but not" and then indicating the expansions to be excluded.
 
-For example, the production:
+For example, the following production means that the nonterminal {SafeWord} may
+be replaced by any sequence of characters that could replace {Word} provided
+that the same sequence of characters could not replace {SevenCarlinWords}.
 
-SafeName : Name but not SevenCarlinWords
-
-means that the nonterminal {SafeName} may be replaced by any sequence of
-characters that could replace {Name} provided that the same sequence of
-characters could not replace {SevenCarlinWords}.
+SafeWord : Word but not SevenCarlinWords
 
 A grammar may also list a number of restrictions after "but not" separated
 by "or".
@@ -96,6 +98,18 @@ For example:
 NonBooleanName : Name but not `true` or `false`
 
 
+**Lookahead Restrictions**
+
+A grammar production may specify that certain characters or tokens are not
+permitted to follow it by using the pattern {[lookahead != NotAllowed]}.
+Lookahead restrictions are often used to remove ambiguity from the grammar.
+
+The following example makes it clear that {Letter+} must be greedy, since {Word}
+cannot be followed by yet another {Letter}.
+
+Word :: Letter+ [lookahead != Letter]
+
+
 **Optionality and Lists**
 
 A subscript suffix "{Symbol?}" is shorthand for two possible sequences, one
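
Note on the lookahead restriction added in this hunk: the effect of `Word :: Letter+ [lookahead != Letter]` can be illustrated with a small Python sketch. The helper name `word_matches` is hypothetical and not part of the spec; it enumerates every prefix that could match `Letter+` and keeps only those that satisfy the restriction, which leaves the longest (greedy) match as the sole candidate.

```python
import string

# Letter: the 52 ASCII letters, as in the `Letter` production of this commit.
LETTERS = set(string.ascii_letters)


def word_matches(source: str, start: int) -> list[str]:
    """Return every prefix at `start` that satisfies
    Word :: Letter+ [lookahead != Letter].

    Hypothetical helper for illustration only.
    """
    matches = []
    end = start
    while end < len(source) and source[end] in LETTERS:
        end += 1
        # source[start:end] matches Letter+, but the lookahead restriction
        # rejects it whenever the next character is another Letter.
        next_is_letter = end < len(source) and source[end] in LETTERS
        if not next_is_letter:
            matches.append(source[start:end])
    return matches


print(word_matches("cat sat", 0))  # ['cat']
```

Without the restriction, `c`, `ca` and `cat` would all be valid expansions of `Word` at position 0; with it, only `cat` remains.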

diff --git a/spec/Appendix B -- Grammar Summary.md b/spec/Appendix B -- Grammar Summary.md
@@ -1,6 +1,12 @@
 # B. Appendix: Grammar Summary
 
-SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
+## Source Text
+
+SourceCharacter ::
+  - "U+0009"
+  - "U+000A"
+  - "U+000D"
+  - "U+0020–U+FFFF"
 
 
 ## Ignored Tokens
@@ -20,10 +26,10 @@ WhiteSpace ::
 
 LineTerminator ::
   - "New Line (U+000A)"
-  - "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
+  - "Carriage Return (U+000D)" [lookahead != "New Line (U+000A)"]
   - "Carriage Return (U+000D)" "New Line (U+000A)"
 
-Comment :: `#` CommentChar*
+Comment :: `#` CommentChar* [lookahead != CommentChar]
 
 CommentChar :: SourceCharacter but not LineTerminator
 
@@ -41,24 +47,41 @@ Token ::
 
 Punctuator :: one of ! $ & ( ) ... : = @ [ ] { | }
 
-Name :: /[_A-Za-z][_0-9A-Za-z]*/
+Name ::
+  - NameStart NameContinue* [lookahead != NameContinue]
+
+NameStart ::
+  - Letter
+  - `_`
+
+NameContinue ::
+  - Letter
+  - Digit
+  - `_`
 
-IntValue :: IntegerPart
+Letter :: one of
+  `A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M`
+  `N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z`
+  `a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m`
+  `n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z`
+
+Digit :: one of
+  `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+
+IntValue :: IntegerPart [lookahead != {Digit, `.`, ExponentPart}]
 
 IntegerPart ::
   - NegativeSign? 0
   - NegativeSign? NonZeroDigit Digit*
 
 NegativeSign :: -
 
-Digit :: one of 0 1 2 3 4 5 6 7 8 9
-
 NonZeroDigit :: Digit but not `0`
 
 FloatValue ::
-  - IntegerPart FractionalPart
-  - IntegerPart ExponentPart
-  - IntegerPart FractionalPart ExponentPart
+  - IntegerPart FractionalPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}]
+  - IntegerPart FractionalPart [lookahead != {Digit, `.`, ExponentIndicator}]
+  - IntegerPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}]
 
 FractionalPart :: . Digit+
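
Note on the numeric lookahead restrictions in this hunk: the sketch below approximates `IntValue` and `FloatValue` with Python regular expressions, using negative lookahead `(?!...)` for the `[lookahead != ...]` clauses. It is an illustration rather than the reference lexer. The `EXPONENT_PART` pattern assumes `ExponentPart :: ExponentIndicator Sign? Digit+`, which is not shown in this diff, and `lex_number` is a hypothetical helper.

```python
import re

INTEGER_PART = r"-?(?:0|[1-9][0-9]*)"
FRACTIONAL_PART = r"\.[0-9]+"
# Assumption: ExponentPart :: ExponentIndicator Sign? Digit+
EXPONENT_PART = r"[eE][+-]?[0-9]+"

# FloatValue: the three alternatives, none of which may be followed by
# a Digit, `.`, or ExponentIndicator.
FLOAT_VALUE = re.compile(
    INTEGER_PART
    + rf"(?:{FRACTIONAL_PART}(?:{EXPONENT_PART})?|{EXPONENT_PART})"
    + r"(?![0-9.eE])"
)

# IntValue: IntegerPart not followed by a Digit, `.`, or ExponentPart.
INT_VALUE = re.compile(INTEGER_PART + rf"(?![0-9.]|{EXPONENT_PART})")


def lex_number(source: str, pos: int = 0):
    """Return (kind, text) for the number token at `pos`, or None."""
    for kind, pattern in (("FloatValue", FLOAT_VALUE), ("IntValue", INT_VALUE)):
        match = pattern.match(source, pos)
        if match:
            return kind, match.group()
    return None


print(lex_number("1.5e3"))  # ('FloatValue', '1.5e3')
print(lex_number("1.2.3"))  # None
```

Under these patterns `1.2.3` is rejected outright instead of being split into `1.2` followed by `.3`, and `0123` is rejected instead of being read as `0` followed by `123`.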
 
@@ -69,7 +92,8 @@ ExponentIndicator :: one of `e` `E`
 Sign :: one of + -
 
 StringValue ::
-  - `"` StringCharacter* `"`
+  - `""` [lookahead != `"`]
+  - `"` StringCharacter+ `"`
   - `"""` BlockStringCharacter* `"""`
 
 StringCharacter ::
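
Note on the `StringValue` change in this hunk: splitting out the empty string as `""` with `[lookahead != "]` settles how six consecutive quotes are read. Without the restriction they could be three empty strings or one empty block string; with it, only the block string reading is valid. A minimal sketch, where `classify_string_start` is a hypothetical helper that inspects only the opening quotes and does not scan for the closing delimiter:

```python
def classify_string_start(source: str, pos: int = 0) -> str:
    """Decide which StringValue alternative can begin at `pos`.

    Hypothetical helper for illustration only.
    """
    if source.startswith('"""', pos):
        # `""` followed by `"` violates the lookahead restriction on the
        # empty string, so only the block string alternative applies.
        return "block string"
    if source.startswith('""', pos):
        # `""` not followed by another `"`: the empty string alternative.
        return "empty string"
    if source.startswith('"', pos):
        # `"` StringCharacter+ `"`: at least one character must follow.
        return "string"
    return "not a string"


print(classify_string_start('""""""'))  # block string
print(classify_string_start('"" "x"'))  # empty string
```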
@@ -89,7 +113,7 @@ Note: Block string values are interpreted to exclude blank initial and trailing
 lines and uniform indentation with {BlockStringValue()}.
 
 
-## Document
+## Document Syntax
 
 Document : Definition+