Clarify lexing is greedy with lookahead restrictions. (#599)
GraphQL syntactical grammars are intended to be unambiguous. Lexical grammars should be as well, but there has historically been an assumption that lexical parsing is greedy. This is obvious for numbers and words, but less obvious for empty block strings.

This also removes regular expression representation from the lexical grammar notation, since it wasn't always clear.

Either way, the additional clarity removes ambiguity from the spec.

Partial fix for #564

Specifically addresses #564 (comment)
leebyron authored Jan 10, 2020
1 parent e491220 commit a73cd6f
Showing 3 changed files with 192 additions and 67 deletions.
46 changes: 30 additions & 16 deletions spec/Appendix A -- Notation Conventions.md
@@ -22,8 +22,10 @@ of the sequences it is defined by, until all non-terminal symbols have been
replaced by terminal characters.

Terminals are represented in this document in a monospace font in two forms: a
specific Unicode character or sequence of Unicode characters (ex. {`=`} or {`terminal`}), and a pattern of Unicode characters defined by a regular expression
(ex {/[0-9]+/}).
specific Unicode character or sequence of Unicode characters (ie. {`=`} or
{`terminal`}), and prose typically describing a specific Unicode code-point
{"Space (U+0020)"}. Sequences of Unicode characters only appear in syntactic
grammars and represent a {Name} token of that specific sequence.

Non-terminal production rules are represented in this document using the
following notation for a non-terminal with a single definition:
@@ -48,23 +50,25 @@ ListOfLetterA :

The GraphQL language is defined in a syntactic grammar where terminal symbols
are tokens. Tokens are defined in a lexical grammar which matches patterns of
source characters. The result of parsing a sequence of source Unicode characters
produces a GraphQL AST.
source characters. The result of parsing a source text sequence of Unicode
characters first produces a sequence of lexical tokens according to the lexical
grammar which then produces abstract syntax tree (AST) according to the
syntactical grammar.

A Lexical grammar production describes non-terminal "tokens" by
A lexical grammar production describes non-terminal "tokens" by
patterns of terminal Unicode characters. No "whitespace" or other ignored
characters may appear between any terminal Unicode characters in the lexical
grammar production. A lexical grammar production is distinguished by a two colon
`::` definition.

Word :: /[A-Za-z]+/
Word :: Letter+

A Syntactical grammar production describes non-terminal "rules" by patterns of
terminal Tokens. Whitespace and other ignored characters may appear before or
after any terminal Token. A syntactical grammar production is distinguished by a
one colon `:` definition.
terminal Tokens. {WhiteSpace} and other {Ignored} sequences may appear before or
after any terminal {Token}. A syntactical grammar production is distinguished by
a one colon `:` definition.

Sentence : Noun Verb
Sentence : Word+ `.`


## Grammar Notation
@@ -80,13 +84,11 @@ and their expanded definitions in the context-free grammar.
A grammar production may specify that certain expansions are not permitted by
using the phrase "but not" and then indicating the expansions to be excluded.

For example, the production:
For example, the following production means that the nonterminal {SafeWord} may
be replaced by any sequence of characters that could replace {Word} provided
that the same sequence of characters could not replace {SevenCarlinWords}.

SafeName : Name but not SevenCarlinWords

means that the nonterminal {SafeName} may be replaced by any sequence of
characters that could replace {Name} provided that the same sequence of
characters could not replace {SevenCarlinWords}.
SafeWord : Word but not SevenCarlinWords

A grammar may also list a number of restrictions after "but not" separated
by "or".
@@ -96,6 +98,18 @@ For example:
NonBooleanName : Name but not `true` or `false`


**Lookahead Restrictions**

A grammar production may specify that certain characters or tokens are not
permitted to follow it by using the pattern {[lookahead != NotAllowed]}.
Lookahead restrictions are often used to remove ambiguity from the grammar.

The following example makes it clear that {Letter+} must be greedy, since {Word}
cannot be followed by yet another {Letter}.

Word :: Letter+ [lookahead != Letter]
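
As an illustration of what the lookahead restriction implies for an implementation, here is a minimal Python sketch of a greedy matcher for `Word :: Letter+ [lookahead != Letter]`. The function name `lex_word` and the error type are assumptions made for this example; they do not come from the spec or from any reference implementation.

```python
import string

# Letter as defined in the grammar summary: A-Z and a-z only.
LETTERS = set(string.ascii_letters)


def lex_word(source, start=0):
    """Match Word :: Letter+ [lookahead != Letter] at offset `start`.

    Returns (matched_text, end_offset). Consuming letters until the next
    character is not a Letter is exactly what the lookahead restriction
    requires, so the match is greedy by construction.
    """
    end = start
    while end < len(source) and source[end] in LETTERS:
        end += 1
    if end == start:
        raise ValueError(f"expected a Letter at offset {start}")
    return source[start:end], end
```

Without the restriction, the input `abc` could also be read as the two words `ab` and `c`. With it, the only valid `Word` starting at offset 0 is `abc`.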


**Optionality and Lists**

A subscript suffix "{Symbol?}" is shorthand for two possible sequences, one
48 changes: 36 additions & 12 deletions spec/Appendix B -- Grammar Summary.md
@@ -1,6 +1,12 @@
# B. Appendix: Grammar Summary

SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
## Source Text

SourceCharacter ::
- "U+0009"
- "U+000A"
- "U+000D"
- "U+0020–U+FFFF"
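
The enumerated form of `SourceCharacter` translates directly into a code point check. The sketch below assumes the ranges exactly as listed in this revision of the grammar; `is_source_character` is a hypothetical helper name.

```python
def is_source_character(ch):
    """Return True if the single character `ch` is a SourceCharacter.

    Mirrors the four alternatives in the production: U+0009, U+000A,
    U+000D, and the range U+0020 through U+FFFF.
    """
    code = ord(ch)
    return code in (0x0009, 0x000A, 0x000D) or 0x0020 <= code <= 0xFFFF
```

Under this definition, other control characters such as U+0000 and any code point above U+FFFF are rejected.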


## Ignored Tokens
@@ -20,10 +26,10 @@ WhiteSpace ::

LineTerminator ::
- "New Line (U+000A)"
- "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
- "Carriage Return (U+000D)" [lookahead != "New Line (U+000A)"]
- "Carriage Return (U+000D)" "New Line (U+000A)"
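
The lookahead on the lone Carriage Return alternative means that a CR immediately followed by LF forms a single `LineTerminator`, not two. A minimal Python sketch of that rule follows; `count_line_terminators` is a hypothetical helper written for this example.

```python
def count_line_terminators(source):
    """Count LineTerminator tokens, treating CR LF as one terminator."""
    count = 0
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "\r":
            # A CR followed by LF is the two-character alternative,
            # so skip the LF instead of counting it separately.
            if i + 1 < len(source) and source[i + 1] == "\n":
                i += 1
            count += 1
        elif ch == "\n":
            count += 1
        i += 1
    return count
```

This matters for anything that reports line numbers, since a file with Windows line endings should not appear to have twice as many lines.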

Comment :: `#` CommentChar*
Comment :: `#` CommentChar* [lookahead != CommentChar]

CommentChar :: SourceCharacter but not LineTerminator

@@ -41,24 +47,41 @@ Token ::

Punctuator :: one of ! $ & ( ) ... : = @ [ ] { | }

Name :: /[_A-Za-z][_0-9A-Za-z]*/
Name ::
- NameStart NameContinue* [lookahead != NameContinue]

NameStart ::
- Letter
- `_`

NameContinue ::
- Letter
- Digit
- `_`

IntValue :: IntegerPart
Letter :: one of
`A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M`
`N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z`
`a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m`
`n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z`

Digit :: one of
`0` `1` `2` `3` `4` `5` `6` `7` `8` `9`

IntValue :: IntegerPart [lookahead != {Digit, `.`, ExponentPart}]

IntegerPart ::
- NegativeSign? 0
- NegativeSign? NonZeroDigit Digit*

NegativeSign :: -

Digit :: one of 0 1 2 3 4 5 6 7 8 9

NonZeroDigit :: Digit but not `0`

FloatValue ::
- IntegerPart FractionalPart
- IntegerPart ExponentPart
- IntegerPart FractionalPart ExponentPart
- IntegerPart FractionalPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}]
- IntegerPart FractionalPart [lookahead != {Digit, `.`, ExponentIndicator}]
- IntegerPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}]
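
To show how the numeric lookahead restrictions behave, here is a rough Python sketch that matches a number greedily and then checks the character that follows. It follows the restrictions as written in this commit only. The names `NUMBER` and `lex_number`, and the choice of `ValueError`, are assumptions made for the example.

```python
import re

# IntegerPart, then an optional FractionalPart, then an optional ExponentPart.
NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def lex_number(source, start=0):
    """Return (kind, text, end_offset) for an IntValue or FloatValue."""
    match = NUMBER.match(source, start)
    if match is None:
        raise ValueError(f"expected a number at offset {start}")
    end = match.end()
    is_float = match.group(1) is not None or match.group(2) is not None
    # IntValue may not be followed by Digit, `.` or ExponentPart. A complete
    # ExponentPart would already have been consumed by the greedy match, so
    # only Digit and `.` need checking in that case.
    # FloatValue may not be followed by Digit, `.` or ExponentIndicator.
    forbidden = "0123456789." + ("eE" if is_float else "")
    if end < len(source) and source[end] in forbidden:
        raise ValueError(f"unexpected {source[end]!r} after number at offset {end}")
    kind = "FloatValue" if is_float else "IntValue"
    return kind, match.group(0), end
```

With these checks, inputs such as `01`, `1.2.3` and `1.5e` are rejected outright instead of being split into several shorter tokens.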

FractionalPart :: . Digit+

@@ -69,7 +92,8 @@ ExponentIndicator :: one of `e` `E`
Sign :: one of + -

StringValue ::
- `"` StringCharacter* `"`
- `""` [lookahead != `"`]
- `"` StringCharacter+ `"`
- `"""` BlockStringCharacter* `"""`
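
The empty-string alternative with its lookahead is what keeps `""` from being matched at the start of a block string. One way to sketch that decision in Python is to test the longest opening first. `string_opening` and its return labels are hypothetical names chosen for this example.

```python
def string_opening(source, start=0):
    """Classify which StringValue alternative begins at offset `start`."""
    if source.startswith('"""', start):
        return "block"   # `"""` BlockStringCharacter* `"""`
    if source.startswith('""', start):
        # The next character is not `"` here, since the branch above
        # did not match, so the lookahead restriction is satisfied.
        return "empty"   # `""` [lookahead != `"`]
    if source.startswith('"', start):
        return "quoted"  # `"` StringCharacter+ `"`
    raise ValueError(f"expected a string at offset {start}")
```

Under this reading, `""""""` begins a block string (an empty one) and is not three consecutive empty strings.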

StringCharacter ::
@@ -89,7 +113,7 @@ Note: Block string values are interpreted to exclude blank initial and trailing
lines and uniform indentation with {BlockStringValue()}.


## Document
## Document Syntax

Document : Definition+
