diff --git a/spec/Appendix A -- Notation Conventions.md b/spec/Appendix A -- Notation Conventions.md index cbb8e8a3a..14d55fc70 100644 --- a/spec/Appendix A -- Notation Conventions.md +++ b/spec/Appendix A -- Notation Conventions.md @@ -22,8 +22,10 @@ of the sequences it is defined by, until all non-terminal symbols have been replaced by terminal characters. Terminals are represented in this document in a monospace font in two forms: a -specific Unicode character or sequence of Unicode characters (ex. {`=`} or {`terminal`}), and a pattern of Unicode characters defined by a regular expression -(ex {/[0-9]+/}). +specific Unicode character or sequence of Unicode characters (e.g. {`=`} or +{`terminal`}), and prose typically describing a specific Unicode code-point +{"Space (U+0020)"}. Sequences of Unicode characters only appear in syntactic +grammars and represent a {Name} token of that specific sequence. Non-terminal production rules are represented in this document using the following notation for a non-terminal with a single definition: @@ -48,23 +50,25 @@ ListOfLetterA : The GraphQL language is defined in a syntactic grammar where terminal symbols are tokens. Tokens are defined in a lexical grammar which matches patterns of -source characters. The result of parsing a sequence of source Unicode characters -produces a GraphQL AST. +source characters. The result of parsing a source text sequence of Unicode +characters first produces a sequence of lexical tokens according to the lexical +grammar which then produces an abstract syntax tree (AST) according to the +syntactical grammar. -A Lexical grammar production describes non-terminal "tokens" by +A lexical grammar production describes non-terminal "tokens" by patterns of terminal Unicode characters. No "whitespace" or other ignored characters may appear between any terminal Unicode characters in the lexical grammar production. A lexical grammar production is distinguished by a two colon `::` definition. 
-Word :: /[A-Za-z]+/ +Word :: Letter+ A Syntactical grammar production describes non-terminal "rules" by patterns of -terminal Tokens. Whitespace and other ignored characters may appear before or -after any terminal Token. A syntactical grammar production is distinguished by a -one colon `:` definition. +terminal Tokens. {WhiteSpace} and other {Ignored} sequences may appear before or +after any terminal {Token}. A syntactical grammar production is distinguished by +a one colon `:` definition. -Sentence : Noun Verb +Sentence : Word+ `.` ## Grammar Notation @@ -80,13 +84,11 @@ and their expanded definitions in the context-free grammar. A grammar production may specify that certain expansions are not permitted by using the phrase "but not" and then indicating the expansions to be excluded. -For example, the production: +For example, the following production means that the nonterminal {SafeWord} may +be replaced by any sequence of characters that could replace {Word} provided +that the same sequence of characters could not replace {SevenCarlinWords}. -SafeName : Name but not SevenCarlinWords - -means that the nonterminal {SafeName} may be replaced by any sequence of -characters that could replace {Name} provided that the same sequence of -characters could not replace {SevenCarlinWords}. +SafeWord : Word but not SevenCarlinWords A grammar may also list a number of restrictions after "but not" separated by "or". @@ -96,6 +98,18 @@ For example: NonBooleanName : Name but not `true` or `false` +**Lookahead Restrictions** + +A grammar production may specify that certain characters or tokens are not +permitted to follow it by using the pattern {[lookahead != NotAllowed]}. +Lookahead restrictions are often used to remove ambiguity from the grammar. + +The following example makes it clear that {Letter+} must be greedy, since {Word} +cannot be followed by yet another {Letter}. 
+ +Word :: Letter+ [lookahead != Letter] + + **Optionality and Lists** A subscript suffix "{Symbol?}" is shorthand for two possible sequences, one diff --git a/spec/Appendix B -- Grammar Summary.md b/spec/Appendix B -- Grammar Summary.md index efdcae8f8..a0308e79c 100644 --- a/spec/Appendix B -- Grammar Summary.md +++ b/spec/Appendix B -- Grammar Summary.md @@ -1,6 +1,12 @@ # B. Appendix: Grammar Summary -SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/ +## Source Text + +SourceCharacter :: + - "U+0009" + - "U+000A" + - "U+000D" + - "U+0020–U+FFFF" ## Ignored Tokens @@ -20,10 +26,10 @@ WhiteSpace :: LineTerminator :: - "New Line (U+000A)" - - "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ] + - "Carriage Return (U+000D)" [lookahead != "New Line (U+000A)"] - "Carriage Return (U+000D)" "New Line (U+000A)" -Comment :: `#` CommentChar* +Comment :: `#` CommentChar* [lookahead != CommentChar] CommentChar :: SourceCharacter but not LineTerminator @@ -41,9 +47,28 @@ Token :: Punctuator :: one of ! $ & ( ) ... : = @ [ ] { | } -Name :: /[_A-Za-z][_0-9A-Za-z]*/ +Name :: + - NameStart NameContinue* [lookahead != NameContinue] + +NameStart :: + - Letter + - `_` + +NameContinue :: + - Letter + - Digit + - `_` -IntValue :: IntegerPart +Letter :: one of + `A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M` + `N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z` + `a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m` + `n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z` + +Digit :: one of + `0` `1` `2` `3` `4` `5` `6` `7` `8` `9` + +IntValue :: IntegerPart [lookahead != {Digit, `.`, ExponentIndicator}] IntegerPart :: - NegativeSign? 
0 @@ -51,14 +76,12 @@ IntegerPart :: NegativeSign :: - -Digit :: one of 0 1 2 3 4 5 6 7 8 9 - NonZeroDigit :: Digit but not `0` FloatValue :: - - IntegerPart FractionalPart - - IntegerPart ExponentPart - - IntegerPart FractionalPart ExponentPart + - IntegerPart FractionalPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}] + - IntegerPart FractionalPart [lookahead != {Digit, `.`, ExponentIndicator}] + - IntegerPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}] FractionalPart :: . Digit+ @@ -69,7 +92,8 @@ ExponentIndicator :: one of `e` `E` Sign :: one of + - StringValue :: - - `"` StringCharacter* `"` + - `""` [lookahead != `"`] + - `"` StringCharacter+ `"` - `"""` BlockStringCharacter* `"""` StringCharacter :: @@ -89,7 +113,7 @@ Note: Block string values are interpreted to exclude blank initial and trailing lines and uniform indentation with {BlockStringValue()}. -## Document +## Document Syntax Document : Definition+ diff --git a/spec/Section 2 -- Language.md b/spec/Section 2 -- Language.md index ba8123cb1..189b74796 100644 --- a/spec/Section 2 -- Language.md +++ b/spec/Section 2 -- Language.md @@ -7,16 +7,50 @@ common unit of composition allowing for query reuse. A GraphQL document is defined as a syntactic grammar where terminal symbols are tokens (indivisible lexical units). These tokens are defined in a lexical -grammar which matches patterns of source characters (defined by a -double-colon `::`). +grammar which matches patterns of source characters. In this document, syntactic +grammar productions are distinguished with a colon `:` while lexical grammar +productions are distinguished with a double-colon `::`. -Note: See [Appendix A](#sec-Appendix-Notation-Conventions) for more details about the definition of lexical and syntactic grammar and other notational conventions -used in this document. +The source text of a GraphQL document must be a sequence of {SourceCharacter}. 
+The character sequence must be described by a sequence of {Token} and {Ignored} +lexical grammars. The lexical token sequence, omitting {Ignored}, must be +described by a single {Document} syntactic grammar. + +Note: See [Appendix A](#sec-Appendix-Notation-Conventions) for more information +about the lexical and syntactic grammar and other notational conventions used +throughout this document. + +**Lexical Analysis & Syntactic Parse** + +The source text of a GraphQL document is first converted into a sequence of +lexical tokens, {Token}, and ignored tokens, {Ignored}. The source text is +scanned from left to right, repeatedly taking the next possible sequence of +code-points allowed by the lexical grammar productions as the next token. This +sequence of lexical tokens is then scanned from left to right to produce an +abstract syntax tree (AST) according to the {Document} syntactical grammar. + +Lexical grammar productions in this document use *lookahead restrictions* to +remove ambiguity and ensure a single valid lexical analysis. A lexical token is +only valid if not followed by a character in its lookahead restriction. + +For example, an {IntValue} has the restriction {[lookahead != Digit]}, so it cannot +be followed by a {Digit}. Because of this, the sequence {`123`} cannot represent +the tokens ({`12`}, {`3`}) since {`12`} is followed by the {Digit} {`3`} and +so must only represent a single token. Use {WhiteSpace} or other {Ignored} +between characters to represent multiple tokens. + +Note: This typically has the same behavior as a +"[maximal munch](https://en.wikipedia.org/wiki/Maximal_munch)" longest possible +match, however some lookahead restrictions include additional constraints. ## Source Text -SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/ +SourceCharacter :: + - "U+0009" + - "U+000A" + - "U+000D" + - "U+0020–U+FFFF" GraphQL documents are expressed as a sequence of [Unicode](https://unicode.org/standard/standard.html) characters. 
However, with @@ -60,7 +94,7 @@ control tools. LineTerminator :: - "New Line (U+000A)" - - "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ] + - "Carriage Return (U+000D)" [lookahead != "New Line (U+000A)"] - "Carriage Return (U+000D)" "New Line (U+000A)" Like white space, line terminators are used to improve the legibility of source @@ -75,19 +109,20 @@ the line number. ### Comments -Comment :: `#` CommentChar* +Comment :: `#` CommentChar* [lookahead != CommentChar] CommentChar :: SourceCharacter but not LineTerminator GraphQL source documents may contain single-line comments, starting with the {`#`} marker. -A comment can contain any Unicode code point except {LineTerminator} so a -comment always consists of all code points starting with the {`#`} character up -to but not including the line terminator. +A comment can contain any Unicode code point in {SourceCharacter} except +{LineTerminator} so a comment always consists of all code points starting with +the {`#`} character up to but not including the {LineTerminator} (or end of +the source). -Comments behave like white space and may appear after any token, or before a -line terminator, and have no significance to the semantic meaning of a +Comments are {Ignored} like white space and may appear after any token, or +before a {LineTerminator}, and have no significance to the semantic meaning of a GraphQL Document. @@ -118,8 +153,7 @@ Token :: A GraphQL document is comprised of several kinds of indivisible lexical tokens defined here in a lexical grammar by patterns of source Unicode characters. -Tokens are later used as terminal symbols in a GraphQL Document -syntactic grammars. +Tokens are later used as terminal symbols in GraphQL syntactic grammar rules. ### Ignored Tokens @@ -131,15 +165,16 @@ Ignored :: - Comment - Comma -Before and after every lexical token may be any amount of ignored tokens -including {WhiteSpace} and {Comment}. 
No ignored regions of a source -document are significant, however otherwise ignored source characters may appear -within a lexical token in a significant way, for example a {StringValue} may -contain white space characters and commas. +{Ignored} tokens are used to improve readability and provide separation between +{Token}, but are otherwise insignificant and not referenced in syntactical +grammar productions. -No characters are ignored while parsing a given token, as an example no -white space characters are permitted between the characters defining a -{FloatValue}. +Any amount of {Ignored} may appear before and after every lexical token. No +ignored regions of a source document are significant, however ignored source +characters may appear within a lexical token in a significant way, for example a +{StringValue} may contain white space characters. No characters are ignored +within a {Token}, as an example no white space characters are permitted between +the characters defining a {FloatValue}. ### Punctuators @@ -153,7 +188,26 @@ lacks the punctuation often used to describe mathematical expressions. ### Names -Name :: /[_A-Za-z][_0-9A-Za-z]*/ +Name :: + - NameStart NameContinue* [lookahead != NameContinue] + +NameStart :: + - Letter + - `_` + +NameContinue :: + - Letter + - Digit + - `_` + +Letter :: one of + `A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M` + `N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z` + `a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m` + `n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z` + +Digit :: one of + `0` `1` `2` `3` `4` `5` `6` `7` `8` `9` GraphQL Documents are full of named things: operations, fields, arguments, types, directives, fragments, and variables. All names must follow the same @@ -163,8 +217,13 @@ Names in GraphQL are case-sensitive. That is to say `name`, `Name`, and `NAME` all refer to different names. Underscores are significant, which means `other_name` and `othername` are two different names. 
-Names in GraphQL are limited to this ASCII subset of possible -characters to support interoperation with as many other systems as possible. +A {Name} must not be followed by a {NameContinue}. In other words, a {Name} +token is always the longest possible valid sequence. The source characters +{`a1`} cannot be interpreted as two tokens since {`a`} is followed by the {NameContinue} {`1`}. + +Note: Names in GraphQL are limited to the Latin ASCII subset +of {SourceCharacter} in order to support interoperation with as many other +systems as possible. ## Document @@ -666,7 +725,7 @@ specified as a variable. List and inputs objects may also contain variables (unl ### Int Value -IntValue :: IntegerPart +IntValue :: IntegerPart [lookahead != {Digit, `.`, ExponentIndicator}] IntegerPart :: - NegativeSign? 0 @@ -674,19 +733,27 @@ IntegerPart :: NegativeSign :: - -Digit :: one of 0 1 2 3 4 5 6 7 8 9 - NonZeroDigit :: Digit but not `0` -An Int number is specified without a decimal point or exponent (ex. `1`). +An {IntValue} is specified without a decimal point or exponent but may be +negative (ex. {-123}). It must not have any leading {0}. + +An {IntValue} must not be followed by a {Digit}. In other words, an {IntValue} +token is always the longest possible valid sequence. The source characters +{12} cannot be interpreted as two tokens since {1} is followed by the {Digit} +{2}. This also means the source {00} is invalid since it can neither be +interpreted as a single token nor two {0} tokens. + +An {IntValue} must not be followed by a {.} or {ExponentIndicator}. If either +follows then the token must only be interpreted as a possible {FloatValue}. 
### Float Value FloatValue :: - - IntegerPart FractionalPart - - IntegerPart ExponentPart - - IntegerPart FractionalPart ExponentPart + - IntegerPart FractionalPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}] + - IntegerPart FractionalPart [lookahead != {Digit, `.`, ExponentIndicator}] + - IntegerPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}] FractionalPart :: . Digit+ @@ -696,8 +763,18 @@ ExponentIndicator :: one of `e` `E` Sign :: one of + - -A Float number includes either a decimal point (ex. `1.0`) or an exponent -(ex. `1e50`) or both (ex. `6.0221413e23`). +A {FloatValue} includes either a decimal point (ex. {1.0}) or an exponent +(ex. {1e50}) or both (ex. {6.0221413e23}) and may be negative. Like {IntValue}, +it also must not have any leading {0}. + +A {FloatValue} must not be followed by a {Digit}. In other words, a {FloatValue} +token is always the longest possible valid sequence. The source characters +{1.23} cannot be interpreted as two tokens since {1.2} is followed by the +{Digit} {3}. + +A {FloatValue} must not be followed by a {.} or {ExponentIndicator}. If either +follows then a parse error occurs. For example, the sequence {1.23.4} cannot be +interpreted as two tokens ({1.2}, {3.4}). ### Boolean Value @@ -710,7 +787,8 @@ The two keywords `true` and `false` represent the two boolean values. ### String Value StringValue :: - - `"` StringCharacter* `"` + - `""` [lookahead != `"`] + - `"` StringCharacter+ `"` - `"""` BlockStringCharacter* `"""` StringCharacter :: @@ -726,10 +804,15 @@ BlockStringCharacter :: - SourceCharacter but not `"""` or `\"""` - `\"""` -Strings are sequences of characters wrapped in double-quotes (`"`). (ex. -`"Hello World"`). White space and other otherwise-ignored characters are +Strings are sequences of characters wrapped in quotation marks (U+0022). +(ex. {`"Hello World"`}). White space and other otherwise-ignored characters are significant within a string value. 
+The empty string {`""`} must not be followed by another {`"`} otherwise it would +be interpreted as the beginning of a block string. As an example, the source +{`""""""`} can only be interpreted as a single empty block string and not three +empty strings. + Note: Unicode characters are allowed within String value literals, however {SourceCharacter} must not contain some ASCII control characters so escape sequences must be used to represent these characters. @@ -790,10 +873,14 @@ block string. **Semantics** -StringValue :: `"` StringCharacter* `"` +StringValue :: `""` + + * Return an empty sequence. + +StringValue :: `"` StringCharacter+ `"` * Return the Unicode character sequence of all {StringCharacter} - Unicode character values (which may be an empty sequence). + Unicode character values. StringCharacter :: SourceCharacter but not `"` or \ or LineTerminator
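
The lookahead restrictions this diff adds to {Name}, {IntValue} and {FloatValue} can be illustrated with a small lexer. The following is a minimal sketch in Python, not part of the patch and not the reference implementation. The function `tokenize`, the pattern names and the simplified `IGNORED` pattern (spaces, tabs, line terminators, commas and `#` comments only) are assumptions made for illustration. Strings, punctuators and the rest of the grammar are left out.

```python
import re

# Hypothetical names; the patterns transcribe the productions shown in the diff.
# Name :: NameStart NameContinue* [lookahead != NameContinue]
NAME = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")
# IntegerPart, then optional FractionalPart (group 1) and ExponentPart (group 2).
NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")
# Simplified stand-in for Ignored (assumption: BOM and other cases omitted).
IGNORED = re.compile(r"(?:[ \t\r\n,]|#[^\r\n]*)+")
# [lookahead != {Digit, `.`, ExponentIndicator}]
NUMBER_FORBIDDEN_NEXT = set("0123456789.eE")


def tokenize(source):
    """Return (kind, text) pairs for Name, IntValue and FloatValue tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        ignored = IGNORED.match(source, pos)
        if ignored:
            pos = ignored.end()
            continue
        name = NAME.match(source, pos)
        if name:
            # Greedy matching already satisfies [lookahead != NameContinue].
            tokens.append(("Name", name.group()))
            pos = name.end()
            continue
        number = NUMBER.match(source, pos)
        if number:
            end = number.end()
            # Enforce the lookahead restriction explicitly: a number token is
            # only valid if the next character is not a digit, `.`, `e` or `E`.
            if end < len(source) and source[end] in NUMBER_FORBIDDEN_NEXT:
                raise SyntaxError(
                    f"invalid number: unexpected {source[end]!r} at offset {end}"
                )
            is_float = number.group(1) or number.group(2)
            tokens.append(("FloatValue" if is_float else "IntValue", number.group()))
            pos = end
            continue
        raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
    return tokens
```

Under these assumptions, `tokenize("12 3")` returns two `IntValue` tokens and `tokenize("a1")` returns a single `Name` token, while `tokenize("00")` and `tokenize("1.23.4")` raise `SyntaxError`, matching the examples given in the Int Value and Float Value prose.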