Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Number value literal lookahead restrictions #601

Merged
merged 1 commit into from
Jan 10, 2020

Conversation

leebyron
Copy link
Collaborator

This RFC proposes adding a lookahead restriction to the IntValue and FloatValue lexical grammars to not allow following a number with a letter.

Problem:

Currently there are some language ambiguities and underspecification for lexing numbers which each implementation has handled slightly differently.

Because commas are optional and white space isn't required between tokens, these two snippets are equivalent: [123, abc], [123abc]. This may be confusing to read, but it should parse correctly. However the opposite is not true, since digits may belong in a Name, the following two are not equivalent: [abc, 123], [abc123]. This could lead to mistakes.

Ambiguity and underspecification enter when the Name starts with "e", since "e" indicats the beginning of an exponent in a FloatValue. 123efg is a lexical error in GraphQL.js which greedily starts to lex a FloatValue when it encounters the "e", however you might also expect it to validly lex (123, efg) and some implementations might do this.

Further, other languages offer variations of numeric literals which GraphQL does not support, such as hexidecimal literals. The input 0x1F properly lexes as (0, x, 1, F) however this is very likely a confusing syntax error. A similar issue exists for some languages which allow underscores in numbers for readability, 1_000 lexes a 1 and _ but fails when 000 is not a valid number.

Proposed Solution:

Add a lookahead restriction to IntValue and FloatValue to disallow any NameStart character (including letters and _) to follow.

This makes it clear that 1e5 can only possibly be one FloatValue and not three tokens, makes lexer errors specified clearly to remove ambiguity, and provides clear errors for mistaken input.

Precedent

Javascript applies this same restriction for similar reasons, I believe originally to produce an early error if C-style typed literals were used in a Javascript program.

https://www.ecma-international.org/ecma-262/10.0/index.html#sec-literals-numeric-literals

Cost of change

While this is technically a breaking change to the language grammar, it seeks to restrict cases that are almost certainly already producing either syntax or validation errors.

This is different from the current implementation of GraphQL.js and I believe other parsers, and will require minor implementation updates.

@leebyron leebyron added the 💭 Strawman (RFC 0) RFC Stage 0 (See CONTRIBUTING.md) label Jul 30, 2019
@leebyron leebyron self-assigned this Jul 30, 2019
@leebyron
Copy link
Collaborator Author

I staged this PR atop the editorial-greedy-lexer branch since it depends on that fix. I'll update once that's reviewed and landed, but I wanted to get this up for discussion.

leebyron added a commit to graphql/graphql-wg that referenced this pull request Aug 1, 2019
@michaelstaib
Copy link
Member

michaelstaib commented Aug 1, 2019

This essentially means that we have to lookahead multiple characters in case of 123efg. While e could be part of a number the f following e makes it clear to the lexer that the 123 is an int token and the following efg is a name token.

Add a lookahead restriction to IntValue and FloatValue to disallow any NameStart character (including letters and _) to follow.

Disallow any name NameStart character except e.

@leebyron
Copy link
Collaborator Author

leebyron commented Aug 1, 2019

You should only ever have to look one character ahead. It’s exactly this reason why e needs to be included in the restriction.

This proposes that both Int and Float tokens cannot be followed by a letter. So in the case of 123efg:

At position 123 the lexer knows it is producing either an Int or a Float token and considers the next character. At position 123e the lexer knows that 123 cannot be a valid Int token since it cannot be followed by a letter. It could still be a Float token since we’re now in the grammar for exponent. It considers the next character. At 123ef it knows this no longer matches the Float token grammar and no more tokens are legal and produces a syntax error. If we were to exclude e from the lookahead restriction, then we would have to backtrack to produce a valid Int token, but that’s exactly what I’m trying to prevent in this RFC

@leebyron leebyron added 💡 Proposal (RFC 1) RFC Stage 1 (See CONTRIBUTING.md) 🚀 Next Stage? This RFC believes it is ready for the next stage and removed 💭 Strawman (RFC 0) RFC Stage 0 (See CONTRIBUTING.md) 🚀 Next Stage? This RFC believes it is ready for the next stage labels Aug 1, 2019
@leebyron leebyron force-pushed the editorial-greedy-lexer branch 3 times, most recently from 100d8fe to 1ebfc40 Compare August 7, 2019 05:33
@leebyron leebyron force-pushed the rfc-number-restriction branch from 73c0786 to 50ac83c Compare August 7, 2019 05:55
@leebyron
Copy link
Collaborator Author

leebyron commented Aug 7, 2019

Updated to rebase on #599 and include some additional prose and non-normative notes explaining the impact of this restriction and the rationale.

Suggested test cases (thanks @robzhu for suggesting to document these):

These test cases document scenarios which have a change in behavior after applying this RFC (except for the last clarification test case for each set). Don't forget to also look at the test cases from #599 (see this comment).

  • IntValue
    • 0xF1 is now a parse error, no longer tokens [0, xF1]
    • 0b10 is now a parse error, no longer tokens [0, b10]
    • 123abc is now a parse error, no longer tokens [123, abc]
    • 1_234 is now a parse error, no longer tokens [1, _234]
    • remains a parse error since ß does not belong to NameStart
  • FloatValue
    • 1.23f is now a parse error, no longer tokens [1.23, f]
    • 1.234_5 is now a parse error, no longer tokens [1.234, _5]
    • 1.2ß remains a parse error since ß does not belong to NameStart

leebyron added a commit to graphql/graphql-js that referenced this pull request Sep 11, 2019
Implements and adds the tests described by graphql/graphql-spec#601
@leebyron leebyron added the 🚀 Next Stage? This RFC believes it is ready for the next stage label Sep 11, 2019
leebyron added a commit to graphql/graphql-js that referenced this pull request Sep 12, 2019
Implements and adds the tests described by graphql/graphql-spec#601
@IvanGoncharov IvanGoncharov added 📄 Draft (RFC 2) RFC Stage 2 (See CONTRIBUTING.md) and removed 💡 Proposal (RFC 1) RFC Stage 1 (See CONTRIBUTING.md) 🚀 Next Stage? This RFC believes it is ready for the next stage labels Sep 13, 2019
IvanGoncharov pushed a commit to graphql/graphql-js that referenced this pull request Sep 15, 2019
Implements and adds the tests described by graphql/graphql-spec#601
Cito added a commit to graphql-python/graphql-core that referenced this pull request Dec 21, 2019
@leebyron leebyron force-pushed the editorial-greedy-lexer branch 2 times, most recently from 61050cb to 8248e62 Compare January 10, 2020 20:30
@leebyron
Copy link
Collaborator Author

This RFC proposes adding a lookahead restriction to the IntValue and FloatValue lexical grammars to not allow following a number with a letter.

**Problem:**

Currently there are some language ambiguities and underspecification for lexing numbers which each implementation has handled slightly differently.

Because commas are optional and white space isn't required between tokens, these two snippets are equivalent: `[123, abc]`, `[123abc]`. This may be confusing to read, but it should parse correctly. However the opposite is not true, since digits may belong in a Name, the following two are *not* equivalent: `[abc, 123]`, `[abc123]`. This could lead to mistakes.

Ambiguity and underspecification enter when the Name starts with "e", since "e" indicats the beginning of an exponent in a FloatValue. `123efg` is a lexical error in GraphQL.js which greedily starts to lex a FloatValue when it encounters the "e", however you might also expect it to validly lex (`123`, `efg`) and some implementations might do this.

Further, other languages offer variations of numeric literals which GraphQL does not support, such as hexidecimal literals. The input `0x1F` properly lexes as (`0`, `x`, `1`, `F`) however this is very likely a confusing syntax error. A similar issue exists for some languages which allow underscores in numbers for readability, `1_000` lexes a `1` and `_` but fails when `000` is not a valid number.

**Proposed Solution:**

Add a lookahead restriction to IntValue and FloatValue to disallow any NameStart character (including letters and `_`) to follow.

This makes it clear that `1e5` can only possibly be one FloatValue and not three tokens, makes lexer errors specified clearly to remove ambiguity, and provides clear errors for mistaken input.

**Precedent**

Javascript applies this same restriction for similar reasons, I believe originally to produce an early error if C-style typed literals were used in a Javascript program.

https://www.ecma-international.org/ecma-262/10.0/index.html#sec-literals-numeric-literals

**Cost of change**

While this is *technically* a breaking change to the language grammar, it seeks to restrict cases that are almost certainly already producing either syntax or validation errors.

This is different from the current implementation of GraphQL.js and I believe other parsers, and will require minor implementation updates.
@leebyron leebyron force-pushed the rfc-number-restriction branch from 50ac83c to 4e3c343 Compare January 10, 2020 21:21
@leebyron leebyron changed the base branch from editorial-greedy-lexer to master January 10, 2020 21:21
@leebyron leebyron added 🏁 Accepted (RFC 3) RFC Stage 3 (See CONTRIBUTING.md) and removed 📄 Draft (RFC 2) RFC Stage 2 (See CONTRIBUTING.md) labels Jan 10, 2020
@leebyron leebyron merged commit 09cdaec into master Jan 10, 2020
@leebyron leebyron deleted the rfc-number-restriction branch January 10, 2020 21:25
lirsacc added a commit to lirsacc/py-gql that referenced this pull request Jan 28, 2020
Some edge cases around numbers were not handled as expected. This commit adds
test cases from the 2 RFCs clarifying the expected behaviour (
graphql/graphql-spec#601,
graphql/graphql-spec#599) and updates the Lexer to match.

This is technically a breaking change but most cases were likely to lead
to validation errors (e.g. "0xF1" being parsed as [0, xF1] when
expecting a list of integers).
@leebyron leebyron added this to the May2021 milestone Apr 9, 2021
mattstern31 pushed a commit to mattstern31/graphql-wg-membership-note that referenced this pull request Dec 1, 2022
anthonymedinawork added a commit to anthonymedinawork/py-gql that referenced this pull request Jan 24, 2024
Some edge cases around numbers were not handled as expected. This commit adds
test cases from the 2 RFCs clarifying the expected behaviour (
graphql/graphql-spec#601,
graphql/graphql-spec#599) and updates the Lexer to match.

This is technically a breaking change but most cases were likely to lead
to validation errors (e.g. "0xF1" being parsed as [0, xF1] when
expecting a list of integers).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏁 Accepted (RFC 3) RFC Stage 3 (See CONTRIBUTING.md)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants