Allow unicode characters outside the Basic Multilingual Plane #214

dylanahsmith · 2016-09-29T19:22:07Z

Currently a GraphQL document only allows a SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/, and EscapedUnicode :: /[0-9A-Fa-f]{4}/ also prevents Unicode characters above U+FFFF from being included in a GraphQL string.

Unicode code points are actually in the range 0 to 0x10FFFF. For example, unicode emoji characters like 😀 (U+1F600) have code points above U+FFFF.
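To make the restriction concrete, here is a minimal sketch that applies the SourceCharacter pattern quoted above to each code point of a document. The helper names `is_source_character` and `first_disallowed` are hypothetical and exist only for this illustration; they are not part of any GraphQL implementation.

```python
import re

# SourceCharacter as quoted in this issue: tab, LF, CR, and U+0020 through U+FFFF.
SOURCE_CHARACTER = re.compile(r"[\u0009\u000A\u000D\u0020-\uFFFF]")


def is_source_character(ch: str) -> bool:
    """Return True if a single code point falls inside the quoted range."""
    return SOURCE_CHARACTER.fullmatch(ch) is not None


def first_disallowed(document: str):
    """Return (index, 'U+XXXX') for the first disallowed code point, or None."""
    for index, ch in enumerate(document):
        if not is_source_character(ch):
            return index, f"U+{ord(ch):04X}"
    return None


if __name__ == "__main__":
    # The emoji sits at index 8 and is outside the Basic Multilingual Plane.
    print(first_disallowed('{ a(b: "\U0001F600") }'))
```

Under this pattern a document containing 😀 is rejected at the emoji itself, even though it appears inside a string literal.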

Is there any reason why the source document doesn't allow unicode characters above U+FFFF? Or can we remove that restriction? Without that restriction the limitation of the unicode escape doesn't seem problematic.

If supporting a Unicode escape for all Unicode characters is desired, then one way of handling that is the way Swift supports Unicode escapes:

An arbitrary Unicode scalar, written as \u{n}, where n is a 1–8 digit hexadecimal number with a value equal to a valid Unicode code point
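As a rough sketch of how a `\u{n}` escape in the style quoted above could be decoded, the following replaces each braced escape with the code point it names. The function name `decode_braced_escapes` is hypothetical, and rejecting surrogate values (U+D800 to U+DFFF) is an assumption based on the phrase "Unicode scalar". This handles only the braced escape; it is not a full string lexer and does not treat an escaped backslash specially.

```python
import re

# Braced escape: \u{n}, where n is 1 to 8 hexadecimal digits.
BRACED_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,8})\}")


def decode_braced_escapes(text: str) -> str:
    """Replace every \\u{n} escape in text with the character it names."""

    def replace(match: re.Match) -> str:
        value = int(match.group(1), 16)
        # Assumption: only Unicode scalar values are accepted, so values
        # above U+10FFFF and surrogate code points are rejected.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {match.group(0)}")
        return chr(value)

    return BRACED_ESCAPE.sub(replace, text)


if __name__ == "__main__":
    print(decode_braced_escapes(r"smile: \u{1F600}"))
```

One escape names one code point, so no pairing logic is needed for characters above U+FFFF.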


chris-morgan · 2016-10-14T09:41:58Z

I also was reading the spec and realised this. Given the paragraph around it:

GraphQL documents are expressed as a sequence of Unicode characters. However, with few exceptions, most of GraphQL is expressed only in the original non‐control ASCII range so as to be as widely compatible with as many existing tools, languages, and serialization formats as possible and avoid display issues in text editors and source control.

It sounds to me like an error of ignorance rather than intent.

Rust formerly had \uXXXX and \UXXXXXXXX, but changed to \u{x} (the same as Swift does) some time before Rust 1.0.

JSON (which is probably the main inspiration for the GraphQL syntax) does \uXXXX and uses the abomination that is UTF-16 surrogate pairs as a way of representing higher-order characters, e.g. U+1F600 (😀) is escaped as "\ud83d\ude00" in JSON.
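The surrogate-pair arithmetic behind that JSON escape can be sketched as follows. The function names are hypothetical; the constants are the standard UTF-16 ones, and `json.dumps` from the Python standard library is used only as a cross-check.

```python
import json


def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into UTF-16 (high, low) units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def json_escape(code_point: int) -> str:
    """Return the JSON-style \\uXXXX\\uXXXX escape for a supplementary code point."""
    high, low = to_surrogate_pair(code_point)
    return f"\\u{high:04x}\\u{low:04x}"


def json_escape_stdlib(ch: str) -> str:
    """Cross-check: what json.dumps emits for the character, minus the quotes."""
    return json.dumps(ch)[1:-1]


if __name__ == "__main__":
    print(json_escape(0x1F600))             # \ud83d\ude00
    print(json_escape_stdlib("\U0001F600"))  # \ud83d\ude00
```

For U+1F600 the offset is 0xF600, which splits into 0x3D and 0x200, giving D83D and DE00.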

Fortunately you can avoid that insanity by simply expressing values literally. There’s no real need for the escapes anyway once you get past U+001F (\u00XX) and U+0022 (\"). (Unless you deal with combining characters that will attach to a string’s quotation marks, which is fearfully ugly and points out the grammatical problem of parsing by codepoint rather than grapheme cluster, but this is all more advanced stuff that we wish wouldn’t happen in real life, anyway.)

Also currently the escapes listed (EscapedCharacter) match those of JSON. (I think. As to the interpretation, what the GraphQL spec actually says is that \f would be U+0066, “f”, rather than U+000C which is what we all know it’s supposed to be. It’s really badly written.) Given that general tie, supporting \uXXXX might not be a terrible idea, with or without \u{X}.

The definition of the handling of EscapedUnicode is also extremely tacky, with spelling errors, poorly defined terms, &c.:

Return the character value represented by the UTF16 hexidecimal identifier EscapedUnicode.

What does that even mean? Seriously, that doesn’t make sense.

This stuff all suggests to me that it was written by someone with a poor understanding of Unicode. This spec gravely needs both editorial and technical review.

I want to see how different implementations parse:

"\ud83d\ude00": nonsensical in the current specification. If GraphQL wants to be like JSON, handling it as UTF-16 surrogate pairs is probably a good idea. If not (please don’t go for surrogate pairs!), the grammar needs to be changed to allow for the supplemental planes (such as via \u{1F600}).
"😀": illegal in the current specification, shouldn’t tokenise. However, I hope that implementations accept it and treat it as a string containing the code point U+1F600.
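For comparison, here is a sketch of the JSON-like interpretation of the first case: two consecutive `\uXXXX` escapes that form a surrogate pair are combined into one code point, and a literal 😀 passes through untouched. The name `decode_json_style` is hypothetical, and raising on a lone surrogate is an assumed policy for this illustration, not something the specification discussed here prescribes.

```python
import re

# Either a high+low surrogate pair of \uXXXX escapes, or a single \uXXXX escape.
ESCAPE = re.compile(
    r"\\u([Dd][89ABab][0-9A-Fa-f]{2})\\u([Dd][C-Fc-f][0-9A-Fa-f]{2})"
    r"|\\u([0-9A-Fa-f]{4})"
)


def decode_json_style(text: str) -> str:
    """Decode \\uXXXX escapes, combining surrogate pairs the way JSON does."""

    def replace(match: re.Match) -> str:
        if match.group(1) is not None:
            high = int(match.group(1), 16)
            low = int(match.group(2), 16)
            return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
        value = int(match.group(3), 16)
        # Assumed policy: an unpaired surrogate is an error.
        if 0xD800 <= value <= 0xDFFF:
            raise ValueError(f"lone surrogate: {match.group(0)}")
        return chr(value)

    return ESCAPE.sub(replace, text)


if __name__ == "__main__":
    print(decode_json_style(r"\ud83d\ude00") == "\U0001F600")  # True
```

Both inputs then yield the same one-code-point string, which is the behaviour hoped for in the second case.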

leebyron · 2016-10-28T19:49:33Z

Thanks for bringing this up! Great thought process already happening.

I agree that surrogate pairs are an obtuse API. I'd like to avoid them if possible, though there is one serious upside to consider: it mirrors JSON. That might not be enough to motivate it as the solution, but it certainly shouldn't be discredited.

Here are some action items:

Editing of the language section describing Unicode to correct and clarify.
Propose expanding the parsable character set to all represented in latest Unicode including supplemental planes.
Propose a new escape sequence for string literals (or prescribe to always use Unicode characters directly) for supplemental planes.


leebyron · 2016-10-28T21:58:56Z

@dylanahsmith and @chris-morgan I'd love your feedback on #231


Nabellaleen · 2019-06-13T12:32:39Z

👍 for this spec change!

It would make it possible to build an enum like:

enum MOOD {
  😩
  😞
  😕
  😐
  🙂
  😃
}

chris-morgan · 2019-06-13T12:46:11Z

@Nabellaleen Allowing emoji in an identifier is a completely different thing from allowing it in a source document, which is mostly for the sake of strings. And allowing emoji in identifiers is generally a poor idea; most languages stick with UAX #31’s definition for identifiers.
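The distinction can be illustrated with a short sketch. It assumes the GraphQL Name grammar is `/[_A-Za-z][_0-9A-Za-z]*/`, which is not quoted in this thread, and uses Python's `str.isidentifier()`, whose rules are based on UAX #31, as a stand-in for how most languages define identifiers. The helper `is_graphql_name` is hypothetical.

```python
import re

# Assumption: the GraphQL Name production, restricted to ASCII letters,
# digits, and underscore. It is not quoted in this thread.
GRAPHQL_NAME = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")


def is_graphql_name(text: str) -> bool:
    """Return True if text is a valid Name under the assumed grammar."""
    return GRAPHQL_NAME.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("HAPPY", "caf\u00e9", "\U0001F629"):
        # str.isidentifier() follows UAX #31-style identifier rules.
        print(candidate, is_graphql_name(candidate), candidate.isidentifier())
```

An emoji fails both checks, while a non-ASCII letter such as é passes the UAX #31-style check but not the ASCII-only Name pattern. Widening SourceCharacter would change neither result.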

andimarek · 2019-07-15T05:30:31Z

I would be interested in reviving this discussion: I don't see a reason for the restriction, and we already see implementations and APIs with emoji in their descriptions (e.g. the GitHub API).

andimarek · 2019-07-16T03:33:29Z

fyi: graphql-ruby supports all unicode chars (cc @rmosolgo) and we decided to do the same for GraphQL Java.

lumberman · 2020-01-10T02:02:31Z

The build fails if you have an emoji in the path.

andimarek · 2020-02-08T05:56:55Z

I created a new issue which outlines proposed changes to the spec to allow for full unicode support: #687

leebyron added a commit that referenced this issue Oct 28, 2016

[Clarification] Escape sequences in strings.

c9b6827

This clears up the language for unicode escape sequences in strings, and adds a conversion table to remove ambiguity from character escape sequences. Suggested by #214

leebyron mentioned this issue Oct 28, 2016

[RFC] Support full Unicode character range #231

Closed

leebyron added the 👻 Needs Champion and 💭 Strawman (RFC 0) labels Oct 2, 2018

jimkyndemeyer mentioned this issue Apr 18, 2019

Support unicode in schema comments (affects GitHub v4 API) JetBrains/js-graphql-intellij-plugin#246

Closed

andimarek mentioned this issue Jul 15, 2019

Unicode characters not supported in graphql descriptions graphql-java/graphql-java#1577

Closed

zombiezen mentioned this issue Dec 24, 2019

Address GraphQL string literals with characters outside BMP zombiezen/graphql-server#34

Open

andimarek mentioned this issue Feb 8, 2020

allow full unicode range #687

Closed

dkbarn mentioned this issue Apr 12, 2021

Unicode characters get transformed into surrogate pairs by graphql.print_ast() graphql-python/graphql-core#128

Closed

leebyron mentioned this issue Apr 13, 2021

RFC: Allow full unicode range #849

Merged

leebyron closed this as completed in #849 Jun 3, 2022
