Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow unicode characters outside the Basic Multilingual Plane #214

Closed
dylanahsmith opened this issue Sep 29, 2016 · 9 comments · Fixed by #849
Closed

Allow unicode characters outside the Basic Multilingual Plane #214

dylanahsmith opened this issue Sep 29, 2016 · 9 comments · Fixed by #849
Labels
👻 Needs Champion RFC Needs a champion to progress (See CONTRIBUTING.md) 💭 Strawman (RFC 0) RFC Stage 0 (See CONTRIBUTING.md)

Comments

@dylanahsmith
Copy link
Contributor

Currently a GraphQL document is only allows a SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/ and EscapedUnicode :: /[0-9A-Fa-f]{4}/ also prevents unicode characters above U+FFFF from being included into a GraphQL string.

Unicode code points are actually in the range 0 to 0x10FFFF. For example, unicode emoji characters like 😀 (U+1F600) have code points above U+FFFF.

Is there any reason why the source document doesn't allow unicode characters above U+FFFF? Or can we remove that restriction? Without that restriction the limitation of the unicode escape doesn't seem problematic.

If supporting a unicode escape for all unicode characters is desired, then one way of handling that is the way swift supports unicode escapes:

An arbitrary Unicode scalar, written as \u{n}, where n is a 1–8 digit hexadecimal number with a value equal to a valid Unicode code point

@chris-morgan
Copy link
Contributor

I also was reading the spec and realised this. Given the paragraph around it:

GraphQL documents are expressed as a sequence of Unicode characters. However, with few exceptions, most of GraphQL is expressed only in the original non‐control ASCII range so as to be as widely compatible with as many existing tools, languages, and serialization formats as possible and avoid display issues in text editors and source control.

It sounds to me an error of ignorance rather than intent.

Rust formerly had \uXXXX and \UXXXXXXXX, but changed to \u{x} (the same as Swift does) some time before Rust 1.0.

JSON (which is probably the main inspiration for the GraphQL syntax) does \uXXXX and uses the abomination that is UTF-16 surrogate pairs as a way of representing higher-order characters, e.g. U+1F600 (😀) is escaped as "\ud83d\ude00" in JSON.

Fortunately you can avoid that insanity by simply expressing values literally. There’s no real need for the escapes anyway once you get past U+001F (\u00XX) and U+0022 (\"). (Unless you deal with combining characters that will attach to a string’s quotation marks, which is fearfully ugly and points out the grammatical problem of parsing by codepoint rather than grapheme cluster, but this is all more advanced stuff that we wish wouldn’t happen in real life, anyway.)

Also currently the escapes listed (EscapedCharacter) match those of JSON. (I think. As to the interpretation, what the GraphQL spec actually says is that \f would be U+0066, “f”, rather than U+000C which is what we all know it’s supposed to be. It’s really badly written.) Given that general tie, supporting \uXXXX might not be a terrible idea, with or without \u{X}.

The definition of the handling of EscapedUnicode is also extremely tacky, with spelling errors, poorly defined terms, &c.:

Return the character value represented by the UTF16 hexidecimal identifier EscapedUnicode.

What does that even mean? Seriously, that doesn’t make sense.

This stuff all suggests to me that it was written by someone with a poor understanding of Unicode. This spec gravely needs both editorial and technical review.


I want to see how different implementations parse:

  • "\ud83d\ude00": nonsensical in the current specification. If GraphQL wants to be like JSON, handling it as UTF-16 surrogate pairs is probably a good idea. If not (please don’t go for surrogate pairs!), the grammar needs to be changed to allow for the supplemental planes (such as via \u{1F600}).
  • "😀": illegal in the current specification, shouldn’t tokenise. However, I hope that implementations accept it and treat it as a string containing the code point U+1F600.

@leebyron
Copy link
Collaborator

leebyron commented Oct 28, 2016

Thanks for bringing this up! Great thought process already happening.

I agree that surrogate pairs is an obtuse API. I'd like to avoid it if possible, though there is one serious upside to consider: it mirrors JSON. That might not be enough to motivate it as the solution, but it certainly shouldn't be discredited.

Here are some action items:

  • Editing of the language section describing Unicode to correct and clarify.
  • Propose expanding the parsable character set to all represented in latest Unicode including supplemental planes.
  • Propose a new escape sequence for string literals (or prescribe to always use Unicode characters directly) for supplemental planes.

leebyron added a commit that referenced this issue Oct 28, 2016
This clears up the language for unicode escape sequences in strings, and adds a conversion table to remove ambiguity from character escape sequences.

Suggested by #214
@leebyron
Copy link
Collaborator

@dylanahsmith and @chris-morgan I'd love your feedback on #231

IvanGoncharov pushed a commit to IvanGoncharov/graphql that referenced this issue Jun 17, 2017
This clears up the language for unicode escape sequences in strings, and adds a conversion table to remove ambiguity from character escape sequences.

Suggested by graphql#214
@leebyron leebyron added 👻 Needs Champion RFC Needs a champion to progress (See CONTRIBUTING.md) 💭 Strawman (RFC 0) RFC Stage 0 (See CONTRIBUTING.md) labels Oct 2, 2018
@Nabellaleen
Copy link

👍 for this spec' !

To be able to build an enum like

enum MOOD {
  😩
  😞
  😕
  😐
  🙂
  😃
}

@chris-morgan
Copy link
Contributor

@Nabellaleen Allowing emoji in an identifier is a completely different thing from allowing it in a source document, which is mostly for the sake of strings. And allowing emoji in identifiers is generally a poor idea; most languages stick with UAX #​31’s definition for identifiers.

@andimarek
Copy link
Contributor

I would be interested to revive this discussion: I don't see a reason for restricting it and we already see implementations and APIs having descriptions with emojis in it (e.g. github API).

@andimarek
Copy link
Contributor

fyi: graphql-ruby supports all unicode chars (cc @rmosolgo) and we decided to do the same for GraphQL Java.

@lumberman
Copy link

Build fail if you have emoji in the path.

@andimarek
Copy link
Contributor

I created a new issue which outlines proposed changes to the spec to allow for full unicode support: #687

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
👻 Needs Champion RFC Needs a champion to progress (See CONTRIBUTING.md) 💭 Strawman (RFC 0) RFC Stage 0 (See CONTRIBUTING.md)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants