Ambiguity on what a character is (regardless of encoding, in two usages) #31
Comments
It is worth noting that the wire format is usually either JSON or Msgpack. For the latter, Msgpack, strings are always UTF-8 encoded, so any handling of UTF-16 is an implementation-specific question. Some languages, Go or Erlang for instance, have direct UTF-8 representations of Unicode built in, so they don't worry too much about the UTF-16 question. For JSON, you can pick any encoding (the RFC mentions a SHALL for UTF-8, UTF-16 and UTF-32), with the default being UTF-8 in most cases. The same thing applies: how the implementation maps this into UTF-16 is specific to that implementation. As for tags, I imagined tags to be from the ASCII character set only, because this is what most implementations will assume. If you opt to extend the format with an emoji as a tag, I don't think it will cause trouble. Any parser has to read either JSON or Msgpack, which means that they should already have internalized their data according to that parser. If you present an emoji, they should in principle be able to read that. However, I think that you will find that some implementations will flounder on this and reject your emoji-tag. In short, I expect tags to pass the regex Finally, as for OCaml, I skipped totally on this in |
@jlouis Thanks for responding. For questions 3–8, restricting tag characters to In addition, even if only
The thing about question 1 is that it's not affected by which particular encoding form (UTF-8 or UTF-16) is used to code string values—it’s about which string values are allowed for For instance, if a Similarly, the read handler of the
The scalar But the scalar cannot be read into Java, C#, or Dart, because Question 1 asks whether Java/C#/etc.’s inability to represent SMP characters with its built-in char types should prohibit SMP characters from being If the answer is, “Yes, SMP characters are prohibited from being values of |
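To make the distinction behind question 1 concrete, here is a minimal Python sketch. The helper names `utf16_code_units` and `is_bmp` are hypothetical and not part of any Transit library. It only illustrates that a supplementary-plane character is one code point but two UTF-16 code units, which is why it cannot fit in a single 16-bit `char`:

```python
def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units that encode `text` (hypothetical helper)."""
    data = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_bmp(ch: str) -> bool:
    """True if `ch` is a single code point in the Basic Multilingual Plane."""
    return len(ch) == 1 and ord(ch) <= 0xFFFF


# "a" (U+0061), "é" (U+00E9), and the emoji U+1F600
for ch in ("a", "\u00e9", "\U0001F600"):
    units = [hex(u) for u in utf16_code_units(ch)]
    print(f"U+{ord(ch):04X} bmp={is_bmp(ch)} utf16_units={units}")
```

The first two characters each need one code unit, while U+1F600 needs a surrogate pair. Whether a Transit reader should reject such a value or map it to another type is exactly what question 1 leaves open.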
The specification refers to “characters” in two separate places:
- The `c` scalar type, which is an extension type of the `s` type.

But there are many definitions of “character”, so both of these usages are ambiguous.
Usage 1
Usage 1 has the following ambiguities/issues:
- […] `char` types allow only Basic Multilingual Plane (BMP) code points (since those are what can be represented by single 16-bit code units), their Transit libraries’ current implementations interpret Transit `c` scalars as those `char` types when reading Transit data.
- […] `c` values. SMP characters include, in particular, many symbols and emoji in use, such as 𝄫, 😀, 🐴, and 📞, and are supported by languages such as Go (with its native `rune` type), as well as quasi-supported by any language that uses strings for characters (JavaScript, Python, Ruby, etc.). Support for SMP characters may be especially important in internationalization projects (indeed, for many of my own projects).
- […] `Character` and `String` types, for instance, use eight-bit bytes, which essentially covers only ASCII. Ruby’s `String` class also splits into eight-bit bytes, but since it has no concept of a “`Character`” class, this is mostly moot anyway.
- […] `char` type is BMP-only / strongly coupled to 16 bits), does not allow Supplementary Plane characters.

So, question 1 is: What values is the
`c` scalar Transit type allowed to contain (by its read and write handlers)? I see at least three options:

- Option A: A `c` value is any single Unicode code point (from U+0 to U+10FFFF). This allows characters in Supplementary Planes such as emoji to be easily interchanged between programming languages whose “character” types support them (Python, sort of Ruby and JavaScript). However, it will also require Transit readers in languages with 16-bit, BMP-only `char` types to throw errors or use other data types if they encounter any SMP characters. (Note that this is already a problem in general: for instance, if a Transit UUID is invalid, a runtime error may still occur in some readers.)
- Option B1: A `c` value is any single 16-bit, BMP-only code point (from U+0000 to U+FFFF). No Transit program can interchange SMP characters such as emoji using the core `c` type; however, 16-bit-char programming languages such as Java and C# are guaranteed to accept any `c` value. (Of note is that, in this case, people who also need to use SMP characters can define an extension type, but unfortunately this ceases to be universal.)
- Option B2: […] “`C`” or “`y`” or something, which extends `s` and which represents a single, potentially Supplementary Unicode code point between U+0 and U+10FFFF. Languages that map the already-existing `c` type to their 16-bit-char types would map this new type to their string types or something.

I personally anticipate option B1 to be chosen, since it's what Clojure itself does and takes the least work,
but I'm still throwing in options A and B2 in the hopes that they too would be considered. I now prefer that chars be clearly equated to UTF-16 code units after reading the Unicode FAQs, which discuss preferring UTF-16 code units for low-level indexing and strings for everything else. Any option would create more work for someone, but the question is which one is most worth it, and the specification should probably clarify this matter in any case.

Question 2: If the answer to question 1 is “16-bit, BMP code points only / no SMP characters allowed in Transit `c` values”, then should Transit writers (in those languages that support SMP characters) ensure that no Supplementary code points are ever written into Transit data as Transit `c` values?

Usage 2
For usage 2, there are multiple questions to be clarified:
Question 3: Are only 16-bit/BMP code points or any Unicode code point allowed to be used as scalar-type tags?
Question 4: If a single SMP character is used as a type tag, is it a scalar tag (because it is a single Unicode code point) or is it a composite tag (because it is two 16-bit surrogate code units)? (This is essentially equivalent to question 2.)
Question 5: Are whitespace characters allowed in type tags?
Question 6: Are control characters allowed in type tags?
Question 7: Are noncharacter code points (such as U+FDD0 and U+FFFE) allowed in type tags?
Question 8: If the answer to question 3 is “only 16-bit/BMP code points for scalar tags” or the answer to any of questions 5–7 is “no”, then should writers ensure that the prohibited code points are never used?
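As an illustration of what questions 3–8 would mean for an implementation, here is a minimal Python sketch. The function names (`utf16_length`, `is_noncharacter`, `describe_tag`) are hypothetical and do not come from any Transit library, and the sketch does not assume any particular answer: it only reports the properties of a tag that the questions ask about, so that a writer could reject a tag once the specification says which ones are prohibited.

```python
import unicodedata


def utf16_length(text: str) -> int:
    """Number of UTF-16 code units needed to encode `text`."""
    return len(text.encode("utf-16-le")) // 2


def is_noncharacter(cp: int) -> bool:
    """True for Unicode noncharacters: U+FDD0..U+FDEF and any U+nFFFE / U+nFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def describe_tag(tag: str) -> dict:
    """Report the tag properties that questions 3-7 ask about (illustrative only)."""
    return {
        "code_points": len(tag),            # length as Python sees it
        "utf16_units": utf16_length(tag),   # length as Java/C# would see it
        "has_supplementary": any(ord(c) > 0xFFFF for c in tag),
        "has_whitespace": any(c.isspace() for c in tag),
        "has_control": any(unicodedata.category(c) == "Cc" for c in tag),
        "has_noncharacter": any(is_noncharacter(ord(c)) for c in tag),
    }


print(describe_tag("\U0001F600"))
```

For a tag consisting of the single emoji U+1F600, `code_points` is 1 but `utf16_units` is 2, which is the scalar-versus-composite ambiguity raised in question 4.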
These are fastidious, technical questions, but I think they're important for disambiguating Transit's behavior. Question 1 especially affects how people like me use Transit. Thanks!