JSON.stringify UTF-8 vs. UTF-16 #2387

Open
jmm opened this issue Apr 15, 2021 · 6 comments

Comments

@jmm
Contributor

jmm commented Apr 15, 2021

Hello,

Since ES2019 the Introduction section says:

requiring that JSON.stringify return well-formed UTF-8 regardless of input

I don't think that's quite what it means to say, though, is it? I think it means that JSON.stringify returns well-formed UTF-16 (and that, as a result, the content could be encoded as UTF-8).

@mathiasbynens
Member

It should say that it returns well-formed Unicode strings.
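
A minimal sketch of what "well-formed" means in practice, assuming an ES2019+ engine. The helper `isWellFormedUnicode` is hypothetical (written for this illustration, not taken from the spec); it only checks that every surrogate code unit is part of a pair.

```javascript
// Hypothetical helper (not from the spec): true if `s` has no unpaired surrogates.
function isWellFormedUnicode(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // Leading surrogate: must be followed by a trailing surrogate.
      const next = s.charCodeAt(i + 1); // NaN past the end of the string
      if (!(next >= 0xdc00 && next <= 0xdfff)) return false;
      i++; // skip the trailing half of the pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      return false; // trailing surrogate with no leading surrogate before it
    }
  }
  return true;
}

const lone = "\uD800";         // unpaired leading surrogate: ill-formed
const paired = "\uD834\uDF06"; // surrogate pair for U+1D306: well-formed

const loneJson = JSON.stringify(lone);     // lone surrogate is escaped as ASCII
const pairedJson = JSON.stringify(paired); // the pair passes through unescaped

console.log(isWellFormedUnicode(lone));     // false
console.log(isWellFormedUnicode(loneJson)); // true
console.log(loneJson);                      // "\ud800"
```

The input is ill-formed, but the returned string contains only the ASCII escape sequence, so it is well-formed regardless of how it is later encoded.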

@jmm
Contributor Author

jmm commented Apr 15, 2021

Thanks for the feedback. Should the JSON.stringify section not say this then?:

The stringify function returns a String in UTF-16 encoded JSON format

@jmdyck
Collaborator

jmdyck commented Apr 15, 2021

(The intro sentence was added in 362cb10, presumably to summarize the effect of merging PR #1396.)

@mathiasbynens
Member

> Thanks for the feedback. Should the JSON.stringify section not say this then?:
>
> The stringify function returns a String in UTF-16 encoded JSON format

I think so. There's nothing special about the encoding of the returned string — it’s just a JavaScript string, like other JavaScript strings. (And yes, JavaScript does treat strings kind of like UCS-2/UTF-16, but that's not special for JSON.stringify’s return values.) What’s special is that the returned string is guaranteed to be well-formed Unicode.
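
To illustrate the point that the result is an ordinary JavaScript string, here is a small sketch. It assumes a runtime that exposes `TextEncoder` and `TextDecoder` as globals (modern browsers and recent Node.js versions); UTF-8 only enters the picture when the string is explicitly encoded to bytes.

```javascript
const json = JSON.stringify("\u{1D306}"); // one astral code point, U+1D306

console.log(typeof json); // "string"
console.log(json.length); // 4: two quotes plus a surrogate pair (UTF-16 code units)

// UTF-8 only appears when the string is explicitly encoded to bytes.
const utf8 = new TextEncoder().encode(json);
console.log(utf8.length); // 6: two 1-byte quotes plus one 4-byte sequence

// Encoding a raw lone surrogate is lossy: TextEncoder substitutes U+FFFD.
const rawBytes = Array.from(new TextEncoder().encode("\uD800"));
console.log(rawBytes); // [ 239, 191, 189 ]

// The escaped form produced by JSON.stringify is plain ASCII, so it survives.
const escapedBytes = new TextEncoder().encode(JSON.stringify("\uD800"));
const roundTripped = new TextDecoder().decode(escapedBytes);
console.log(roundTripped === JSON.stringify("\uD800")); // true
```

This is the sense in which the output "could be encoded as UTF-8": because it is well-formed, encoding it does not force any replacement characters.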

@jmm
Contributor Author

jmm commented Apr 16, 2021

OK, thanks. I'll preface this by acknowledging that I'm not tremendously well versed in this topic (though I'm significantly more informed than I was a few days ago, thanks in no small part to your "Well-formed JSON.stringify" proposal and "JavaScript’s internal character encoding" article).

Those are good points and I'm mostly in alignment with you. I've thought about it further, though, and I think it probably does make sense to be more explicit about what it returns than "JSON string" -- whether by referencing "UTF-16" or "well-formed Unicode".

"6.1.4 The String Type" seems a bit vague. It says:

[...] operations that further interpret String contents as sequences of Unicode code points encoded in UTF-16 must account for ill-formed subsequences. Such operations apply special treatment [...]
[...]
A code unit that is a leading surrogate or trailing surrogate, but is not part of a surrogate pair, is interpreted as a code point with the same value.

So I don't read that as saying those operations will necessarily return well-formed UTF-16 / Unicode.
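
A short sketch of the behavior the quoted passage describes, using only standard String methods: an unpaired surrogate is surfaced as a code point with the same value, so code-point-level operations can hand back ill-formed data.

```javascript
const lone = "a\uD800b";     // contains an unpaired leading surrogate
const pair = "\uD834\uDF06"; // a proper surrogate pair for U+1D306

// The unpaired surrogate is interpreted as a code point with the same value.
console.log(lone.codePointAt(1).toString(16)); // "d800"

// A well-formed pair is combined into a single code point.
console.log(pair.codePointAt(0).toString(16)); // "1d306"

// String iteration works by code point and yields the lone surrogate as-is.
const points = Array.from(lone, (ch) => ch.codePointAt(0));
console.log(points.length);             // 3
console.log(Array.from(pair).length);   // 1
```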

On another note, "Well-formed JSON.stringify" says:

[...] consumers may still reject input that specifies strings including Unicode code points that are not scalar values [...], but those that accept it must have mechanisms for dealing with unpaired surrogates (as mentioned in the specification of JSON).

That references RFC 8259, which says:

The behavior of software that receives JSON texts containing [unpaired surrogates] is unpredictable; for example, implementations might return different values for the length of a string value or even suffer fatal runtime exceptions.

(The ES spec references ECMA-404, which doesn't seem to say anything like that.)

Taking both of those things into account, I now think the most useful thing would be to say something to this effect: JSON.stringify returns a UTF-16 encoded (or well-formed Unicode) string regardless of the presence of unpaired surrogates in the input, but the JSON text still represents ill-formed Unicode text containing unpaired surrogates, and the results of parsing it (other than via JSON.parse) may be unpredictable.
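
The distinction drawn above can be sketched as follows, assuming an ES2019+ engine. The JSON text itself is plain ASCII, yet the value it represents still contains the unpaired surrogate, which JSON.parse reproduces exactly. How parsers in other languages handle the same text is not shown here.

```javascript
const input = "\uD800"; // unpaired leading surrogate
const text = JSON.stringify(input);

// The serialized text contains only ASCII characters.
const isAscii = /^[\x00-\x7F]*$/.test(text);
console.log(isAscii); // true

// JSON.parse turns the escape back into the original unpaired surrogate.
const parsed = JSON.parse(text);
console.log(parsed === input);                  // true
console.log(parsed.length);                     // 1
console.log(parsed.charCodeAt(0).toString(16)); // "d800"
```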

@mathiasbynens
Member

cc @gibson042

4 participants