Specify how document.cookie diverges from [COOKIES] RFC #804

Open
domenic opened this issue Mar 4, 2016 · 30 comments
Labels: compat (Standard is not web compatible or proprietary feature needs standardizing), normative change, topic: cookie

Comments

@domenic
Member

domenic commented Mar 4, 2016

Currently the spec says

the user agent must act as it would when receiving a set-cookie-string for the document's address via a "non-HTTP" API, consisting of the new value encoded as UTF-8.

However, in the real world things like document.cookie = "foo" work and have an effect. There are probably many other possibilities; in general the RFC just has a grammar that things might not match, whereas I imagine browsers just accept anything and try to make sense of it, even if it fails to match the grammar.
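
To make the mismatch concrete, here is a minimal sketch contrasting the rule in Section 5.2 of RFC 6265 (a name-value pair without `=` causes the whole set-cookie-string to be ignored) with one possible lenient reading. The function name `parseNameValuePair` and its `lenient` option are hypothetical, and the lenient branch (treating the input as the value of a nameless cookie) is only an assumption about one plausible behavior, not something confirmed for any particular browser.

```javascript
// Hypothetical helper: parses only the name-value-pair part of a set-cookie-string.
function parseNameValuePair(setCookieString, { lenient = false } = {}) {
  // Everything before the first ";" is the name-value-pair; attributes follow it.
  const semicolon = setCookieString.indexOf(';');
  const pair = semicolon === -1 ? setCookieString : setCookieString.slice(0, semicolon);

  const equals = pair.indexOf('=');
  if (equals === -1) {
    // RFC 6265, Section 5.2: without "=", the set-cookie-string is ignored entirely.
    if (!lenient) return null;
    // ASSUMPTION: a lenient parser keeps the input as the value of a nameless cookie.
    return { name: '', value: pair.trim() };
  }

  // The RFC strips leading and trailing WSP; trim() is a close approximation.
  const name = pair.slice(0, equals).trim();
  const value = pair.slice(equals + 1).trim();
  // RFC 6265, Section 5.2: an empty name also causes the string to be ignored.
  if (name === '' && !lenient) return null;
  return { name, value };
}

console.log(parseNameValuePair('foo'));                    // strict reading
console.log(parseNameValuePair('foo', { lenient: true })); // lenient reading
console.log(parseNameValuePair('a=b; Path=/'));
```

Which lenient interpretation (nameless cookie versus a name with an empty value) matches real browsers is exactly the open question here.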

@bsittler noticed this while working on some service worker cookie stuff, and previously it has come up in the jsdom project and its related tough-cookie helper:

@Sebmaster and @inikulin led the charge for this in jsdom, so maybe they could help us spec the correct behavior for how document.cookie parses cookies? Alternately, looking at open-source browser code would get us pretty far.

This might be a compat issue if everyone hasn't managed to magically converge on a single behavior despite the lack of precise spec. Tentatively tagging as such for now.

@domenic added the "normative change" and "compat" (Standard is not web compatible or proprietary feature needs standardizing) labels Mar 4, 2016
@annevk
Member

annevk commented Mar 5, 2016

Paging @mikewest.

@inikulin
Member

inikulin commented Mar 5, 2016

I'd love to help!

Here is what we can do:

  • Create a test runner for the IETF test suite that produces output in a machine-readable format. Currently it can run only individual tests, or requires dev builds of the browsers in some cases, which renders it unusable for testing IE and Edge. It also can't produce machine-readable reports at the moment. We will need them to aggregate and analyze results across browsers later. I'm already working on it.
  • Run tests in all major browsers:
    • Chrome
    • Safari
    • Firefox
    • IE
    • Edge
  • Using the aggregated test failure info, we can build a table in this format:

    Test case / browser | Expected | Chrome | Firefox | Safari
    "foo="              | ""       | "foo"  | "foo"   | ""

  • Triage failures into groups, e.g. if a test fails in the majority of the browsers, consider that de facto behavior and add this difference to the spec. For the minor cases, consult with the developers or search the issue trackers to find the motivation behind the behavior.
  • Modify the IETF test suite along the way to align it with the proposed behavior. Make it the default test suite for the spec.
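
The aggregation and triage steps above could be sketched roughly as follows. The result shape (`{ testCase, expected, actual }`), the function name `triage`, the group names, and the "more than half of the tested browsers agree on the same unexpected value" threshold are all assumptions for illustration; this is not the actual runner. The sample row reuses the values from the example table.

```javascript
// Hypothetical triage over aggregated results; the input shape is an assumption.
function triage(results) {
  const groups = { passing: [], deFacto: [], needsInvestigation: [] };
  for (const { testCase, expected, actual } of results) {
    const values = Object.values(actual);
    const failures = values.filter((value) => value !== expected);
    if (failures.length === 0) {
      groups.passing.push(testCase);
      continue;
    }
    // Count how many browsers agree on each unexpected value.
    const counts = new Map();
    for (const value of failures) counts.set(value, (counts.get(value) || 0) + 1);
    const mostAgreed = Math.max(...counts.values());
    // ASSUMPTION: "majority" means more than half of the tested browsers.
    if (mostAgreed > values.length / 2) groups.deFacto.push(testCase);
    else groups.needsInvestigation.push(testCase);
  }
  return groups;
}

const sample = [
  { testCase: 'foo=', expected: '', actual: { Chrome: 'foo', Firefox: 'foo', Safari: '' } },
];
console.log(triage(sample));
```

Cases landing in `needsInvestigation` are the ones that would go to the developers or the issue trackers.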

I will try to provide you with test results sometime next week.

@annevk
Member

annevk commented Mar 6, 2016

Cool! Another thing here worth checking is <meta http-equiv=set-cookie>. If these invalid values still result in HTTP headers, it's likely the RFC will need to be updated somehow.

@inikulin
Member

inikulin commented Mar 6, 2016

@annevk AFAIK browsers use the same code for all cookie parsing scenarios. Spec violations in the document.cookie setter also show up when you set a cookie via an HTTP header. I'm pretty sure we will have the same results with <meta>.

@annevk
Member

annevk commented Mar 6, 2016

I see, in that case it seems like something @mikewest and @mnot should be solving in the RFC. Your testing will still be useful, obviously, but given the scope of the problem it does not seem like something that needs to be addressed in the HTML Standard. Although I can understand if we need to make adjustments for a revised RFC that does handle this properly.

@inikulin
Member

inikulin commented Mar 6, 2016

Although I can understand if we need to make adjustments for a revised RFC that does handle this properly.

So, we will continue the discussion here for now, and once we have some data and analysis we will ping the IETF guys, I guess?

@mnot
Member

mnot commented Mar 7, 2016

Very good timing. We're about to start opening up the cookie RFC, so yes do ping us when you have some results. Any idea how long that will be?

@annevk
Member

annevk commented Mar 7, 2016

@inikulin, yeah, we'll keep this open until the issue is resolved. @mnot, @inikulin mentioned earlier he was hoping to have something this week.

@inikulin
Member

Voilà :neckbeard:
http://inikulin.github.io/cookie-compat/

@domenic
Member Author

domenic commented Mar 10, 2016

OMG, this is amazing!!

@bsittler

@inikulin this is really sobering. Thank you! What was the effective document charset for the test page?

@inikulin
Member

@bsittler UTF-8

@inikulin
Member

FYI test runner sources are here: https://github.com/inikulin/cookie-compat

@inikulin
Member

Thank you guys for all the kind words, I hope you will find it useful.

Further steps:

  • Add expires= date parsing tests. They are in a separate test suite and require conversion. (I just realized that there is no way to access the parsed expiration date.)
  • Currently we don't have a reference implementation, which bothers me. I will try to create one based on tough-cookie. Actually, tough-cookie is implemented nearly per spec, with just some minor relaxations (e.g. the symbol restrictions for the token are ignored).
  • Report issues for the obvious bugs to implementors and reference them in the table.

@mnot
Member

mnot commented Mar 11, 2016

Wow indeed, really great stuff!

It seems to me that the first 17 tests could be brought into (at least rough) interop with a fairly simple spec change to Section 5.2. The remaining tests demonstrate enough interop that they look more like browser bugs to me.

That's assuming that all of the browsers don't want to fix the underlying bugs in the first 17 tests, of course. It'd be very useful to know how much content on the Web currently relies upon this behaviour, but gathering that data is likely to be problematic...

If we do want to change the spec, someone will need to write up an Internet-Draft describing the proposed changes. I can help with that.

@inikulin would you mind pinging the HTTP-WG about this on its mailing list https://lists.w3.org/Archives/Public/ietf-http-wg/? If you don't want to subscribe, I can forward a message for you, or you could even just open up a bug at https://github.com/httpwg/http-extensions/issues. I just want to make sure that you get credit for this awesome work.

@inikulin
Member

@mnot Done: httpwg/http-extensions#159

@bsittler

@inikulin what was the system codepage for Edge and IE? Have you tried changing it? If https://stackoverflow.com/questions/1969232/allowed-characters-in-cookies is to be believed, non-ASCII characters may "work" in IE when they are present in the system codepage, where "work" means they will be wire-encoded in that codepage (never UTF-8, since Windows system codepage can't be set to 65001) but exposed to JavaScript using the corresponding Unicode characters. I'd be especially interested to see the results for systems with larger-coverage (CJK?) or non-1252 system codepages.

Likewise, have you tried server-generated cookies with encodings other than UTF-8, e.g. latin-1?

@inikulin
Member

Nope, I haven't adjusted the Windows code page for the tests. I'll try to run with code pages that have a bigger character set tomorrow at work, because I don't currently have access to a Windows machine.

@inikulin
Member

Likewise, have you tried server-generated cookies with encodings other than UTF-8, e.g. latin-1?

Nope

@bsittler

bsittler commented Jun 21, 2016

One more thought: it may be worth checking both reading and writing behavior of the backslash \u005c \ and yen sign \u00a5 ¥ in cookies on the server side, from HTML (meta http-equiv=set-cookie) and from document.cookie across Latin 1, UTF-8 and Shift JIS/CP 932 document encodings and with both US English and Japanese system codepages in effect. It's a large matrix, but it may uncover some useful information about how browsers currently interoperate (or don't) in the presence of incompatible character encodings. In particular it would be good to know whether backslash is reliably round-tripped under all these circumstances and whether or not it is ever remapped to a non-ASCII character.

Same question for tilde \u007e ~ and wave dash \u301c actually.

(I'm asking these oddly specific questions because I'm wondering whether all of printable ASCII other than semicolon is actually safe in cookie values across browsers) Edit: names too (barring equal sign of course)
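
For reference, the question above can be framed as the gap between two predicates. The first follows the `cookie-value` grammar from Section 4.1.1 of RFC 6265, which excludes control characters, whitespace, double quotes, comma, semicolon, and backslash. The second is the broader "printable ASCII other than semicolon" set being asked about. Both function names are hypothetical, and neither predicate says anything about what a given browser actually round-trips.

```javascript
// cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )   (RFC 6265, 4.1.1)
// cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
const RFC_COOKIE_VALUE =
  /^(?:[\x21\x23-\x2B\x2D-\x3A\x3C-\x5B\x5D-\x7E]*|"[\x21\x23-\x2B\x2D-\x3A\x3C-\x5B\x5D-\x7E]*")$/;

function matchesRfcCookieValue(value) {
  return RFC_COOKIE_VALUE.test(value);
}

// ASSUMPTION: "printable ASCII" means U+0020 through U+007E, space included.
function isPrintableAsciiWithoutSemicolon(value) {
  return /^[\x20-\x3A\x3C-\x7E]*$/.test(value);
}

// Print both verdicts for the characters discussed above.
for (const ch of ['\\', '~', ',', ';', '\u00a5']) {
  console.log(JSON.stringify(ch), matchesRfcCookieValue(ch), isPrintableAsciiWithoutSemicolon(ch));
}
```

Backslash is the interesting case: it is outside the RFC grammar but inside the broader set, so it only matters if browsers are in fact more permissive than the grammar.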

Edit: Also, in the meta http-equiv case, are the results the same for raw document-charset characters vs. HTML-entified versions?

more edit: Yet another IE-specific question: does document.cookie in IE (and Edge?) round-trip Unicode when the characters are first converted to bytes? e.g. document.cookie = unescape(encodeURIComponent('test=三猿🙈🙉🙊')) and decodeURIComponent(escape(document.cookie)) [or the (better) TextDecoder/TextEncoder equivalents except there's no TextDecoder/TextEncoder in IE]
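
The byte-string trick in the last edit can be written without the deprecated `escape`/`unescape` pair wherever `TextEncoder` and `TextDecoder` are available (so not in IE, as noted). A minimal sketch; the helper names `toUtf8ByteString` and `fromUtf8ByteString` are hypothetical:

```javascript
// Encode text as UTF-8, then expose each byte as one character in U+0000..U+00FF.
// Intended to match unescape(encodeURIComponent(text)) for well-formed input.
function toUtf8ByteString(text) {
  const bytes = new TextEncoder().encode(text);
  let out = '';
  for (const byte of bytes) out += String.fromCharCode(byte);
  return out;
}

// Reverse step: treat each character as one byte and decode the bytes as UTF-8.
// Assumes every character in byteString is at most U+00FF.
function fromUtf8ByteString(byteString) {
  const bytes = Uint8Array.from(byteString, (ch) => ch.charCodeAt(0));
  // fatal: throw on invalid UTF-8; ignoreBOM: keep a leading U+FEFF in the output.
  return new TextDecoder('utf-8', { fatal: true, ignoreBOM: true }).decode(bytes);
}

const original = 'test=三猿🙈🙉🙊';
const wire = toUtf8ByteString(original);
console.log(fromUtf8ByteString(wire) === original); // true
```

In a page, `wire` is what would be assigned to `document.cookie`, and `fromUtf8ByteString` would be applied to the string read back. Whether the browser preserves those characters in between is the thing being tested.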

@inikulin
Member

@bsittler

I'd be especially interested to see the results for systems with larger-coverage (CJK?) or non-1252 system codepages.

I've added results for IE and Edge with system codepage 950 (big5) and 932 (shift_jis): http://inikulin.github.io/cookie-compat/ (spoiler: it didn't work out)

Regarding #804 (comment): if you don't mind, I will work on it later, because I'm really running out of spare time currently. I've created an issue in cookie-compat for this task so I don't forget about it: inikulin/cookie-compat#3

@bsittler

Thank you very much


@bsittler

bsittler commented Jun 27, 2016

On Windows 7 with a US English system locale running IE 9, JavaScript-written cookies subsequently read from JavaScript seem to reliably round-trip characters whose ISO 8859-1 encodings fall in the ISO 2022 GR range (0xA0 ... 0xFF) in addition to most of printable ASCII. This seems to be the case regardless of the document character encoding. Additionally, I tried a few characters whose Windows-1252 encodings fall in the ISO 2022 C1 range (0x80 ... 0x9F) and they appear to round-trip successfully, too. Characters not representable in Windows-1252 are apparently converted to question mark (other printable characters) or dropped (ASCII control characters.)

I have not yet tested with a different system locale.

I suspect that cookies are simply serialized in the IE cookie jar using the default codepage of the system locale.

@bsittler

Indeed, after switching the system locale to Japanese (with "ANSI" and "OEM" codepages both switched to 932) and rebooting, cookies behave exactly as if they are being stored in CP932 (approximately Shift JIS), with characters like Euro sign \u20ac converted to question mark and Japanese text preserved. This is independent of document charset, so the same Japanese text written by script running in a Shift JIS document is readable by script running in a UTF-8 document without mangling, and vice versa.

@annevk
Member

annevk commented Jun 28, 2016

Wow, that is not something we want to standardize upon. How would that even work with code points that cannot be represented by the encoding?

@bsittler

It doesn't. They are converted to question marks (in other words, data is lost). Because it's based on the system "ANSI" code page, it is however somewhat likely that text entered by the user in the system locale's primary language will round-trip successfully from script to script across page loads. Compatibility with other modern browsers, however, seems to be zero for non-ASCII text.


@bsittler

bsittler commented Jun 28, 2016

Just did a little further testing, and verified that even with explicit UTF-8 or UTF-16 (little-endian) byte-order marks in the cookie name and/or cookie value, IE and Edge still always interpret the cookie according to the system "ANSI" codepage. Non-ASCII cookie names and values set by the server are sent back to the server without mangling, so there's nothing to prevent a server from storing UTF-8 in a cookie (e.g. UTF-8 cookie names/values containing Ő [\xc5\x90] round-trip server-to-server via US English-locale Edge even though \x90 is nominally unmapped in Windows code page 1252), however scripts running in IE always misinterpret such cookies according to the system ANSI codepage (in this case the nominally unmapped byte is in fact exposed as-is to script, as '\x90'.)

Also, attempts to set cookies from scripts with "ANSI" code page-unrepresentable characters in their names and/or values do not always convert those to question marks - sometimes a different fallback is used. For instance, with a US English system locale document.cookie = 'Ő=Ő' results in O=O instead. I suspect it's using the default substitutions from WideCharToMultiByte.

@domenic
Member Author

domenic commented Jun 28, 2016

I'm doubtful that further testing of IE/Edge's quirks is going to be helpful. We know they do weird stuff they would never put into a web spec.

@bsittler

bsittler commented Jun 29, 2016

Right, I was merely attempting to assess the compatibility risk of having the new API only support UTF-8 (and possibly also "raw byte array") interpretation for cookie data, which would be incompatible (in Edge) with the system "ANSI" codepage interpretation in document.cookie and <meta http-equiv="set-cookie" ...> but consistent with other browsers.

@Ms2ger
Member

Ms2ger commented Aug 15, 2017

One "fun" thing I noticed today: document.cookie = 'foo' will add a trailing = in macOS WebKit, but not GTK+ WebKit.
