-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
unicode fragments #150
Comments
Ideally, I think we'd all just switch to percent encoding all non-ASCII characters in all fragments so the output of URL parsing is always ASCII. Would everyone be willing to make such a change? |
https://bugs.webkit.org/show_bug.cgi?id=163153 Reviewed by Tim Horton. Source/WebCore: This is needed to keep compatibility with data URLs with non-ASCII characters after a '#' which works in Chrome, Firefox, and Safari, while maintaining compatibility with Chrome, IE, and Edge which keep non-ASCII characters in the fragments of special URLs. This was proposed to the spec in whatwg/url#150 Covered by new API tests. * platform/URLParser.cpp: (WebCore::URLParser::syntaxViolation): Removed assertion because we now have fragments that need percent encoding but are all ASCII. (WebCore::URLParser::fragmentSyntaxViolation): (WebCore::URLParser::parse): Tools: * TestWebKitAPI/Tests/WebCore/URLParser.cpp: (TestWebKitAPI::TEST_F): git-svn-id: http://svn.webkit.org/repository/webkit/trunk@206942 268f45cc-cd09-0410-ab3c-d52691b4dbfc
@valenting? I think @dbaron also wanted us to always percent-encode, to keep URLs "ASCII safe" and easier to copy-and-paste (due to whitespace and such). |
I support percent encoding all characters in the hash. It turns out percent encoded strings are much easier to manage and parse. |
…s in fragment https://bugs.webkit.org/show_bug.cgi?id=163287 Reviewed by Brady Eidson. Source/WebCore: Based on discussion in whatwg/url#150 If that discussion decides to keep the spec as-is (which keeps non-ASCII characters in the fragment to match IE and Edge's behavior, which Chrome has followed for special schemes) then we can revert this change later after enabling the URL parser. Making this change keeps behavior matching Safari and Firefox, as well as Chrome's handling of non-special schemes, such as data URLs. Covered by updated API tests. * platform/URLParser.cpp: (WebCore::URLParser::appendToASCIIBuffer): (WebCore::URLParser::copyURLPartsUntil): (WebCore::URLParser::syntaxViolation): (WebCore::URLParser::currentPosition): (WebCore::URLParser::parse): (WebCore::URLParser::fragmentSyntaxViolation): Deleted. * platform/URLParser.h: No more non-ASCII characters in canonicalized URLs. Tools: * TestWebKitAPI/Tests/WebCore/URLParser.cpp: (TestWebKitAPI::TEST_F): git-svn-id: http://svn.webkit.org/repository/webkit/trunk@207152 268f45cc-cd09-0410-ab3c-d52691b4dbfc
It seems HTML already decodes fragment identifiers in https://html.spec.whatwg.org/multipage/browsers.html#the-indicated-part-of-the-document so presumably that would keep working. But I suspect there might be websites that depend on fragment not being encoded? @achristensen07 did you verify that all APIs for URLs in WebKit return the URL with the fragment encoded as well? Or in order to have URLs with the fragment encoded would we need to add special casing to APIs? |
In WebKit, the fragments are encoded/serialized as they are parsed, so there is no time in which there will be an unencoded fragment in a URL |
@achristensen07 that's understood, but my question is if there's any APIs such as |
I do not think we ever percent-decode fragments in WebKit. Going to the indicated part of the document with percent-encoded fragments or non-ASCII fragments which we percent encode doesn't work in WebKit right now, which we need to fix. |
In particular, the argument I made was in Mozilla bug 1148861 (comment 0 and comment 25). |
So Gecko seems to be encoding too (although the address bar decodes). So that leaves Chrome and Edge. @foolip who's a good contact for URL parsing in Chrome? |
@sleevi I guess the question is then whether you'd consider always encoding fragments. (Or is that already happening and is the Blink side doing magic in the API layer?) |
@annevk Yeah, I'm planning on looking into this today and trying to get an answer. I seem to recall that it's similar to Gecko, and so I suspect our URL substrate is decoding, and any encoding would need to be done at the Blink glue to that (the KURL and bindings side). On the upside, it makes it easier to change when it would only affect Chrome (our URL code is shared with a growing number of Google products due to Cronet embedding), the downside is it may require a more extensive audit of where we expose those URLs to a web-visible form. I'm not hearing a request to change address bar behavior, but did I misunderstand the remarks re: FF? My gut is that it seems to make sense to change that too, to ensure authors aren't going to copy/paste? |
The address bar is a hard problem. If we always show everything encoded, we basically make it useless for some set of pages and people, e.g., Wikipedia articles with titles in Kanji visited by folks that can read Japanese. On the other hand, I don't think most people should really be looking at the address bar beyond the domain (and the rest can be hidden until you need to copy), but folks are going to disagree on that one and only Safari is really that far ahead right now. There's also the aspect that we cannot dictate the address bar since it's UX. And you could potentially show decoded, but copy encoded, etc. I think for this issue we should consider it out-of-scope, and just consider the effects we can directly observe through JavaScript and following links in documents. |
@sleevi any update? I'm inclined to go with always encoding the fragment since it would be the most consistent (URL members would only ever contain ASCII) and it appears that's what Firefox and Safari get away with. |
I created a PR to change the specification here. I'd appreciate review. @achristensen07 do you know if this is already tested? I can change the tests. |
FWIW, I started working on a patch for the tests and I noticed that Firefox has a few differences from my proposed change. It does not seem to drop U+0000 and it seems to preserve and encode newline and tab code points. Not sure that's what we want to keep given what we specified for other setters. |
This is tested by these existing tests: |
web-platform-tests/wpt#4298 has an update to the tests. I want to be sure though about what we want to encode, although I guess this change is still a step in the right direction so maybe we should make it regardless. @valenting thoughts? Should we encode spaces as well? Is it okay to drop U+0000? |
Right now Firefox drops \r\n\t in the URL constructor, and trims U+0000 to U+0020. |
Chrome and Safari don't encode spaces. I propose using the simple encode set |
@valenting the consistency we have now is that \r\n\t are always dropped. And trimming of U+0000 to U+0020 happens only for the full URL (not setters). Note that it should also happen for The main problem with not encoding spaces is that it's harder to copy-and-paste those URLs and drop them elsewhere. Since most tooling assumes URLs do not contain spaces. Anyway, if the test changes look okay I suggest we land this and then if there are still concerns we can fiddle with the details in a new issue. |
👍 |
Related issue #125 |
Thanks @zcorpan, we can use that as the follow up for spaces. |
Now all components of a URL can be represented using ASCII strings or integers. Tests: web-platform-tests/wpt#4298. Fixes #150.
This patch contains the following changes: url: make IPv4 parser more spec compliant * Return int64_t from ParseNumber to prevent overflow for valid big numbers * Don't throw when there are more than 4 parts (it cannot be an IP address) * Correctly interpret the address and don't always throw when there are numbers > 255 Ref: https://url.spec.whatwg.org/#concept-ipv4-parser Fixes: nodejs#10306 url: percent encode fragment to follow spec change Ref: whatwg/url#150 Ref: whatwg/url@373dbed url: fix URL#search setter The check for empty string must be done before removing the leading '?'. Ref: https://url.spec.whatwg.org/#dom-url-search url: set port to null if an empty string is given This is to follow a spec change. Ref: whatwg/url#113 url: fix parsing of paths with Windows drive letter test: update WHATWG URL test fixtures PR-URL: nodejs#10317 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
This patch contains the following changes: url: make IPv4 parser more spec compliant * Return int64_t from ParseNumber to prevent overflow for valid big numbers * Don't throw when there are more than 4 parts (it cannot be an IP address) * Correctly interpret the address and don't always throw when there are numbers > 255 Ref: https://url.spec.whatwg.org/#concept-ipv4-parser Fixes: #10306 url: percent encode fragment to follow spec change Ref: whatwg/url#150 Ref: whatwg/url@373dbed url: fix URL#search setter The check for empty string must be done before removing the leading '?'. Ref: https://url.spec.whatwg.org/#dom-url-search url: set port to null if an empty string is given This is to follow a spec change. Ref: whatwg/url#113 url: fix parsing of paths with Windows drive letter test: update WHATWG URL test fixtures PR-URL: #10317 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
This patch contains the following changes: url: make IPv4 parser more spec compliant * Return int64_t from ParseNumber to prevent overflow for valid big numbers * Don't throw when there are more than 4 parts (it cannot be an IP address) * Correctly interpret the address and don't always throw when there are numbers > 255 Ref: https://url.spec.whatwg.org/#concept-ipv4-parser Fixes: #10306 url: percent encode fragment to follow spec change Ref: whatwg/url#150 Ref: whatwg/url@373dbed url: fix URL#search setter The check for empty string must be done before removing the leading '?'. Ref: https://url.spec.whatwg.org/#dom-url-search url: set port to null if an empty string is given This is to follow a spec change. Ref: whatwg/url#113 url: fix parsing of paths with Windows drive letter test: update WHATWG URL test fixtures PR-URL: #10317 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
…s in fragment https://bugs.webkit.org/show_bug.cgi?id=163287 Reviewed by Brady Eidson. Source/WebCore: Based on discussion in whatwg/url#150 If that discussion decides to keep the spec as-is (which keeps non-ASCII characters in the fragment to match IE and Edge's behavior, which Chrome has followed for special schemes) then we can revert this change later after enabling the URL parser. Making this change keeps behavior matching Safari and Firefox, as well as Chrome's handling of non-special schemes, such as data URLs. Covered by updated API tests. * platform/URLParser.cpp: (WebCore::URLParser::appendToASCIIBuffer): (WebCore::URLParser::copyURLPartsUntil): (WebCore::URLParser::syntaxViolation): (WebCore::URLParser::currentPosition): (WebCore::URLParser::parse): (WebCore::URLParser::fragmentSyntaxViolation): Deleted. * platform/URLParser.h: No more non-ASCII characters in canonicalized URLs. Tools: * TestWebKitAPI/Tests/WebCore/URLParser.cpp: (TestWebKitAPI::TEST_F): Canonical link: https://commits.webkit.org/181119@main git-svn-id: https://svn.webkit.org/repository/webkit/trunk@207152 268f45cc-cd09-0410-ab3c-d52691b4dbfc
Right now there is a note: "Unfortunately not using percent-encoding is intentional as implementations with majority market share exhibit this behavior."
I think we should change "Append c to url’s fragment." to at least "If url’s scheme is a special scheme, append c to url’s fragment. Otherwise, append the result of running UTF-8 percent encode with the simple encode set on c" or something similar.
Right now Firefox and Safari percent-encode non-ASCII characters in the fragment. Chrome percent-encodes non-ASCII characters in the fragment if the scheme is not a special scheme. Edge does not percent-encode any non-ASCII characters in the fragment.
This change is needed to continue to support percent-encoding non-ascii characters after a '#' in a data URL, which works in Safari, Chrome, and Firefox.
The text was updated successfully, but these errors were encountered: