-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reparsing problem with non-special URL and double-dot path component #415
Comments
What should the correct behavior be? I'm thinking this should serialize to |
According to https://quuz.org/url/liveview.html#a:///a , Though I'm not sure if the "empty host" that |
Looking at other implementations:
|
For https://jsdom.github.io/whatwg-url/#url=YTovYS8uLi8vYQ==&base=YWJvdXQ6Ymxhbms= Safari behaves per spec (and has a reparsing issue), Firefox seems to fail to parse the URL, and Chrome gives I guess when we decide to use relative URL handling we should really set the host to a non-null value. |
Interesting aside: I don't think it fails to parse. As far as I can tell it's parsing it as having protocol Weirder and weirder, |
This is a willfull violation of the URL specification for serialization. It retains spec-compliance for parsing, it aligns with the behavior of Microsoft Edge, and it "fixes" an acknowledged specification bug. See whatwg/url#415 Fixes servo#397
This is a willfull violation of the URL specification for serialization. It retains spec-compliance for parsing, it aligns with the behavior of Microsoft Edge, and it "fixes" an acknowledged specification bug. See whatwg/url#415 Fixes servo#397
This is a willfull violation of the URL specification for serialization. It retains spec-compliance for parsing, it aligns with the behavior of Microsoft Edge, and it "fixes" an acknowledged specification bug. See whatwg/url#415 Fixes servo#397
If I'm reading https://tools.ietf.org/html/rfc3986#section-5.2.2 correctly that has this problem as well. It forbids a scheme followed by an absolute path from taking the form https://tools.ietf.org/html/rfc3986#section-5.3 does not account for it either. I suggested changing the host above, but upon re-reading this issue I rather like what Edge did here as it is isolated to serialization and preserves intent better. |
Adjust the URL serializer so it does not output something which during the next parse operation would yield a host. whatwg-url: ... Tests: ... Fixes #415.
@valenting @achristensen07 this is worth fixing, right? It's only for non-special schemes, but it seems those should be free of idempotency issues too. |
Adjust the URL serializer so it does not output something which during the next parse operation would yield a host. whatwg-url: ... Tests: ... Fixes #415.
See whatwg/url#415 and whatwg/url#505 for context.
@valenting @achristensen07 ping. |
I think it makes sense that non-special URLs should also be idempotent. 👍 |
See whatwg/url#415 and whatwg/url#505 for context.
Adjust the URL serializer so it does not output something which during the next parse operation would yield a host. whatwg-url: jsdom/whatwg-url#148. Tests: web-platform-tests/wpt#25113. Fixes #415.
…testonly Automatic update from web-platform-tests Test non-special URLs are idempotent See whatwg/url#415 and whatwg/url#505 for context. -- wpt-commits: 551c9d604fb8b97d3f8c65793bb047d15baddbc2 wpt-pr: 25113
…testonly Automatic update from web-platform-tests Test non-special URLs are idempotent See whatwg/url#415 and whatwg/url#505 for context. -- wpt-commits: 551c9d604fb8b97d3f8c65793bb047d15baddbc2 wpt-pr: 25113
https://bugs.webkit.org/show_bug.cgi?id=215762 Reviewed by Tim Horton. LayoutTests/imported/w3c: * web-platform-tests/url/a-element-expected.txt: * web-platform-tests/url/a-element-xhtml-expected.txt: * web-platform-tests/url/url-constructor-expected.txt: * web-platform-tests/url/url-setters-expected.txt: Source/WTF: whatwg/url#505 added an interesting edge case to the URL serialization: "If url’s host is null, url’s path’s size is greater than 1, and url’s path[0] is the empty string, then append U+002F (/) followed by U+002E (.) to output." The problem was that URLs like "a:/a/..//a" would be parsed into "a://a" with a pathname of "//a" and an empty host. If "a://a" was then reparsed, it would again have an href of "a://a" but its host would be "a" and it would have an empty path. There is consensus that URL parsing should be idempotent, so we need to do something different here. According to whatwg/url#415 (comment) this follows what Edge did (and then subsequently abandoned when they switched to Chromium) to make URL parsing idempotent by adding "/." before the path in the edge case of a URL with a non-special scheme (not http, https, wss, etc.) and a null host and a non-empty path that has an empty first segment. All the members of the URL remain unchanged except the full serialization (href). This is not important in practice, but important in theory. Our URL parser tries very hard to use the exact same WTF::String object given as input if it can. However, this step is better implemented as a post-processing step that will almost never happen because otherwise we would have to parse the entire path twice to find out if we need to add "./" or if the "./" that may have already been there needs to stay. This is illustrated with the test URL "t:/.//p/../../../..//x" which does need the "./". In the common case, this adds one well-predicted branch to URL parsing, so I expect performance to be unaffected. Since this is such a rare edge case of URLs, I expect no compatibility problems. * wtf/URL.cpp: (WTF::URL::pathStart const): * wtf/URL.h: (WTF::URL::pathStart const): Deleted. * wtf/URLParser.cpp: (WTF::URLParser::copyURLPartsUntil): (WTF::URLParser::URLParser): (WTF::URLParser::needsNonSpecialDotSlash const): (WTF::URLParser::addNonSpecialDotSlash): * wtf/URLParser.h: Tools: * TestWebKitAPI/Tests/WTF/URLParser.cpp: (TestWebKitAPI::TEST_F): Canonical link: https://commits.webkit.org/229956@main git-svn-id: https://svn.webkit.org/repository/webkit/trunk@267837 268f45cc-cd09-0410-ab3c-d52691b4dbfc
When the URL
a:/a/..//a
is parsed, the resulting URL object has no hostname, and has the path//a
c = 'a'
state = scheme start state
buffer = ""
pointer = 0
c = ':'
state = scheme state
buffer = "a"
pointer = 1
scheme = "a"
buffer = ""
state = path or authority state
pointer = 2
pointer = 3
c = 'a'
state = path state
pointer = 2
state = path state
pointer = 3
c = 'a'
buffer = "a"
state = path state
pointer = 4
c = '/'
path = [ "a" ]
buffer = ""
state = path state
pointer = 5
c = '.'
path = [ "a" ]
buffer = "."
state = path state
pointer = 6
c = '.'
path = [ "a" ]
buffer = ".."
state = path state
pointer = 7
c = '/'
path = [""]
buffer = ""
state = path state
pointer = 8
c = '/'
path = ["", ""]
buffer = ""
state = path state
pointer = 9
c = 'a'
path = ["", ""]
buffer = "a"
state = path state
pointer = 10
c = EOF
path = ["", "", "a"]
buffer = ""
When the resulting URL is serialized, it gets serialized as
a://a
, which, if it gets reparsed, gets an empty path and a hostnamea
.Found while trying to find a spec-compliant resolution for servo/rust-url#459
This isn't a problem for special URLs, which always have a host.
The text was updated successfully, but these errors were encountered: