Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Percent encoding does not round-trip #523

Closed
karwa opened this issue Jun 2, 2020 · 3 comments · Fixed by #727
Closed

Percent encoding does not round-trip #523

karwa opened this issue Jun 2, 2020 · 3 comments · Fixed by #727
Labels
clarification Standard could be clearer

Comments

@karwa
Copy link
Contributor

karwa commented Jun 2, 2020

Hi!

I'm trying to implement the current version of the spec in Swift (for non-browser applications. The idea is that it's useful to have a URL type that behaves as your browser behaves and accepts/rejects the same things).

One issue that I've noticed is that the current definition of the "UTF8 percent encode" algorithm doesn't round-trip for strings which contain the percent character. For example, following the algorithm, the string "%100" (UTF8: [37, 49, 48, 48]) is not changed at all when encoding (regardless or the percent-encoding set; none of them contain the "%" character itself). However, decoding that same string using the decoding algorithm in the spec results in the UTF8 sequence [16, 48], or "0" (ASCII 0x10 is the unprintable "data link escape" character).

RFC-3986 warns about this:

Because the percent ("%") character serves as the indicator for
percent-encoded octets, it must be percent-encoded as "%25" for that
octet to be used as data within a URI.

Indeed, the encodeURIComponent JS function encodes the percent character:

> encodeURIComponent("%100")
"%25100"

I was ready to submit a PR to have the spec's algorithm also do this, but it appears there is an explicit test that percent characters are not escaped (in this case, in the URL's username component): https://github.com/web-platform-tests/wpt/blob/master/url/resources/urltestdata.json#L2372

    "input": "http://%25DOMAIN:foobar@foodomain.com/",
    "base": "about:blank",
    "href": "http://%25DOMAIN:foobar@foodomain.com/",
    "origin": "http://foodomain.com",
    "protocol": "http:",
    "username": "%25DOMAIN",   // <--- I would expect this to be "%2525DOMAIN"

I'm not sure if this is correct. Usernames are required to be escaped (meaning they must be unescaped to recover their original value), as in the following example:

> new URL("http://;DOMAIN:foobar@foodomain.com/")
...
username: "%3BDOMAIN" // <--- Semicolon is escaped as %3B

However, unescaping the string "%25DOMAIN" as the test expects would not recover the original and result in a different username - "%DOMAIN".

Can anybody confirm the correct behaviour? I think we should be escaping the % character, and that the test is incorrect. If that's not the case (and the current behaviour is correct), this could use a note or warning in the spec.

@annevk
Copy link
Member

annevk commented Jun 2, 2020

See #513 and #518 for some upcoming changes in this area. They don't really change the status quo, but do add percent-encode sets that include U+0025, including one that is an exact match for the one in JavaScript.

And I guess we could add a note that it really depends on the caller which of the various percent-encode sets is safe to use.

@domenic
Copy link
Member

domenic commented Jun 2, 2020

I can confirm that the spec and tests are correct, i.e. they match browsers. See https://jsdom.github.io/whatwg-url/#url=aHR0cDovLyUyNURPTUFJTjpmb29iYXJAZm9vZG9tYWluLmNvbS8=&base=YWJvdXQ6Ymxhbms=

@annevk annevk added the clarification Standard could be clearer label Oct 20, 2021
annevk added a commit that referenced this issue Dec 16, 2022
Also add some <div algorithm> wrappers to this section.

Closes #523.
@annevk
Copy link
Member

annevk commented Dec 19, 2022

For clarity, if your username is %25DOMAIN you would have to percent-encode it before stuffing it in the URL: %2525DOMAIN. The URL parser will not percent-encode % for you. #727 attempts to clarify this. (My example doesn't use username, but the same principle applies.)

annevk added a commit that referenced this issue Dec 20, 2022
Also add some <div algorithm> wrappers to this section.

Closes #523.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clarification Standard could be clearer
Development

Successfully merging a pull request may close this issue.

3 participants