How should "everything after the scheme" URLs work? #385
Comments
I'd lean toward (1) under the theory that there are likely registered schemes where percent-decoding and white-space stripping are inappropriate.
Closes #19. This technique is inspired by the data: URL processor's initial steps: https://fetch.spec.whatwg.org/commit-snapshots/7307d282dd7d1293d5697d63f73522007849e0db/#data-url-processor. Whether or not this technique is ideal is an open question. See whatwg/url#385.
See also: nodejs/node#35434 (comment)
I'm puzzling over (my) characterisation of the WHATWG resolution and this issue came to mind. Let's look at the properties of parsed/resolved URLs:
These properties are natural consequences of the protocols. For non-special URLs the parser/resolver uses the 'cannot-be-a-base-url' flag to decide if the URL is a base URL. This amounts to the following:
So I think it makes sense to define what is and what is not a base URL, based on the protocol only.
That requires a hardcoded list of protocols and their associated URL 'type' (i.e. parsing/resolving behaviour) though. Just some ideas.
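To make the flag's current behaviour concrete, here is a minimal TypeScript sketch, assuming a runtime that ships the WHATWG `URL` class (a browser or Node.js). `canBeBase` is a hypothetical helper, not something the spec defines; it only probes whether a relative reference resolves against the given base.

```typescript
// Hypothetical probe: does resolving a relative reference against `base` succeed?
function canBeBase(base: string): boolean {
  try {
    new URL("x", base); // throws TypeError when the base cannot be used
    return true;
  } catch {
    return false;
  }
}

console.log(canBeBase("mailto:a@b"));          // false
console.log(canBeBase("mailto:/a/b"));         // true
console.log(canBeBase("javascript:alert(1)")); // false
console.log(new URL("c", "mailto:/a/b").href); // "mailto:/a/c"
```

Today the outcome depends on the shape of the URL rather than on the scheme alone: `mailto:a@b` is not usable as a base, while `mailto:/a/b` is.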
Having a largely protocol-agnostic parser is a design goal. Having to tweak the parser or getting different parser outcomes over time is far from ideal. (While at the moment this still happens due to convergence between implementations, my hope is that long term it won't.)
Completely agree, however it does seem accurate to distinguish a few categories. It is very strange to apply path normalisation to javascript URLs, for example. I think there is a consistent, more general pattern here.
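A small TypeScript sketch of the path normalisation in question, assuming a runtime with the WHATWG `URL` class. The helper name `serialized` and the sample inputs are illustrative only.

```typescript
// Hypothetical helper: parse with the WHATWG URL parser and return the serialization.
function serialized(input: string): string {
  return new URL(input).href;
}

// A javascript: URL whose data starts with "/" gets a regular path,
// so dot segments are collapsed like in any hierarchical URL.
console.log(serialized("javascript:/a/../alert(1)")); // "javascript:/alert(1)"

// Without the leading "/", the path is opaque and left untouched.
console.log(serialized("javascript:a/../alert(1)"));  // "javascript:a/../alert(1)"
```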
Maybe it makes sense to define a few “special exceptions” like …

However, maybe it also makes sense to allow for implementations that give value to specific URLs to interpret and parse them specially. I know that …

Of course, that would be awful in a way, because then different implementations would parse the same URL differently, so people couldn’t rely on manipulating URLs working the same way across implementations, which is what this spec is aiming to solve.

Maybe a good approach could be to establish a (limited) set of normalization rules that can be applied to URLs by implementations, enforcing specific normalization rules for certain URLs like …

So, for example, the spec could allow implementations to change the port of URLs freely depending on the scheme without requiring it to be fetched and redirected (as long as they do it consistently), then e.g. …

Some other modifications and normalizations could likewise be done in a similar way, by being required for well‐known URLs, and allowed for other URLs. The key here, I think, is that the set of normalization rules that can even be applied to URLs is already well known beforehand and is not arbitrary, so it is possible for authors to enjoy a consistent URL handling across implementations.
There are several URL types that are basically of the form `scheme:<some arbitrary data>`. For example, `data:`, `mailto:`, `javascript:`, and `urn:`.

The question is, how should software process these URLs? I see three main models:
- … `scheme:`, then look at everything after that.
- … `data:` URL processor spec works
- … `javascript:` URL processing is specced (although I don't think we have extensive tests in that area)
- … `<some arbitrary data>` contains `?`s or `#`s, you have to model that as allowing queries and fragments, and then processing `${path}?${query}#${fragment}`. Whereas (2) just lets you process the whole string at once.

An interesting example contrasting (2) and (3) is the following:
`javascript://somehost/%0Aalert(1)`

- … `//somehost/\nalert(1)` is interpreted as a comment followed by an alert.
- … `javascript:` URLs.

Another example is that `mailto:///d@domenic.me` is interpreted as containing a `<some data here>` of `///d@domenic.me` in (2) and a path of `/d@domenic.me` in (3). Maybe not relevant since I doubt many mail clients will let you send email to such an address?

There are probably more interesting examples of this sort.
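The two examples above can be reproduced with a short TypeScript sketch, assuming a runtime with the WHATWG `URL` class. The three helper functions are hypothetical names, one per model, and `decodeURIComponent` is only a rough stand-in for the spec's percent-decode step (unlike the spec algorithm, it throws on malformed sequences).

```typescript
// Model (1), sketched: split the raw input on the first ":" and keep the rest untouched.
function afterSchemeRaw(input: string): string {
  const colon = input.indexOf(":");
  if (colon === -1) throw new TypeError("no scheme delimiter found");
  return input.slice(colon + 1);
}

// Model (2), sketched: parse, serialize, then strip the leading "scheme:".
function afterSchemeSerialized(input: string): string {
  const url = new URL(input); // throws TypeError on parse failure
  return url.href.slice(url.protocol.length); // protocol includes the ":"
}

// Model (3), sketched: parse, then work with the individual components.
function components(input: string) {
  const url = new URL(input);
  return { host: url.host, path: url.pathname, query: url.search, fragment: url.hash };
}

const js = "javascript://somehost/%0Aalert(1)";
// (2) sees one string; after decoding it is a line comment followed by alert(1).
console.log(JSON.stringify(decodeURIComponent(afterSchemeSerialized(js))));
// "//somehost/\nalert(1)"
// (3) sees a host and a path as separate pieces.
console.log(components(js).host, components(js).path); // somehost /%0Aalert(1)

const mail = "mailto:///d@domenic.me";
console.log(afterSchemeRaw(mail));        // "///d@domenic.me"
console.log(afterSchemeSerialized(mail)); // "///d@domenic.me"
console.log(components(mail).path);       // "/d@domenic.me"
```

For these two inputs (1) and (2) agree before decoding; they can diverge whenever the parser normalizes the input, for example by stripping leading and trailing whitespace.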
The purpose of this thread is to gather community thoughts on these scenarios, with an eye toward setting a precedent for future such schemes, and providing recommendations for software that processes such URLs (including both the web's specced `data:` and `javascript:`, and other schemes like `mailto:` or `urn:`).

If we decide (2) is better, we should provide better spec support for it, including helper operations and explicit recommendations to continue doing this pattern. If we decide (3) is better, we should do the same, and we should either explicitly note `data:` and `javascript:`'s processing models as legacy, or we should try to change them (which might be possible if interop is bad).

/ccing some people who might have thoughts: @mnot @jasnell @sleevi @masinter