Define a host's "public suffix" and "registrable domain" #72
This patch attempts to formalize the way in which the Public Suffix List relates to URLs, thereby making it easier to explain the limitations of features like `document.domain`, and cookies' somewhat absurd-seeming behavior across diverse hosts. This patch does not expose an API to query this concept, as requested in [1]. Doing so in a future patch should be quite straightforward, however, if we decide that that's a reasonable thing to do. [1]: https://www.w3.org/Bugs/Public/show_bug.cgi?id=25865
Not sure how well this is explained, but it's a starting point.
@sleevi and other folks interested in publicsuffix/list#27 should have a look. Hi Ryan! You love the PSL SO MUCH, right? Let's get it into every spec.
To obtain a <var>host</var>'s <a>public suffix</a>, run these steps:

1. If <var>host</var> is an <a for=host>IPv4 address</a> or a <a for=host>IPv6 address</a>
   return the empty string.
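
For orientation, here is a rough Python sketch of how step 1 could combine with the rule matching described in the "Formal Algorithm" at https://publicsuffix.org/list/. It is an illustration under assumptions, not the patch's text: `RULES` is a tiny made-up subset of the list, `is_ip_address` and `public_suffix` are hypothetical names, and the sketch ignores IDNA conversion, the ICANN/private divisions, and the unusual IPv4 spellings a URL host parser accepts.

```python
import ipaddress

# Hypothetical, tiny rule set for illustration only; the real list is
# published at https://publicsuffix.org/list/.
RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}


def is_ip_address(host: str) -> bool:
    """Return True for textual IPv4 addresses and (bracketed) IPv6 addresses."""
    candidate = host[1:-1] if host.startswith("[") and host.endswith("]") else host
    try:
        ipaddress.ip_address(candidate)
        return True
    except ValueError:
        return False


def public_suffix(host: str) -> str:
    # Step 1 of the excerpt above: IP addresses have no public suffix.
    if is_ip_address(host):
        return ""
    labels = host.lower().rstrip(".").split(".")
    # All suffixes of the host, longest first.
    suffixes = [labels[i:] for i in range(len(labels))]
    # Exception rules ("!...") prevail; the suffix is the rule minus its
    # leftmost label.
    for s in suffixes:
        if "!" + ".".join(s) in RULES:
            return ".".join(s[1:])
    # Otherwise the matching rule with the most labels prevails.
    for s in suffixes:
        exact = ".".join(s)
        wildcard = ".".join(["*"] + s[1:])
        if exact in RULES or wildcard in RULES:
            return exact
    # No rule matched: the implicit default rule "*" applies.
    return labels[-1]
```

With this toy rule set, `public_suffix("www.example.co.uk")` yields `"co.uk"`, while `public_suffix("127.0.0.1")` and `public_suffix("[::1]")` both yield the empty string.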
"If host is not a valid domain, return the empty string" seems to be more accurate (and reflects Chrome's implementation, at least)
If our GURL canonicalizer spits it out, then we spit out the empty string.
It should be "is not a domain" then. "Valid domain" is a special term we tried to create to restrict the syntax of a domain further for the purposes of conformance. But it's not entirely baked yet or something that would be implemented by user agents.
I'm not terribly keen on this, as explained before :) Does the "Formal Algorithm" for the PSL, as described at https://publicsuffix.org/list/, not meet the needs? Why not? @annevk's remarks on publicsuffix/list#27 suggest the reason it wouldn't is that implementations of the PSL Algorithm, with respect to parsing rules, are able to use whatever internal storage format they want (either as punycode, as Chrome does, or as full Unicode). The above algorithm also doesn't cover how rules are handled depending on whether or not they are delegated by ICANN (see "Divisions").
@sleevi not only can they use whatever internal storage format they want, they can also use whatever conversion algorithm they want. Both seem problematic.
@annevk If that's all it would take to just reference the "Formal Algorithm" of the PSL, why not simply incorporate that as a profile here? That is, something like "Let X be the result of evaluating the formal algorithm of the PSL as if both the rules and the hostname were canonicalized as (IDNA2003, IDNA2008, UTF-8, w/e)". That is, you're not normatively requiring an implementation that exactly follows the spec's prose description, only normatively requiring something that yields the same outputs for the same inputs, with an added requirement on what the externally observable effects are regarding conversions. For example, if my storage was UTF-8, but the spec described it "as if IDNA2008", then prior to sending it to my PSL implementation (which is UTF-8), I'd validate whatever strings were convertible to/from IDNA2008. Doesn't that resolve it?
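
A loose Python sketch of what such a profile could look like: the caller names one canonical form, and both the rules and the hostname are passed through it before matching. Everything here is assumed for illustration. `make_suffix_lookup` and `to_ascii_idna2003` are hypothetical names, the rule set is made up, wildcard and exception rules are left out, and Python's built-in `idna` codec (which implements IDNA2003) merely stands in for whichever conversion a spec would actually pick.

```python
from typing import Callable, Iterable

Canonicalizer = Callable[[str], str]


def to_ascii_idna2003(name: str) -> str:
    """Canonicalize with Python's built-in IDNA2003 codec, then lowercase."""
    return name.encode("idna").decode("ascii").lower()


def make_suffix_lookup(rules: Iterable[str],
                       canonicalize: Canonicalizer) -> Callable[[str], str]:
    """Build a lookup that behaves as if rules and hosts shared one form."""
    canonical_rules = {canonicalize(rule) for rule in rules}

    def lookup(host: str) -> str:
        labels = canonicalize(host).split(".")
        # Try the longest candidate suffix first (plain rules only).
        for i in range(len(labels)):
            candidate = ".".join(labels[i:])
            if candidate in canonical_rules:
                return candidate
        return labels[-1]  # implicit default rule "*"

    return lookup


# Hypothetical rules, written in Unicode form on purpose.
lookup = make_suffix_lookup({"example", "bücher.example"}, to_ascii_idna2003)
```

Because both sides go through the same function, `lookup("shop.bücher.example")` and a lookup of the already-converted ASCII form of that host return the same suffix, regardless of how the rules were written down.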
I think that might work, though it's not exactly pretty. That is, if that is the intended setup, ideally those bits would be pluggable from the calling site of the algorithm.
@annevk We can certainly fix the PSL side to make it so, if that would help. My main concern is that we really want there to be "one" PSL algorithm, and since the PSL lives in many more cases than just browsers, and because the algorithm is tightly coupled to the data format (with respect to rules and sections and such), it seems that algorithm should live at the PSL. We should make it possible for those to plug in the necessary bits they need to ensure consistency between implementations of their use case (e.g. for browsers), but I think we would naturally want the rule processing algorithm to live with the set of rules :)
Okay, that sounds like a decent compromise. (Ideally of course there would be no difference on this between user agents implementing the URL Standard and other user agents just dealing with domains, but if it has to be that way, this is as close as we can get.)
@annevk I agree that would be ideal, but I think the contextual nature of the protocol also matters. For example, if you're processing an e-mail address, you're certainly working on localname@hostname, but what 'hostname' means is not necessarily the same as in the URL standard (for reasons such as encoding/escaping for the framing SMTP or for the type of name resolution). That's why I was concerned with putting a normative requirement on the PSL to expect external inputs in a particular form (or to be convertible to a particular internal storage form). Ideally, that would be left to the implementer to decide. But we can certainly add hooks to the PSL to ensure that however a program implements the PSL, it needs to ensure the input form and storage form match, to add a step to validate that condition, and to add a parameter to indicate that, whatever that particular form is, that form is at least specified in the 'overall' specification (URL or otherwise).
This fixes #670 by no longer special casing IPv6. And rather than mentioning IPv4 and IPv6 directly, we instead rely on the domain concept. It changes the definition of document.domain to rely on an internal slot rather than somehow changing itself. It makes use of the host parser rather than “domain to ASCII” which is way too low-level to be used here. Potential follow up here is when “registrable domain” gets defined as per whatwg/url#72 so we no longer have to rely on Public Suffix directly.
@sleevi how would one go about patching/PR'ing the PSL algorithm to take these hooks? I would love to get this defined better.
One does not simply patch the PSL algorithm - it documents the state of what is shipping in hundreds of unique implementations. The normal spec challenges - the world we have vs what we want.
@sleevi earlier you said we could add hooks.
Right, and in that time we've been working through issues with implementations, in particular with wildcards, that have exposed the PSL as much more fragile than anticipated. Re-reading these hooks (... it has been five months), we can add documentation to the effect of the spec by submitting an Issue to https://github.com/publicsuffix/list
I don't think much has changed in that algorithm over the past five months. publicsuffix/list#27 is still open; I guess I can open a new issue and close that one if you think that helps? I still think the Formal Algorithm is hugely confusing by stating "The domain and all rules must be canonicalized in the normal way for hostnames - lower-case, Punycode (RFC 3492)." which is not at all the normal way. E.g., that does not acknowledge that browsers recognize (and normalize away) four types of dots.
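
To make the point about dots concrete, here is a small Python sketch. It assumes the four label separators listed in RFC 3490 (U+002E, U+3002, U+FF0E, U+FF61) are the ones meant; the names `LABEL_SEPARATORS` and `split_labels` are hypothetical. It shows that a "lower-case, then split on `.`" reading of the Formal Algorithm sees a single label where a separator-aware split sees three.

```python
import re

# Label separators per RFC 3490 (IDNA2003): full stop, ideographic full stop,
# fullwidth full stop, halfwidth ideographic full stop.
LABEL_SEPARATORS = re.compile("[\u002e\u3002\uff0e\uff61]")


def split_labels(host: str) -> list[str]:
    """Lowercase the host and split it on any of the four separators."""
    return LABEL_SEPARATORS.split(host.lower())


host = "www\u3002example\uff0ecom"

# Lower-casing alone leaves the non-ASCII dots in place: one big label.
naive_labels = host.lower().split(".")

# Treating all four separators alike recovers the intended labels.
aware_labels = split_labels(host)
```

Here `naive_labels` has a single element, while `aware_labels` is `["www", "example", "com"]`, so the two readings would match different PSL rules for the same input.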
We'll leave this to HTML for now and require everyone to reuse the algorithm defined there: whatwg/html#2365.