Define a host's "public suffix" and "registrable domain" #72

Closed

@mikewest mikewest commented Sep 11, 2015

This patch attempts to formalize the way in which the Public Suffix List
relates to URLs, thereby making it easier to explain the limitations of
features like `document.domain`, and cookies' somewhat absurd-seeming
behavior across diverse hosts.

This patch does not expose an API to query this concept, as requested in
[1]. Doing so in a future patch should be quite straightforward, however,
if we decide that that's a reasonable thing to do.

[1]: https://www.w3.org/Bugs/Public/show_bug.cgi?id=25865
@mikewest

Not sure how well this is explained, but it's a starting point.

@mikewest

@sleevi and other folks interested in publicsuffix/list#27 should have a look. Hi Ryan! You love the PSL SO MUCH, right? Let's get it into every spec.

To obtain a <var>host</var>'s <a>public suffix</a>, run these steps:

1. If <var>host</var> is an <a for=host>IPv4 address</a> or a <a for=host>IPv6 address</a>
return the empty string.

"If host is not a valid domain, return the empty string" seems to be more accurate (and reflects Chrome's implementation, at least)

If our GURL canonicalizer spits it out, then we spit out the empty string.


It should be "is not a domain" then. "Valid domain" is a special term we tried to create to restrict the syntax of a domain further for the purposes of conformance. But it's not entirely baked yet or something that would be implemented by user agents.
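The first step under discussion could be sketched as follows. This is a minimal illustration, not the patch's or Chrome's implementation: the names `is_ip_address` and `public_suffix_or_empty` are hypothetical, `lookup` stands in for whatever suffix lookup follows, and Python's `ipaddress` module is stricter than the URL Standard's host parser (it rejects forms such as hexadecimal IPv4 parts).

```python
import ipaddress


def is_ip_address(host: str) -> bool:
    """Return True if host parses as an IPv4 or (optionally bracketed) IPv6 address."""
    candidate = host[1:-1] if host.startswith("[") and host.endswith("]") else host
    try:
        ipaddress.ip_address(candidate)
        return True
    except ValueError:
        return False


def public_suffix_or_empty(host: str, lookup) -> str:
    """Return the empty string for hosts that are not domains, else defer to lookup.

    `lookup` is a hypothetical callable standing in for the rest of the algorithm.
    """
    if not host or is_ip_address(host):
        return ""
    return lookup(host)
```

Phrasing the check as "is not a domain" rather than listing IPv4 and IPv6 separately keeps the early return in one place, which is the point of the review comment above.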

@sleevi

sleevi commented Sep 12, 2015

I'm not terribly keen for this, as explained before :)

Does the "Formal Algorithm" for the PSL, as described at https://publicsuffix.org/list/ , not meet the needs? Why not?

@annevk's remarks on publicsuffix/list#27 suggest the reason it wouldn't is that implementations of the PSL algorithm are free, with respect to parsing rules, to use whatever internal storage format they want (either Punycode, as Chrome does, or full Unicode).

The above algorithm doesn't address how rules are handled depending on whether or not they are designated as ICANN (see "Divisions").
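For readers who have not seen it, the rule matching described by the PSL's "Formal Algorithm" can be sketched roughly like this. It is a simplified illustration under stated assumptions: `RULES` is a tiny illustrative excerpt rather than the real list, the function names are hypothetical, input is assumed to be already canonicalized, and the ICANN/private "Divisions" mentioned above are ignored entirely.

```python
# Illustrative rules in PSL syntax: a plain rule, a wildcard rule, and an exception rule.
RULES = ["com", "*.ck", "!www.ck"]


def _matches(rule_labels, host_labels):
    """A rule matches when its labels equal the host's rightmost labels ("*" matches any one label)."""
    if len(rule_labels) > len(host_labels):
        return False
    return all(r in ("*", h) for r, h in zip(reversed(rule_labels), reversed(host_labels)))


def public_suffix(host, rules=RULES):
    host_labels = host.split(".")
    exceptions, normal = [], []
    for rule in rules:
        is_exception = rule.startswith("!")
        labels = (rule[1:] if is_exception else rule).split(".")
        if _matches(labels, host_labels):
            (exceptions if is_exception else normal).append(labels)
    if exceptions:
        # An exception rule prevails; its leftmost label is removed.
        prevailing = max(exceptions, key=len)[1:]
    elif normal:
        # Otherwise the matching rule with the most labels prevails.
        prevailing = max(normal, key=len)
    else:
        # No rule matched: the implicit rule "*" applies.
        prevailing = ["*"]
    if not prevailing:
        return ""
    return ".".join(host_labels[-len(prevailing):])


def registrable_domain(host, rules=RULES):
    """Public suffix plus one more label, or None if the host is itself a public suffix."""
    suffix = public_suffix(host, rules)
    suffix_len = len(suffix.split("."))
    labels = host.split(".")
    if len(labels) <= suffix_len:
        return None
    return ".".join(labels[-(suffix_len + 1):])
```

With these three rules, `registrable_domain("www.example.com")` yields `"example.com"`, while the exception rule makes `www.ck` its own registrable domain even though `*.ck` would otherwise treat it as a public suffix.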

@annevk

annevk commented Sep 13, 2015

@sleevi not only can they use whatever internal storage format they want, they can also use whatever conversion algorithm they want. Both seem problematic.

@sleevi

sleevi commented Sep 13, 2015

@annevk If that's all it would take to just reference the "Formal Algorithm" of the PSL, why not simply incorporate that as a profile here?

That is, something like "Let X be the result of evaluating the formal algorithm of the PSL as if both the rules and the hostname were canonicalized as (IDNA2003, IDNA2008, UTF-8, w/e)".

That is, you're not normatively requiring an implementation that exactly follows the spec's prose description, only something that yields the same outputs for the same inputs, with an added requirement on what the externally observable effects of conversions are.

For example, if my storage was UTF-8, but the spec described it "as if IDNA2008", then prior to sending it to my PSL implementation (which is UTF-8), I'd validate whatever strings were convertible to/from IDNA2008. Doesn't that resolve it?
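The "as if canonicalized" idea can be illustrated with a small sketch. It assumes Python's built-in `idna` codec, which implements IDNA 2003 (not IDNA 2008 or the URL Standard's processing), and the helper names `to_ascii` and `same_name` are hypothetical. The point is only that a rule stored as Unicode and a hostname arriving as Punycode compare equal once both go through the same conversion.

```python
def to_ascii(name: str) -> str:
    """Lower-case a name and convert it to ASCII (A-label) form via the IDNA 2003 codec."""
    # The codec leaves pure-ASCII labels untouched, so lower-case explicitly first.
    return name.lower().encode("idna").decode("ascii")


def same_name(rule: str, host: str) -> bool:
    """Compare a stored rule and an input host as if both were canonicalized identically."""
    try:
        return to_ascii(rule) == to_ascii(host)
    except UnicodeError:
        # A string that cannot be converted never matches.
        return False
```

For instance, `same_name("bücher.example", "XN--BCHER-KVA.example")` is true, regardless of which of the two forms an implementation stores internally.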

@annevk

annevk commented Sep 13, 2015

I think that might work, though it's not exactly pretty. That is, if that is the intended setup, ideally those bits would be pluggable from the call site of the algorithm.

@sleevi

sleevi commented Sep 13, 2015

@annevk We can certainly fix the PSL side to make that possible, if that would help.

My main concern is that we really want there to be "one" PSL algorithm, and since the PSL is used in many more contexts than just browsers, and because the algorithm is tightly coupled to the data format (with respect to rules and sections and such), it seems that algorithm should live at the PSL.

We should make it possible for those to plug in the necessary bits they need to ensure consistency between implementations of their use case (e.g. for browsers), but I think we would naturally want the rule processing algorithm to live with the set of rules :)

@annevk

annevk commented Sep 13, 2015

Okay, that sounds like a decent compromise. (Ideally, of course, there would be no difference on this between user agents implementing the URL Standard and other user agents just dealing with domains, but if it has to be that way, this is as close as we can get.)

@sleevi

sleevi commented Sep 13, 2015

@annevk I agree that would be ideal, but I think the contextual nature of the protocol also matters. For example, if you're processing an e-mail address, you're certainly working on localname@hostname, but what 'hostname' means is not necessarily the same as in the URL Standard (for reasons such as encoding/escaping for the framing SMTP or for the type of name resolution). That's why I was concerned with putting a normative requirement on the PSL to expect external inputs in a particular form (or to be convertible to a particular internal storage form). Ideally, that would be left to the implementer to decide.

But we can certainly add hooks to the PSL so that, however a program implements it, the program has to ensure the input form and storage form match, with a step to validate that condition and a parameter to indicate that the particular form, whatever it is, is at least specified in the 'overall' specification (URL or otherwise).
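One way to picture such a hook is a rule container that takes the canonicalization step as a parameter and applies it to both the stored rules and every input, so the two forms match by construction. Everything here is hypothetical (the `SuffixRules` class, its `form` label, and the two sample rules); it only does exact matching and is not how the PSL or any browser defines the hook.

```python
from typing import Callable, Iterable


def to_ascii(name: str) -> str:
    """One possible caller-supplied form: lower-cased IDNA 2003 A-labels."""
    return name.lower().encode("idna").decode("ascii")


class SuffixRules:
    """Hypothetical container applying one caller-supplied form to rules and inputs alike."""

    def __init__(self, rules: Iterable[str], canonicalize: Callable[[str], str], form: str):
        self.form = form  # name of the form, as defined by the calling specification
        self._canonicalize = canonicalize
        self._rules = {canonicalize(rule) for rule in rules}

    def is_listed(self, host: str) -> bool:
        """Exact-match check only; wildcard and exception rules are out of scope here."""
        return self._canonicalize(host) in self._rules


# Usage: the caller (say, a URL-processing layer) picks the form once.
rules = SuffixRules(["com", "公司.cn"], to_ascii, form="IDNA 2003 A-labels")
```

A mail-processing caller could construct the same container with a different `canonicalize` callable, which is the flexibility being argued for above.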

annevk added a commit to whatwg/html that referenced this pull request Feb 11, 2016
This fixes #670 by no longer special casing IPv6. And rather than
mentioning IPv4 and IPv6 directly, we instead rely on the domain
concept.

It changes the definition of document.domain to rely on an internal
slot rather than somehow changing itself.

It makes use of the host parser rather than “domain to ASCII” which is
way too low-level to be used here.

Potential follow up here is when “registrable domain” gets defined as
per whatwg/url#72 so we no longer have to rely
on Public Suffix directly.
@annevk

annevk commented Feb 15, 2016

@sleevi how would one go about patching/PR'ing the PSL algorithm to take these hooks? I would love to get this defined better.

@sleevi

sleevi commented Feb 15, 2016

One does not simply patch the PSL algorithm: it documents the state of what is shipping in hundreds of unique implementations. These are the normal spec challenges: the world we have vs. the world we want.

@annevk

annevk commented Feb 15, 2016

@sleevi earlier you said we could add hooks.

@sleevi

sleevi commented Feb 15, 2016

Right, and in that time we've been working through issues with implementations, in particular with wildcards, that have exposed the PSL as much more fragile than anticipated. Re-reading these hooks (it has been five months), we can add documentation to the effect of the spec by submitting an issue to https://github.com/publicsuffix/list

@annevk

annevk commented Feb 15, 2016

I don't think much has changed in that algorithm over the past five months. publicsuffix/list#27 is still open; I guess I can open a new issue and close that one if you think that helps?

I still think the Formal Algorithm is hugely confusing in stating "The domain and all rules must be canonicalized in the normal way for hostnames - lower-case, Punycode (RFC 3492)." which is not at all the normal way. E.g., it does not acknowledge that browsers recognize (and normalize away) four types of dots.
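To make the dots point concrete: the four separators are presumably the code points that RFC 3490 (IDNA 2003) requires to be recognized as label separators. A minimal sketch of normalizing them away, with hypothetical helper names, might look like this:

```python
import re

# U+002E full stop, U+3002 ideographic full stop,
# U+FF0E fullwidth full stop, U+FF61 halfwidth ideographic full stop.
DOTS = re.compile("[\u002E\u3002\uFF0E\uFF61]")


def normalize_dots(host: str) -> str:
    """Replace every recognized label separator with an ASCII full stop."""
    return DOTS.sub(".", host)


def split_labels(host: str) -> list:
    """Split a host into labels on any of the four recognized separators."""
    return DOTS.split(host)
```

Lower-casing plus per-label Punycode alone would leave `example\u3002com` as a single label, so a suffix lookup that skips this step would see a different host than a browser does.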

@annevk

annevk commented Feb 20, 2017

We'll leave this to HTML for now and require everyone to reuse the algorithm defined there: whatwg/html#2365.
