Define a host's "public suffix" and "registrable domain" #72

mikewest · 2015-09-11T10:52:16Z

This patch attempts to formalize the way in which the Public Suffix List
relates to URLs, thereby making it easier to explain the
limitations of features like `document.domain`, and cookies' somewhat
absurd-seeming behavior across diverse hosts.

This patch does not expose an API to query this concept, as requested in
[1]. Doing so in a future patch should be quite straightforward,
however, if we decide that that's a reasonable thing to do.

[1]: https://www.w3.org/Bugs/Public/show_bug.cgi?id=25865

mikewest · 2015-09-11T10:52:37Z

Not sure how well this is explained, but it's a starting point.

mikewest · 2015-09-11T10:56:15Z

@sleevi and other folks interested in publicsuffix/list#27 should have a look. Hi Ryan! You love the PSL SO MUCH, right? Let's get it into every spec.

sleevi · 2015-09-12T00:23:18Z

url.bs

+To obtain a <var>host</var>'s <a>public suffix</a>, run these steps:
+
+1.  If <var>host</var> is an <a for=host>IPv4 address</a> or a <a for=host>IPv6 address</a>
+    return the empty string.


"If host is not a valid domain, return the empty string" seems to be more accurate (and reflects Chrome's implementation, at least)

If our GURL canonicalizer spits it out, then we spit out the empty string.

annevk · 2015-09-13

It should be "is not a domain" then. "Valid domain" is a special term we tried to create to restrict the syntax of a domain further for the purposes of conformance. But it's not entirely baked yet, nor is it something that would be implemented by user agents.
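For illustration, the first step under review, with the suggested "is not a domain" wording, might look roughly like the sketch below. It assumes the host arrives as an already-parsed string, uses Python's `ipaddress` module as a stand-in for the URL Standard's host types (so it will not recognize every IPv4 spelling a URL host parser accepts), and takes a hypothetical `lookup` callback for the actual Public Suffix List matching. None of these names come from the patch.

```python
import ipaddress
from typing import Callable


def is_ip_address(host: str) -> bool:
    """Return True if host parses as an IPv4 or IPv6 address (brackets allowed)."""
    candidate = host[1:-1] if host.startswith("[") and host.endswith("]") else host
    try:
        ipaddress.ip_address(candidate)
        return True
    except ValueError:
        return False


def public_suffix_or_empty(host: str, lookup: Callable[[str], str]) -> str:
    """Hypothetical wrapper: a host that is not a domain has no public suffix."""
    if not host or is_ip_address(host):
        return ""
    # Only domains reach the (assumed) PSL lookup.
    return lookup(host)
```

Keeping the guard outside the lookup means the PSL matching code never has to reason about IP literals at all.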

sleevi · 2015-09-12T00:36:00Z

I'm not terribly keen for this, as explained before :)

Does the "Formal Algorithm" for the PSL, as described at https://publicsuffix.org/list/, not meet the needs? Why not?

@annevk's remarks on publicsuffix/list#27 suggest the reason it wouldn't is that implementations of the PSL Algorithm, with respect to parsing rules, are able to use whatever internal storage format they want (either as punycode - as Chrome does - or as full unicode).

The above algorithm doesn't address how rules are handled depending on whether they are designated as ICANN or not (see "Divisions").
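As a rough sketch of what handling "Divisions" involves: the list file marks its ICANN and private sections with `// ===BEGIN ... DOMAINS===` and `// ===END ... DOMAINS===` comment lines, so a parser can tag each rule with the section it came from. The function name and the sample text below are made up for illustration, and the choice to drop rules that appear outside any section is an assumption.

```python
from typing import Dict, List, Optional


def split_divisions(psl_text: str) -> Dict[str, List[str]]:
    """Group PSL rules by the division (ICANN or private) they appear in."""
    rules: Dict[str, List[str]] = {"icann": [], "private": []}
    current: Optional[str] = None
    for raw in psl_text.splitlines():
        line = raw.strip()
        if line.startswith("// ===BEGIN ICANN DOMAINS==="):
            current = "icann"
        elif line.startswith("// ===BEGIN PRIVATE DOMAINS==="):
            current = "private"
        elif line.startswith("// ===END"):
            current = None
        elif line and not line.startswith("//") and current is not None:
            # A rule line is only read up to the first whitespace.
            rules[current].append(line.split()[0])
    return rules


# Tiny illustrative excerpt, not the real list.
SAMPLE = """\
// ===BEGIN ICANN DOMAINS===
com
*.ck
!www.ck
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
// Example comment line
github.io
// ===END PRIVATE DOMAINS===
"""
```

A caller that only wants registry-controlled suffixes would then match against `split_divisions(text)["icann"]` alone.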

annevk · 2015-09-13T05:33:28Z

@sleevi not only can they use whatever internal storage format they want, they can also use whatever conversion algorithm they want. Both seem problematic.

sleevi · 2015-09-13T05:40:10Z

@annevk If that's all it would take to just reference the "Formal Algorithm" of the PSL, why not simply incorporate that as a profile here?

That is, something like "Let X be the result of evaluating the formal algorithm of the PSL as if both the rules and the hostname were canonicalized as (IDNA2003, IDNA2008, UTF-8, w/e)".

That is, you're not normatively requiring an implementation that exactly follows the spec's prose description, only normatively requiring something that yields the same outputs for the same inputs, with an added requirement on what the externally observable effects are regarding conversions.

For example, if my storage was UTF-8, but the spec described it "as if IDNA2008", then prior to sending it to my PSL implementation (which is UTF-8), I'd validate whatever strings were convertible to/from IDNA2008. Doesn't that resolve it?
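To make the "profile" idea concrete, here is a minimal sketch of the PSL's formal algorithm with the canonicalization step passed in by the caller. It is an illustration under stated assumptions, not what any browser ships: `psl_public_suffix`, `canonicalize_rule`, and `to_ascii_lower` are hypothetical names, and the default canonicalizer uses Python's built-in `idna` codec, which implements IDNA2003 and is exactly the kind of choice a calling specification would have to pin down.

```python
from typing import Callable, Iterable, List, Tuple


def to_ascii_lower(label: str) -> str:
    """Assumed canonical form: lower-cased, Punycode-encoded (IDNA2003 codec)."""
    return label.lower().encode("idna").decode("ascii")


def canonicalize_rule(rule: str, canon: Callable[[str], str]) -> Tuple[bool, List[str]]:
    """Split a rule into (is_exception, labels), leaving '*' labels untouched."""
    is_exception = rule.startswith("!")
    body = rule[1:] if is_exception else rule
    return is_exception, [l if l == "*" else canon(l) for l in body.split(".")]


def matches(rule_labels: List[str], host_labels: List[str]) -> bool:
    """Compare labels right to left; '*' in a rule matches any single label."""
    if len(rule_labels) > len(host_labels):
        return False
    pairs = zip(reversed(rule_labels), reversed(host_labels))
    return all(r == "*" or r == h for r, h in pairs)


def psl_public_suffix(
    host: str,
    rules: Iterable[str],
    canon: Callable[[str], str] = to_ascii_lower,
) -> str:
    """Evaluate the formal algorithm as if host and rules shared one canonical form."""
    host_labels = [canon(l) for l in host.split(".")]
    parsed = [canonicalize_rule(r, canon) for r in rules]
    matching = [(exc, labels) for exc, labels in parsed if matches(labels, host_labels)]
    exceptions = [labels for exc, labels in matching if exc]
    if exceptions:
        # An exception rule prevails, minus its leftmost label.
        prevailing = exceptions[0][1:]
    elif matching:
        # Otherwise the matching rule with the most labels prevails.
        prevailing = max((labels for _, labels in matching), key=len)
    else:
        # No match: the implicit "*" rule applies.
        prevailing = ["*"]
    return ".".join(host_labels[len(host_labels) - len(prevailing):])
```

Because both the host and every rule pass through the same `canon` callback, an implementation that stores rules as UTF-8 could supply a different callback and still be checked against the same inputs and outputs.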

annevk · 2015-09-13T05:46:37Z

I think that might work, though it's not exactly pretty. That is, if that is the intended setup, ideally those bits would be pluggable from the calling site of the algorithm.

sleevi · 2015-09-13T05:56:40Z

@annevk We can certainly fix the PSL side to make that possible, if that would help.

My main concern is that we really want there to be "one" PSL algorithm, and since the PSL lives in many more cases than just browsers, and because the algorithm is tightly coupled to the data format (with respect to rules and sections and such), it seems that algorithm should live at the PSL.

We should make it possible for those to plug in the necessary bits they need to ensure consistency between implementations of their use case (e.g. for browsers), but I think we would naturally want the rule processing algorithm to live with the set of rules :)

annevk · 2015-09-13T06:00:43Z

Okay, that sounds like a decent compromise. (Ideally of course there would be no difference on this between user agents implementing the URL Standard and other user agents just dealing with domains, but if it has to be that way, this is as close as we can get.)

sleevi · 2015-09-13T06:07:14Z

@annevk I agree that would be ideal, but I think the contextual nature of the protocol also matters. For example, if you're processing an e-mail address, you're certainly working on localname@hostname, but what 'hostname' means is not necessarily the same as in the URL standard (for reasons such as encoding/escaping for the framing SMTP or for the type of name resolution). That's why I was concerned with putting a normative requirement on the PSL to expect external inputs in a particular form (or to be convertible to a particular internal storage form). Ideally, that would be left to the implementer to decide.

But we can certainly add hooks to the PSL: require that, however a program implements the PSL, the input form and storage form match; add a step to validate that condition; and add a parameter indicating which form is used, so that the form is at least specified in the 'overall' specification (URL or otherwise).

This fixes #670 by no longer special casing IPv6. And rather than mentioning IPv4 and IPv6 directly, we instead rely on the domain concept. It changes the definition of document.domain to rely on an internal slot rather than somehow changing itself. It makes use of the host parser rather than “domain to ASCII” which is way too low-level to be used here. Potential follow up here is when “registrable domain” gets defined as per whatwg/url#72 so we no longer have to rely on Public Suffix directly.

annevk · 2016-02-15T13:23:20Z

@sleevi how would one go about patching/PR'ing the PSL algorithm to take these hooks? I would love to get this defined better.

sleevi · 2016-02-15T13:31:03Z

One does not simply patch the PSL algorithm - it documents the state of what is shipping in hundreds of unique implementations. The normal spec challenges - the world we have vs what we want.

annevk · 2016-02-15T13:42:42Z

@sleevi earlier you said we could add hooks.

sleevi · 2016-02-15T13:48:14Z

Right, and in that time we've been working through issues with implementations, in particular with wildcards, that have exposed the PSL as much more fragile than anticipated. Re-reading these hooks (... It has been five months), we can add documentation to the effect of the spec by submitting an Issue to https://github.com/publicsuffix/list.

annevk · 2016-02-15T14:04:35Z

I don't think much has changed in that algorithm over the past five months. publicsuffix/list#27 is still open. I guess I can open a new issue and close that one if you think that helps?

I still think the Formal Algorithm is hugely confusing by stating "The domain and all rules must be canonicalized in the normal way for hostnames - lower-case, Punycode (RFC 3492)." which is not at all the normal way. E.g., that does not acknowledge browsers recognize (and normalize away) four types of dots.
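As a small illustration of the dot problem: the four separators are presumably the ones IDNA2003 (RFC 3490) requires implementations to recognize, namely U+002E, U+3002, U+FF0E and U+FF61. A sketch of normalizing them away before any PSL matching, with hypothetical helper names, could look like this:

```python
import re
from typing import List

# U+002E full stop, U+3002 ideographic full stop,
# U+FF0E fullwidth full stop, U+FF61 halfwidth ideographic full stop.
LABEL_SEPARATORS = re.compile("[\u002e\u3002\uff0e\uff61]")


def normalize_dots(host: str) -> str:
    """Replace every recognized label separator with an ASCII full stop."""
    return LABEL_SEPARATORS.sub(".", host)


def split_labels(host: str) -> List[str]:
    """Split a host into labels on any of the four separators."""
    return LABEL_SEPARATORS.split(host)
```

An implementation that only lower-cases and Punycode-encodes would treat `example\u3002com` as a single label, which is the mismatch the comment above points at.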

annevk · 2017-02-20T10:38:05Z

We'll leave this to HTML for now and require everyone to reuse the algorithm defined there: whatwg/html#2365.

sleevi reviewed Sep 12, 2015
annevk mentioned this pull request Feb 11, 2016

Revamp the way document.domain is defined whatwg/html#678

Merged

mikewest mentioned this pull request Feb 17, 2017

Refactor the document.domain attribute setter as a standalone algorithm whatwg/html#2365

Merged

annevk closed this Feb 20, 2017

This was referenced May 24, 2018

"If the given value is not a registrable domain ..." whatwg/html#3711

Closed

Define Cross-Origin-Resource-Policy response header whatwg/fetch#733

Merged

mikewest mentioned this pull request May 25, 2018

Define hosts' public suffix and registrable domain. #391

Merged
