Needs a processing model #27

annevk · 2015-08-19T09:42:11Z

At the moment we have just a list, but no defined processing model for that list. Without such a defined processing model, it's impossible for standards to be accurate in their requirements.

E.g., HTML has "If new value matches a suffix in the Public Suffix List", but neither "matches" nor "suffix" is defined. And Public Suffix List is an opaque blob of data.

We could define this externally, e.g., some suggested it to be defined as part of the URL Standard: https://www.w3.org/Bugs/Public/show_bug.cgi?id=25865. However, it seems better to define this model at the source, no?

sleevi · 2015-08-19T11:21:38Z

There already is - see "Formal Algorithm" in https://publicsuffix.org/list/

annevk · 2015-08-19T12:10:41Z

Ah, thank you. That could be a lot clearer, though that is more than I thought we had. Also, it only mentions Punycode. What about the other requirements of UTS #46?

sleevi · 2015-08-19T12:15:34Z

With the list being in UTF-8, it shouldn't really matter, should it? All that matters is that both list and entry follow the same canonicalization scheme. The mention of Punycode in the algorithm is just so that we can talk in terms of characters not codepoints.

annevk · 2015-08-19T12:31:26Z

UTS #46 is not about text encodings. It's about how to map IDNA to ASCII and vice versa, which depending on the input can be much more involved than just Punycode. E.g., Punycode doesn't take care of xn--.

sleevi · 2015-08-19T14:58:39Z

Sure, but xn-- is still a form of encoding. The point was that it's largely irrelevant - each label is matched independently, and all it needs is to be able to match or not match an encoded rule. The core point is just that Punycode (or any encoding, technically) collapses labels and rules the same way so that we can determine if a label matches a rule.

annevk · 2015-08-19T15:11:07Z

Not really, if you just use Punycode "EXAMPLE.COM" does not match "example.com", for instance.

sleevi · 2015-08-19T15:32:49Z

From the link:

The domain and all rules must be canonicalized in the normal way for hostnames - lower-case, Punycode (RFC 3492).

annevk · 2015-08-19T15:38:02Z

What does lowercase mean? E.g., how do you handle ß?

annevk · 2015-08-19T15:38:33Z

(What I'm saying is that short description is not the normal way. I'm not sure why you keep saying otherwise.)

sleevi · 2015-08-19T15:46:04Z

I'm not sure why you feel that it's ambiguous - it's worded prosaically, such that "lower-case" is a modifier for Punycode.

Written less prosaically, but entirely consistently with the grammar of the existing page:

Canonicalize a hostname:

Punycode
Convert to lower case

There was an intentional decision not to write things as a spec (e.g. you could argue that conversion to lower case is ambiguous and you need to define how each ASCII codepoint is mapped to its lower case canonical counterpart, or you could invoke the RFC 1034 rules for comparing hostnames in a case insensitive manner, or any number of things), but you also have to ask yourself why you are doing it.

Right now, the algorithm itself was expanded from even less prose, as the goal was to make it simple to understand at a glance.
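
The two steps listed above (Punycode, then convert to lower case) might be sketched as follows in Python. This is only an illustration under stated assumptions: `canonicalize` is a hypothetical name, and the standard library's `idna` codec implements the older IDNA2003 rules, which the PSL page does not mandate.

```python
def canonicalize(hostname: str) -> str:
    """Hypothetical helper: ASCII-encode each label, then lower-case the result."""
    ascii_labels = [
        # Assumption: the stdlib "idna" codec (IDNA2003) stands in for
        # "Punycode". It leaves all-ASCII labels untouched, case included.
        label.encode("idna").decode("ascii") if label else label
        for label in hostname.split(".")
    ]
    # Only ASCII remains at this point, so lower-casing is unambiguous.
    return ".".join(ascii_labels).lower()


print(canonicalize("EXAMPLE.COM"))
print(canonicalize("Bücher.example"))
```

Without the final lower-casing step, an all-ASCII name such as "EXAMPLE.COM" would pass through the encoding unchanged and would not compare equal to "example.com".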

annevk · 2015-08-19T15:54:47Z

It's not ambiguous, it's wrong. These are not the rules for domains browsers implement. You did not answer what you do with ß, for instance.

sleevi · 2015-08-19T16:05:06Z

Anne, it sounds like we are talking past each other.

The PSL is used in a large number of contexts outside browsers, so there is no need (nor would it be correct) to take a browser-centric view. The text is correct for browsers, in as much as you canonicalize a rule or a host the same way you do a domain name (for transmission). While that may vary from UA to UA and PSL consumer to PSL consumer, that variance in canonicalization does not affect the algorithm or the results. The lower-case, Punycode is illustrative and not normative to these ends - a consistent canonicalization is all that is needed.

As to your point about ß, I was trying to establish how you're wrong for thinking it is an issue. No implementation will send that over the wire - either IDNs are not supported (LDH rule) or they apply some transformation, such that ß is collapsed to some ASCII sequence conforming to the LDH rule. THAT sequence is then lower-cased. Your hypothetical ambiguity regarding the case folding of that is not due to ambiguity in the grammar, but a failure to read the existing steps, which while unfortunate, can happen no matter what form the steps take (prosaically vs spec). That doesn't mean it is ambiguous - it just means you're wrong :)

While I had hoped it was obvious in my previous explanation without needing to highlight the (intentional?) misreading, at least that again hopefully restates how to parse that sentence, both standalone and in the context from which it was removed. I also wanted to highlight the past discussions around the desired level of formality of the algorithm, and why the prosaic form was chosen, as well as reiterating (as I have several times) why any perceived ambiguity of that sentence is inconsequential to the processing of the algorithm.

annevk · 2015-08-19T16:13:12Z

I didn't mean to take a browser-centric view, I was just giving an example. I think all user agents should have the same domain processing.

Anyway, I guess I understand your POV, but this makes this not as useful as a reference as I'd hoped. But it seems reasonable enough to build something on top of this which can be used by HTML, Storage, URL, etc.

rockdaboot · 2015-08-19T16:46:47Z

BTW, we have an uppercase ß since 2010 (ẞ).

sleevi · 2015-08-19T16:49:40Z

@rockdaboot Yes, but the point was that it doesn't matter for the PSL :)

Whatever canonicalization is employed for domain host names, the same is applied for the list, before processing. That should resolve all such issues of ambiguity related to non-ASCII charsets, and the DNS host name comparison rules (e.g. case insensitive ASCII) apply for matching.
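
A rough Python sketch of that idea: the caller supplies one canonicalization function, and it is applied to both the domain and every rule before any label comparison. The names `public_suffix` and `canon` are hypothetical, and the rule semantics used here (wildcard `*`, exception rules starting with `!`, an implicit `*` rule when nothing matches) are one reading of the "Formal Algorithm" on publicsuffix.org, not something spelled out in this thread.

```python
from typing import Callable, Iterable


def public_suffix(
    domain: str,
    rules: Iterable[str],
    canon: Callable[[str], str] = str.lower,  # assumption: ASCII-only input
) -> str:
    """Return the public suffix of `domain` under `rules` (illustrative only)."""
    labels = canon(domain).split(".")
    suffix_len = 1  # implicit "*" rule: the rightmost label is the suffix
    for raw in rules:
        is_exception = raw.startswith("!")
        # Rules are canonicalized exactly like the domain.
        rule = canon(raw[1:] if is_exception else raw).split(".")
        if len(rule) > len(labels):
            continue
        tail = labels[-len(rule):]
        if all(r == "*" or r == l for r, l in zip(rule, tail)):
            if is_exception:
                # An exception rule wins; drop its leftmost label.
                suffix_len = len(rule) - 1
                break
            suffix_len = max(suffix_len, len(rule))
    return ".".join(labels[len(labels) - suffix_len:])


rules = ["com", "uk", "co.uk", "*.ck", "!www.ck"]
print(public_suffix("EXAMPLE.COM", rules))
print(public_suffix("foo.bar.ck", rules))
```

Swapping in a different `canon` (for instance one that also converts IDN labels to ASCII) changes neither the loop nor the comparison, which is the property being argued for here.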

rockdaboot · 2015-08-19T18:19:28Z

@sleevi Thanks, that's how I understand it. Just wanted to say @annevk that there in fact is an uppercase ß.

annevk · 2015-08-20T05:32:30Z

The point about "ß" is that it's commonly normalized to "ss". Not about uppercase/lowercase.

rockdaboot · 2015-08-20T08:07:05Z

@annevk I guess you are talking about IDNA2003 where ß is normalized to ss. This is a long known potential security problem. That's why we have IDNA2008 (plus UTS #46 extension). Software should move away from IDNA2003. See http://unicode.org/reports/tr46/ (Section 1.3.2 has some explanations/examples).
On Debian (unstable) we still have 89 packages depending on libidn11 (IDNA2003).
(apt-cache rdepends libidn11|grep -c '^ ')
libidn2 (IDNA2008) isn't used at all (except for idn2 utility)
libicu52 (IDNA2008, also has #46): 105 packages
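
The IDNA2003 behaviour for ß can be observed with Python's built-in `idna` codec, which implements IDNA2003 (RFC 3490) and not IDNA2008. A minimal sketch; `to_ascii_idna2003` is a name made up for this example.

```python
def to_ascii_idna2003(label: str) -> bytes:
    """Encode one label with the stdlib codec, which follows IDNA2003."""
    return label.encode("idna")


# Under IDNA2003, nameprep maps "ß" to "ss" before encoding, so the
# result is plain ASCII and no "xn--" prefix is produced.
print(to_ascii_idna2003("straße"))  # b'strasse'
```

An IDNA2008 implementation would instead keep ß and emit an `xn--` label, so the same input can resolve to two different names depending on the library in use.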

annevk · 2015-08-20T08:13:46Z

Yes, I'm well aware of all that. (Though I disagree with your assessment.)

weppos · 2018-02-17T15:02:26Z

@annevk do you have any specific actionable requests to add here? If not, I will close this ticket.

I have already taken note of the fact that the use of UTF-8 here is an issue
https://github.com/publicsuffix/list/wiki/Design-issues
and both @sleevi and @gerv agreed. See also weppos/publicsuffix-go#31

I am considering converting the list to A-labels at some point, along with some other possible improvements. Personally, I already pre-process the list into ASCII in publicsuffix-go, and I will probably implement the same change in the Ruby implementation.

Please let me know if there are any actionable requests I should take note of here.
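
For illustration, pre-processing a rule into ASCII could look roughly like the Python sketch below. It is not the code used in publicsuffix-go or the Ruby library; `rule_to_a_labels` is a hypothetical name, and the standard library's `idna` codec (IDNA2003) is an assumed stand-in for whatever IDNA implementation a consumer actually uses.

```python
def rule_to_a_labels(rule: str) -> str:
    """Hypothetical pre-processing step: convert one PSL rule to ASCII."""
    # Keep the exception marker out of the label conversion.
    prefix = "!" if rule.startswith("!") else ""
    labels = rule[len(prefix):].split(".")
    converted = [
        # Wildcard labels are rule syntax, not hostname labels; keep as-is.
        label if label == "*" else label.encode("idna").decode("ascii")
        for label in labels
    ]
    return prefix + ".".join(converted)


for rule in ["com", "*.ck", "!www.ck", "公司.cn"]:
    print(rule, "->", rule_to_a_labels(rule))
```

Rules that are already ASCII come out unchanged, so running a step like this over the whole list would only touch the IDN entries.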

rockdaboot · 2018-02-17T15:51:47Z

The A-label approach is a good idea, though libidn2 (IDNA2008 + TR46) is pretty widespread these days and utf-8 -> a-label conversion can easily be done. Just want to mention that.

I guess no work is needed for libpsl (accepting A-labels in the PSL), but I will definitely check in the next few days and report back.

annevk · 2018-02-17T16:10:24Z

I think using ASCII for the list would be a good first step. That's what all implementations have to convert to internally anyway.

But I still think a more formalized processing standard would be good to have, with some algorithms and terms that specifications can easily link to (and perhaps some examples of how to do so). E.g., the HTML Standard has a rather hand-wavy PSL step at https://html.spec.whatwg.org/multipage/origin.html#is-a-registrable-domain-suffix-of-or-is-equal-to for which it would be great if we could just link to an algorithm directly.

annevk · 2018-02-17T16:11:39Z

(Now that the site has moved to GitHub I might be able to help with formalizing the algorithm a bit, if that's agreeable.)

annevk · 2020-04-26T14:53:21Z

The URL Standard abstracts this now in https://url.spec.whatwg.org/#host-public-suffix (to be further clarified by whatwg/url#484). That's probably good enough.

annevk mentioned this issue Aug 19, 2015

Host the site on GitHub #28

Closed

mikewest mentioned this issue Sep 11, 2015

Define a host's "public suffix" and "registrable domain" whatwg/url#72

Closed

weppos self-assigned this Feb 17, 2018

weppos added the waiting-followup label (Blocked for need of follow-up) Feb 17, 2018

annevk mentioned this issue Feb 18, 2018

define "registrable domain" "registered domain" on publicsuffix.org/list publicsuffix/publicsuffix.org#12

Open

annevk mentioned this issue Mar 27, 2019

Trailing dots and the PSL. #792

Open

dnsguru mentioned this issue Apr 9, 2020

Could we use project board method to separate list update PRs from administrata #1008

Closed

annevk closed this as completed Apr 26, 2020
