Needs a processing model #27

Closed
annevk opened this issue Aug 19, 2015 · 24 comments
Comments

@annevk

annevk commented Aug 19, 2015

At the moment we have just a list, but no defined processing model for that list. Without such a defined processing model, it's impossible for standards to be accurate in their requirements.

E.g., HTML has "If new value matches a suffix in the Public Suffix List", but neither "matches" nor "suffix" is defined. And Public Suffix List is an opaque blob of data.

We could define this externally, e.g., some suggested it to be defined as part of the URL Standard: https://www.w3.org/Bugs/Public/show_bug.cgi?id=25865. However, it seems better to define this model at the source, no?

@sleevi
Contributor

sleevi commented Aug 19, 2015

There already is - see "Formal Algorithm" in https://publicsuffix.org/list/

@annevk
Author

annevk commented Aug 19, 2015

Ah, thank you. That could be a lot clearer, though that is more than I thought we had. Also, it only mentions Punycode. What about the other requirements of UTS #46?

@sleevi
Contributor

sleevi commented Aug 19, 2015

With the list being in UTF-8, it shouldn't really matter, should it? All that matters is that both list and entry follow the same canonicalization scheme. The mention of Punycode in the algorithm is just so that we can talk in terms of characters not codepoints.

@annevk
Author

annevk commented Aug 19, 2015

UTS #46 is not about text encodings. It's about how to map IDNA to ASCII and vice versa, which depending on the input can be much more involved than just Punycode. E.g., Punycode doesn't take care of xn--.

@sleevi
Contributor

sleevi commented Aug 19, 2015

Sure, but xn-- is still a form of encoding. The point was that it's largely irrelevant: each label is matched independently, and all it needs is to be able to match or not match an encoded rule. The core point is just that Punycode (or any encoding, technically) collapses labels and rules the same way so that we can determine if a label matches a rule.
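For readers following along, the label-by-label matching discussed here can be sketched roughly as follows. This is an illustration based on the "Formal Algorithm" summary at https://publicsuffix.org/list/ (wildcard `*` labels, `!` exception rules, the rule with the most labels prevailing, and an implicit `*` rule when nothing matches), not anything posted in this thread. The function names and the sample rules are hypothetical, and both the rules and the domain are assumed to be canonicalized the same way already.

```python
def matches(rule, domain):
    """True if each label of `rule` equals the corresponding rightmost
    label of `domain`; a '*' rule label matches any single label."""
    rule_labels = rule.split(".")[::-1]
    domain_labels = domain.split(".")[::-1]
    if len(rule_labels) > len(domain_labels):
        return False
    return all(r == "*" or r == d for r, d in zip(rule_labels, domain_labels))


def public_suffix(domain, rules):
    """Return the public suffix of `domain` under the prevailing rule."""
    candidates = [r for r in rules if matches(r.lstrip("!"), domain)]
    exceptions = [r for r in candidates if r.startswith("!")]
    if exceptions:
        # An exception rule prevails; its leftmost label is dropped.
        n = len(exceptions[0].split(".")) - 1
    elif candidates:
        # Otherwise the matching rule with the most labels prevails.
        n = max(len(r.split(".")) for r in candidates)
    else:
        # No rule matched: fall back to the implicit "*" rule.
        n = 1
    return ".".join(domain.split(".")[-n:])


# Hypothetical sample rules, already lower-case ASCII.
SAMPLE_RULES = ["com", "uk", "co.uk", "*.ck", "!www.ck"]
print(public_suffix("www.example.com", SAMPLE_RULES))
print(public_suffix("a.b.co.uk", SAMPLE_RULES))
print(public_suffix("foo.bar.ck", SAMPLE_RULES))
```

Note that the comparison is plain string equality per label, which is why the two sides need identical canonicalization before matching.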

@annevk
Author

annevk commented Aug 19, 2015

Not really, if you just use Punycode "EXAMPLE.COM" does not match "example.com", for instance.

@sleevi
Contributor

sleevi commented Aug 19, 2015

From the link:

The domain and all rules must be canonicalized in the normal way for hostnames - lower-case, Punycode (RFC 3492).

@annevk
Author

annevk commented Aug 19, 2015

What does lowercase mean? E.g., how do you handle ß?

@annevk
Author

annevk commented Aug 19, 2015

(What I'm saying is that short description is not the normal way. I'm not sure why you keep saying otherwise.)

@sleevi
Contributor

sleevi commented Aug 19, 2015

I'm not sure why you feel that it's ambiguous - it's worded prosaically, such that "lower-case" is a modifier for Punycode.

Written less prosaically, but entirely consistently with the grammar of the existing page:

Canonicalize a hostname:

  1. Punycode
  2. Convert to lower case

There was an intentional decision not to write things as a spec (e.g. you could argue that conversion to lower case is ambiguous and you need to define how each ASCII codepoint is mapped to its lower case canonical counterpart, or you could invoke the RFC 1034 rules for comparing hostnames in a case insensitive manner, or any number of things), but you also have to ask yourself why you are doing it.

Right now, the algorithm itself was expanded from even less prose, as the goal was to make it simple to understand at a glance.
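As a rough illustration of the two-step description above (Punycode first, then ASCII lower-casing), here is a minimal sketch using Python's standard `punycode` codec. The `canonicalize` helper is hypothetical, and this is deliberately not UTS #46 or IDNA2008 processing: it applies no Unicode mapping, normalization, or validity checks, which is the gap being debated in this thread.

```python
def canonicalize(hostname):
    """Sketch only: Punycode-encode non-ASCII labels, then lower-case."""
    out = []
    for label in hostname.split("."):
        if label.isascii():
            ascii_label = label
        else:
            # Raw RFC 3492 encoding plus the ACE prefix; no Unicode
            # case mapping or normalization happens before this step.
            ascii_label = "xn--" + label.encode("punycode").decode("ascii")
        # Step 2: lower-case the now ASCII-only label.
        out.append(ascii_label.lower())
    return ".".join(out)


print(canonicalize("EXAMPLE.COM"))
print(canonicalize("faß.de"))
```

Under this sketch the lower-casing only ever touches ASCII characters, so non-ASCII input that differs only by case would still yield different encoded labels unless it is mapped beforehand.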

@annevk
Author

annevk commented Aug 19, 2015

It's not ambiguous, it's wrong. These are not the rules for domains browsers implement. You did not answer what you do with ß, for instance.

@sleevi
Contributor

sleevi commented Aug 19, 2015

Anne, it sounds like we are talking past each other.

The PSL is used in a large number of contexts outside browsers, so there is no need (nor would it be correct) to take a browser-centric view. The text is correct for browsers, inasmuch as you canonicalize a rule or a host the same way you do a domain name (for transmission). While that may vary from UA to UA and PSL consumer to PSL consumer, that variance in canonicalization does not affect the algorithm or the results. The lower-case, Punycode is illustrative and not normative to these ends - a consistent canonicalization is all that is needed.

As to your point about ß, I was trying to establish how you're wrong for thinking it is an issue. No implementation will send that over the wire - either IDNs are not supported (LDH rule) or they apply some transformation, such that ß is collapsed to some ASCII sequence conforming to the LDH rule. THAT sequence is then lower-cased. Your hypothetical ambiguity regarding the case folding of that is not due to ambiguity in the grammar, but a failure to read the existing steps, which while unfortunate, can happen no matter what form the steps take (prosaically vs spec). That doesn't mean it is ambiguous - it just means you're wrong :)

While I had hoped it was obvious in my previous explanation without needing to highlight the (intentional?) misreading, at least that again hopefully restates how to parse that sentence, both standalone and in the context from which it was removed. I also wanted to highlight the past discussions around the desired level of formality of the algorithm, and why the prosaic form was chosen, as well as reiterating (as I have several times) why any perceived ambiguity of that sentence is inconsequential to the processing of the algorithm.

@annevk
Author

annevk commented Aug 19, 2015

I didn't mean to take a browser-centric view, I was just giving an example. I think all user agents should have the same domain processing.

Anyway, I guess I understand your POV, but this makes this not as useful as a reference as I'd hoped. But it seems reasonable enough to build something on top of this which can be used by HTML, Storage, URL, etc.

@rockdaboot
Contributor

BTW, we have an uppercase ß since 2010 (ẞ).

@sleevi
Contributor

sleevi commented Aug 19, 2015

@rockdaboot Yes, but the point was that it doesn't matter for the PSL :)

Whatever canonicalization is employed for domain host names, the same is applied for the list, before processing. That should resolve all such issues of ambiguity related to non-ASCII charsets, and the DNS host name comparison rules (e.g. case insensitive ASCII) apply for matching.

@rockdaboot
Contributor

@sleevi Thanks, that's how I understand it. Just wanted to say @annevk that there in fact is an uppercase ß.

@annevk
Author

annevk commented Aug 20, 2015

The point about "ß" is that it's commonly normalized to "ss". Not about uppercase/lowercase.

@rockdaboot
Contributor

@annevk I guess you are talking about IDNA2003 where ß is normalized to ss. This is a long known potential security problem. That's why we have IDNA2008 (plus UTS #46 extension). Software should move away from IDNA2003. See http://unicode.org/reports/tr46/ (Section 1.3.2 has some explanations/examples).
On Debian (unstable) we still have 89 packages depending on libidn11 (IDNA2003).
(apt-cache rdepends libidn11|grep -c '^ ')
libidn2 (IDNA2008) isn't used at all (except for idn2 utility)
libicu52 (IDNA2008, also has #46): 105 packages
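To make the ß difference concrete, here is a small sketch using only Python's standard library. Python's built-in `idna` codec implements IDNA2003 (RFC 3490 with Nameprep), so it maps ß to "ss". The second result encodes the label directly with the `punycode` codec, which is assumed here as a stand-in for what non-transitional IDNA2008 / UTS #46 processing produces for this particular label; it is not a full IDNA2008 implementation.

```python
host = "faß.de"

# IDNA2003 via the built-in codec: Nameprep folds "ß" to "ss".
idna2003 = host.encode("idna").decode("ascii")

# Direct Punycode per label, keeping "ß" as a distinct character.
raw_punycode = ".".join(
    label if label.isascii()
    else "xn--" + label.encode("punycode").decode("ascii")
    for label in host.split(".")
)

print(idna2003)
print(raw_punycode)
```

The two results are different ASCII names, so a list and a lookup that disagree on which scheme to use would not compare equal for such labels.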

@annevk
Author

annevk commented Aug 20, 2015

Yes, I'm well aware of all that. (Though I disagree with your assessment.)

@weppos
Member

weppos commented Feb 17, 2018

@annevk do you have any specific actionable requests to add here? If not, I will close this ticket.

I have already taken note of the fact that the use of UTF-8 here is an issue
https://github.com/publicsuffix/list/wiki/Design-issues
and both @sleevi and @gerv agreed. See also weppos/publicsuffix-go#31

I am considering converting the list to A-labels at some point, along with some other possible improvements. Personally, I already pre-process the list into ASCII in publicsuffix-go, and I will probably implement the same change in the Ruby implementation.

Please let me know if there are any actionable requests I should take note of here.

@weppos weppos self-assigned this Feb 17, 2018
@weppos weppos added the waiting-followup Blocked for need of follow-up label Feb 17, 2018
@rockdaboot
Contributor

The A-label approach is a good idea, though libidn2 (IDNA2008 + TR46) is pretty widespread these days and utf-8 -> a-label conversion can easily be done. Just want to mention that.

I guess no work is needed for libpsl (accepting A-labels in the PSL), but I will definitely check in the next few days and report back.
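A pre-processing pass of the kind mentioned above (UTF-8 rules in, A-label rules out) might look roughly like the following. This is a hedged sketch, not how publicsuffix-go or libpsl actually do it: the helper names and sample lines are hypothetical, it uses Python's raw `punycode` codec rather than an IDNA2008 / UTS #46 library such as libidn2, and it assumes the non-ASCII rules in the list are already lower-case and normalized.

```python
def to_a_label(label):
    """Encode one label as ASCII; '*' and plain ASCII pass through."""
    if label.isascii():
        return label.lower()
    return "xn--" + label.encode("punycode").decode("ascii")


def ascii_rules(lines):
    """Yield rules in ASCII form, skipping blank lines and // comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        rule = line.split()[0]  # a rule ends at the first whitespace
        prefix = "!" if rule.startswith("!") else ""
        body = rule[len(prefix):]
        yield prefix + ".".join(to_a_label(l) for l in body.split("."))


# Hypothetical sample lines in the style of the list file.
SAMPLE_LINES = ["// example section", "", "com", "*.ck", "!www.ck", "公司.cn"]
for rule in ascii_rules(SAMPLE_LINES):
    print(rule)
```

Keeping the `!` and `*` markers outside the encoding step means the converted rules can be fed to the same matching logic as before.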

@annevk
Author

annevk commented Feb 17, 2018

I think using ASCII for the list would be a good first step. That's what all implementations have to convert to internally anyway.

But I still think a more formalized processing standard would be good to have, with some algorithms and terms that specifications can easily link to (and perhaps some examples of how to do so). E.g., the HTML Standard has a rather hand-wavy PSL step at https://html.spec.whatwg.org/multipage/origin.html#is-a-registrable-domain-suffix-of-or-is-equal-to, for which it would be great if we could just link to an algorithm directly.

@annevk
Author

annevk commented Feb 17, 2018

(Now that the site has moved to GitHub I might be able to help with formalizing the algorithm a bit, if that's agreeable.)

@annevk
Author

annevk commented Apr 26, 2020

The URL Standard abstracts this now in https://url.spec.whatwg.org/#host-public-suffix (to be further clarified by whatwg/url#484). That's probably good enough.

@annevk annevk closed this as completed Apr 26, 2020