Needs a processing model #27
Comments
There already is - see "Formal Algorithm" in https://publicsuffix.org/list/ |
Ah, thank you. That could be a lot clearer, though that is more than I thought we had. Also, it only mentions Punycode. What about the other requirements of UTS #46? |
With the list being in UTF-8, it shouldn't really matter, should it? All that matters is that both list and entry follow the same canonicalization scheme. The mention of Punycode in the algorithm is just so that we can talk in terms of characters not codepoints. |
UTS #46 is not about text encodings. It's about how to map IDNA to ASCII and vice versa, which depending on the input can be much more involved than just Punycode. E.g., Punycode doesn't take care of |
Sure, but xn-- is still a form of encoding. The point was that it's largely irrelevant: each label is matched independently, and all that is needed is to be able to match or not match an encoded rule. The core point is just that Punycode (or any encoding, technically) collapses labels and rules the same way, so that we can determine whether a label matches a rule. |
Not really, if you just use Punycode "EXAMPLE.COM" does not match "example.com", for instance. |
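A minimal sketch of the point above, using only Python's standard-library `punycode` codec. The helper name `raw_punycode` is hypothetical and used for illustration only. It shows that the bare Punycode transformation preserves the case of ASCII characters, so a separate lower-casing (or mapping) step is needed before labels can be compared.

```python
def raw_punycode(label: str) -> bytes:
    """Apply only the Punycode transformation, with no case folding or mapping."""
    return label.encode("punycode")


upper = raw_punycode("EXAMPLE")
lower = raw_punycode("example")

# Punycode alone keeps ASCII case, so the two encodings differ.
differs = upper != lower

# Lower-casing before encoding makes the two labels compare equal.
folded_equal = raw_punycode("EXAMPLE".lower()) == lower
```

Whether plain lower-casing is a sufficient mapping step is exactly what the rest of the thread disputes.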
From the link:
|
What does lowercase mean? E.g., how do you handle ß? |
(What I'm saying is that short description is not the normal way. I'm not sure why you keep saying otherwise.) |
I'm not sure why you feel that it's ambiguous - it's worded prosaically, such that "lower-case" is a modifier for Punycode. Written less prosaically, but entirely consistently with the grammar of the existing page: Canonicalize a hostname:
There was an intentional decision not to write things as a spec (e.g. you could argue that conversion to lower case is ambiguous and you need to define how each ASCII codepoint is mapped to its lower-case canonical counterpart, or you could invoke the RFC 1034 rules for comparing hostnames in a case-insensitive manner, or any number of things), but you also have to ask yourself why you are doing it. Right now, the algorithm itself was expanded from even less prose, as the goal was to make it simple to understand at a glance. |
It's not ambiguous, it's wrong. These are not the rules for domains browsers implement. You did not answer what you do with ß, for instance. |
Anne, it sounds like we are talking past each other. The PSL is used in a large number of contexts outside browsers, so there is no need (nor would it be correct) to take a browser-centric view. The text is correct for browsers, inasmuch as you canonicalize a rule or a host the same way you do a domain name (for transmission). While that may vary from UA to UA and PSL consumer to PSL consumer, that variance in canonicalization does not affect the algorithm or the results. The "lower-case, Punycode" wording is illustrative and not normative to these ends; a consistent canonicalization is all that is needed. As to your point about ß, I was trying to establish how you're wrong for thinking it is an issue. No implementation will send that over the wire: either IDNs are not supported (LDH rule) or they apply some transformation, such that ß is collapsed to some ASCII sequence conforming to the LDH rule. THAT sequence is then lower-cased. Your hypothetical ambiguity regarding the case folding of that is not due to ambiguity in the grammar, but a failure to read the existing steps, which, while unfortunate, can happen no matter what form the steps take (prosaic vs. spec). That doesn't mean it is ambiguous - it just means you're wrong :) While I had hoped it was obvious in my previous explanation without needing to highlight the (intentional?) misreading, that hopefully restates how to parse that sentence, both standalone and in the context from which it was removed. I also wanted to highlight the past discussions around the desired level of formality of the algorithm, and why the prosaic form was chosen, as well as reiterating (as I have several times) why any perceived ambiguity of that sentence is inconsequential to the processing of the algorithm. |
I didn't mean to take a browser-centric view, I was just giving an example. I think all user agents should have the same domain processing. Anyway, I guess I understand your POV, but this makes this not as useful as a reference as I'd hoped. But it seems reasonable enough to build something on top of this which can be used by HTML, Storage, URL, etc. |
BTW, we have had an uppercase ß since 2010 (ẞ). |
@rockdaboot Yes, but the point was that it doesn't matter for the PSL :) Whatever canonicalization is employed for domain host names, the same is applied for the list, before processing. That should resolve all such issues of ambiguity related to non-ASCII charsets, and the DNS host name comparison rules (e.g. case insensitive ASCII) apply for matching. |
The point about "ß" is that it's commonly normalized to "ss". Not about uppercase/lowercase. |
@annevk I guess you are talking about IDNA2003 where ß is normalized to ss. This is a long known potential security problem. That's why we have IDNA2008 (plus UTS #46 extension). Software should move away from IDNA2003. See http://unicode.org/reports/tr46/ (Section 1.3.2 has some explanations/examples). |
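A minimal sketch of the "ß" behaviour discussed above. It relies on the fact that Python's standard-library `idna` codec implements IDNA2003 (RFC 3490 with Nameprep), so it maps "ß" to "ss". The second helper applies bare Punycode with no mapping, which is an assumption standing in for processing that keeps "ß" as a distinct character (as IDNA2008 does); it is not a real IDNA2008 or UTS #46 implementation. Both helper names are hypothetical.

```python
def to_ascii_idna2003(label: str) -> bytes:
    """IDNA2003 ToASCII via the stdlib codec; Nameprep maps 'ß' to 'ss'."""
    return label.encode("idna")


def to_ascii_unmapped(label: str) -> bytes:
    """Bare Punycode plus the ACE prefix, with no mapping step (illustration only)."""
    return b"xn--" + label.encode("punycode")


mapped = to_ascii_idna2003("straße")    # b"strasse": 'ß' has been folded away
unmapped = to_ascii_unmapped("straße")  # an xn-- label that still encodes 'ß'

# The two processing models produce different wire forms for the same input.
models_disagree = mapped != unmapped
```

The two results name different domains, which is why the list and the hostnames being looked up have to be run through the same processing.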
Yes, I'm well aware of all that. (Though I disagree with your assessment.) |
@annevk do you have any specific actionable requests to add here? If not, I will close this ticket. I have already taken note that the use of UTF-8 here is an issue, and I am considering converting the list to A-labels at some point, along with some other possible improvements. Personally, I already pre-process the list into ASCII in publicsuffix-go, and I will probably implement the same change in the Ruby implementation. Please let me know if there are any actionable requests I should take note of here. |
The A-label approach is a good idea, though libidn2 (IDNA2008 + TR46) is pretty widespread these days and UTF-8 -> A-label conversion can easily be done. Just want to mention that. I guess no work is needed for libpsl (accepting A-labels in the PSL), but I will definitely check in the next few days and report back. |
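A minimal sketch of pre-processing a UTF-8 rule into A-labels, as discussed above. It is not how publicsuffix-go, the Ruby implementation, or libpsl do it. It assumes Python's standard-library `idna` codec (IDNA2003) as a stand-in for whichever IDNA processing a consumer actually uses, and the function name `rule_to_alabels` is hypothetical. The `!` exception prefix and `*` wildcard label are passed through untouched.

```python
def rule_to_alabels(rule: str) -> str:
    """Convert one PSL rule to ASCII, label by label, keeping '!' and '*' as-is."""
    prefix = "!" if rule.startswith("!") else ""
    labels = rule[len(prefix):].split(".")
    encoded = [
        label if label == "*" else label.encode("idna").decode("ascii")
        for label in labels
    ]
    return prefix + ".".join(encoded)


ascii_rules = [rule_to_alabels(r) for r in ["com", "*.ck", "!www.ck", "公司.cn"]]
```

Hostnames being looked up would need to go through the same conversion so that rule and host are compared in the same form.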
I think using ASCII for the list would be a good first step. That's what all implementations have to convert to internally anyway. But I still think a more formalized processing standard would be good to have, with algorithms and terms that specifications can easily link to (and perhaps some examples of how to do so). E.g., the HTML Standard has a rather hand-wavy PSL step at https://html.spec.whatwg.org/multipage/origin.html#is-a-registrable-domain-suffix-of-or-is-equal-to for which it would be great if we could just link to an algorithm directly. |
(Now that the site has moved to GitHub I might be able to help with formalizing the algorithm a bit, if that's agreeable.) |
The URL Standard abstracts this now in https://url.spec.whatwg.org/#host-public-suffix (to be further clarified by whatwg/url#484). That's probably good enough. |
At the moment we have just a list, but no defined processing model for that list. Without such a defined processing model, it's impossible for standards to be accurate in their requirements.
E.g., HTML has "If new value matches a suffix in the Public Suffix List", but neither "matches" nor "suffix" is defined. And Public Suffix List is an opaque blob of data.
We could define this externally, e.g., some suggested it to be defined as part of the URL Standard: https://www.w3.org/Bugs/Public/show_bug.cgi?id=25865. However, it seems better to define this model at the source, no?
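One possible reading of "matches a suffix", sketched loosely after the "Formal Algorithm" on https://publicsuffix.org/list/ mentioned in the comments. This is a sketch under stated assumptions, not a normative processing model: the rule set is a tiny hypothetical sample, both rules and input are assumed to be ASCII (A-labels) already, and the function name `public_suffix` is made up for illustration.

```python
def public_suffix(domain: str, rules: set[str]) -> str:
    """Return the public suffix of `domain` under the given rule set."""
    labels = domain.lower().rstrip(".").split(".")
    best = 1  # implicit default rule "*": the last label alone
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if "!" + candidate in rules:
            # Exception rules prevail; the suffix is the rule minus its leftmost label.
            return ".".join(labels[i + 1:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if candidate in rules or wildcard in rules:
            # Otherwise the matching rule with the most labels wins.
            best = max(best, len(labels) - i)
    return ".".join(labels[-best:])


# Hypothetical sample rules, for illustration only.
SAMPLE_RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}

suffix = public_suffix("a.b.co.uk", SAMPLE_RULES)
```

Under this reading, the registrable domain is the public suffix plus one more label to its left, which is the notion that a spec-level definition would need to pin down.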