The spec doesn't seem clear on how to handle "incomplete" hostnames #694

rushmorem · 2018-07-19T10:27:41Z

By "incomplete" hostname, I mean a hostname that's entirely part of some rule or rules but does not have enough labels to match the rule or rules entirely. Examples of such hostnames are yokohama.jp and kobe.jp.

The relevant rules for those hostnames are:-

jp
*.kobe.jp
*.yokohama.jp
!city.yokohama.jp

I have seen these two interpreted differently by at least two implementations, and I understand how it can go either way. libpsl returns the public suffixes for those domains as yokohama.jp and kobe.jp respectively. Servo's net_traits crate, however, returns jp for both, which leads to weird test cases like these.
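
To make the ambiguity concrete, here is a minimal Python sketch of label-wise rule matching against the four rules above. The names `rule_matches` and `matching_rules` are hypothetical and are not taken from libpsl or net_traits; the sketch only assumes that a rule matches when every one of its labels lines up with the hostname, compared right to left.

```python
RULES = ["jp", "*.kobe.jp", "*.yokohama.jp", "!city.yokohama.jp"]


def rule_matches(rule, host):
    """Compare labels right to left; every label of the rule must be matched."""
    rule_labels = rule.lstrip("!").split(".")
    host_labels = host.split(".")
    if len(rule_labels) > len(host_labels):
        return False  # the "incomplete" case: host has fewer labels than the rule
    pairs = zip(reversed(rule_labels), reversed(host_labels))
    return all(r == "*" or r == h for r, h in pairs)


def matching_rules(host):
    """List every rule (hypothetical helper) that matches the whole of `host`."""
    return [rule for rule in RULES if rule_matches(rule, host)]


for host in ["kobe.jp", "yokohama.jp", "foo.kobe.jp", "city.yokohama.jp"]:
    print(host, "->", matching_rules(host))
```

Under this literal reading only `jp` matches kobe.jp and yokohama.jp, which corresponds to the `jp` answer. Returning kobe.jp instead requires additionally treating the parent of a wildcard rule as a suffix.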

What's the official position on this?


sleevi · 2018-07-20T12:50:19Z

@rushmorem Thanks for opening this. This is the wildcard problem originally captured at https://bugzilla.mozilla.org/show_bug.cgi?id=1124625#c6 and more broadly documented at https://wiki.mozilla.org/Public_Suffix_List/platform.sh_Problem

rushmorem · 2018-07-20T14:23:51Z

Thanks @sleevi. That clarifies it. I'm rewriting my Rust implementation, so I wanted to know the correct way to handle this. I think adding these to the official test case would help iron out the differences in implementations. What do you think, should I submit a pull request?

sleevi · 2018-07-20T17:36:05Z

I’m not sure about the pull request. The answer for “which is correct” hasn’t quite been resolved yet across implementations, nor do we know which “should” be correct.


rushmorem · 2018-07-20T19:23:02Z

According to the wiki you linked to:

If we follow the defined PSL algorithm, the above rules should result in the following determinations:
 get_public_suffix(foo.bar.platform.sh) == "bar.platform.sh"
 get_public_suffix(bar.platform.sh) == "bar.platform.sh"
 get_public_suffix(platform.sh) == "sh"
 get_public_suffix(sh) == "sh"

So I thought this was already decided. In any case, I think the spec should clear this up one way or another.
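
For reference, the four determinations quoted above can be reproduced with a short sketch of the literal algorithm. It assumes the rule set is `sh` plus `*.platform.sh` (the rules themselves are not quoted in this thread), and `public_suffix` is a hypothetical helper, not any library's API.

```python
RULES = ["sh", "*.platform.sh"]  # assumed rule set for the wiki example


def public_suffix(host, rules=RULES):
    """Literal reading: a rule only matches hosts with at least as many labels."""
    labels = host.split(".")
    matches = []  # (is_exception, rule_labels) for every matching rule
    for rule in rules:
        rule_labels = rule.lstrip("!").split(".")
        if len(rule_labels) > len(labels):
            continue
        pairs = zip(reversed(rule_labels), reversed(labels))
        if all(r == "*" or r == h for r, h in pairs):
            matches.append((rule.startswith("!"), rule_labels))
    if not matches:
        return labels[-1]  # no rule matched: the implicit "*" rule prevails
    exceptions = [m for m in matches if m[0]]
    if exceptions:
        # An exception rule prevails and loses its leftmost label.
        count = len(exceptions[0][1]) - 1
    else:
        # Otherwise the matching rule with the most labels prevails.
        count = max(len(m[1]) for m in matches)
    return ".".join(labels[-count:])


for host in ["foo.bar.platform.sh", "bar.platform.sh", "platform.sh", "sh"]:
    print(host, "->", public_suffix(host))
```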

peterthomassen · 2019-06-11T00:51:49Z

In the following, when I say "loose interpretation", I mean the one where the rule *.platform.sh implies that platform.sh is a public suffix.

On the other hand, there's the "strict interpretation" which takes the current rules literally, such that the rule *.platform.sh does not make a statement about whether platform.sh is a public suffix. Strict interpretation of the current rules gives that the public suffix of platform.sh is sh.

Let's assume that a client has access to a function to look up the public suffix using strict interpretation. If the client is interested in the loose interpretation, it can first look up the public suffix for platform.sh, and if the result is not the same as the query (i.e. it is sh, not platform.sh), then the client can query *.platform.sh to see if the wildcard exists and, if so, draw its conclusions (e.g. block cookies on platform.sh).
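
A minimal sketch of this two-step lookup, assuming a strict lookup function is available. Both `strict_public_suffix` and `loose_is_public_suffix` are hypothetical names written for this illustration, and the sketch assumes the strict function accepts a literal `*` label in the query, which a real library may reject as an invalid hostname.

```python
def strict_public_suffix(host, rules):
    """Literal reading of the algorithm (stand-in for an existing strict library)."""
    labels = host.split(".")
    matches = []
    for rule in rules:
        rule_labels = rule.lstrip("!").split(".")
        if len(rule_labels) > len(labels):
            continue
        pairs = zip(reversed(rule_labels), reversed(labels))
        if all(r == "*" or r == h for r, h in pairs):
            matches.append((rule.startswith("!"), rule_labels))
    if not matches:
        return labels[-1]  # implicit "*" rule
    exceptions = [m for m in matches if m[0]]
    if exceptions:
        count = len(exceptions[0][1]) - 1
    else:
        count = max(len(m[1]) for m in matches)
    return ".".join(labels[-count:])


def loose_is_public_suffix(host, rules):
    """Two-step lookup: strict answer first, then probe for a wildcard child rule."""
    if strict_public_suffix(host, rules) == host:
        return True  # already a public suffix under the strict reading
    probe = "*." + host
    # If the probe comes back unchanged, a "*.<host>" rule matched it.
    return strict_public_suffix(probe, rules) == probe


rules = ["sh", "*.platform.sh"]
print(loose_is_public_suffix("platform.sh", rules))
print(loose_is_public_suffix("example.sh", rules))
```

The client keeps both pieces of information: the strict answer for platform.sh and whether a wildcard rule sits directly below it.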

If the lookup function implements loose lookup, then the client's ability to determine whether platform.sh itself is on the PSL is lost entirely.

The strict interpretation (= literal interpretation of the current algorithm) therefore gives greater flexibility to the client, without the list showing prejudice regarding what the use case will be. I think it's a good thing for the list to not make assumptions about the use case.

Based on the documents linked here, the Chrome implementation appears to follow the loose interpretation. One solution for the problem could be to define the algorithm as strict, with Chrome (implicitly) adhering to the "two-tiered lookup" described above. This is equivalent to the loose interpretation, and the contradiction is removed.

(In the case where kobe.jp should be considered a public suffix by all clients, it could be added to the PSL explicitly. I am aware of Firefox's implementation issues, but I would think that could be coordinated, especially if the alternative would be to pay the price of losing the algorithm's generality.)

sleevi · 2019-06-11T10:05:35Z

On Tue, Jun 11, 2019 at 3:51 AM Peter Thomassen ***@***.***> wrote:

> If the lookup function implements loose lookup, then the client's ability to determine whether platform.sh itself is on the PSL is lost entirely.

I think that is making assumptions about the service that aren’t specified. The service could implement the loose lookup itself and return appropriate results - not allowing for “holes” in the namespace.

> The strict interpretation (= literal interpretation of the current algorithm) therefore gives greater flexibility to the client, without the list showing prejudice regarding what the use case will be. I think it's a good thing for the list to not make assumptions about the use case.

I don’t agree with this being good. I think this would be very bad. Can you explain more why you think it would be good?

peterthomassen · 2020-03-03T22:47:04Z

> > If the lookup function implements loose lookup, then the client's ability to determine whether platform.sh itself is on the PSL is lost entirely.
>
> I think that is making assumptions about the service that aren’t specified. The service could implement the loose lookup itself and return appropriate results - not allowing for “holes” in the namespace.

There is no assumption about the service here. In my original post, I wrote:

> Let's assume that a client has access to a function to look up the public suffix using strict interpretation.

This is the assumption that a PSL client, library, or other implementation may already exist that outputs the public suffix according to the strict interpretation. It is not an assumption about the PSL service, but about the existence of such implementations. Actually, it's a fact: I know of at least one implementation that works this way, and @rushmorem said something similar in the initial post.

If the meaning of the *.platform.sh rule is relaxed to, by definition, imply that platform.sh is a public suffix, then such existing implementations will break (= their behavior changes), and they would need to be fixed.

On the other hand, the strict algorithm allows emulating the loose interpretation by first getting the public suffix of platform.sh (which turns out to be sh), and then getting the public suffix of *.platform.sh (which turns out to be *.platform.sh itself). Thus, with the current (strict) definition of the algorithm, implementations are free to implement the loose interpretation without requiring any changes in the PSL nor in other existing implementations. (This does not even require adding kobe.jp and friends to the PSL; the implementation can decide by itself!)

(Arguably, this is what Chrome does implicitly already, according to the Mozilla Wiki article -- maybe not with the two-step approach, but nevertheless, the implementation has chosen to interpret the PSL like this, and could continue doing so even if the algorithm's definition was clarified to mean the strict interpretation: In this case, even if Chrome decided to migrate to a new, strictly compliant PSL library, the two-step approach described in the previous paragraph would recover the loose interpretation's behavior, resulting in no change as far as Chrome's use case is concerned.)

The converse is not true: If the algorithm were changed to follow the loose interpretation, so that a wildcard rule's parent is always a public suffix as well (barring an exception rule), then that would reduce flexibility in the sense that implementations could no longer decide which interpretation they want to implement. All implementations would follow the loose interpretation (permanently breaking pre-existing implementations that relied, say, on the non-publicness of kobe.jp). If an exception from the loose rule is required, such as in the platform.sh cookie policy case, adding !platform.sh to the PSL would explicitly exempt platform.sh from the loose interpretation; all implementations would then consider platform.sh non-public.
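
As a rough sketch of what the loose reading adds at the rule-set level, the following computes which parents of wildcard rules would become public suffixes, and how an exception rule removes one of them. `implied_suffixes` is a hypothetical helper, and `!platform.sh` is the hypothetical rule from the paragraph above; the sketch assumes such an exception only cancels the implied parent and says nothing about how it would interact with longer hostnames.

```python
def implied_suffixes(rules):
    """Parents of wildcard rules that the loose reading treats as public suffixes.

    Assumption: an exception rule naming the parent ("!parent") cancels it.
    """
    rule_set = set(rules)
    return {
        rule[2:]
        for rule in rule_set
        if rule.startswith("*.") and "!" + rule[2:] not in rule_set
    }


print(implied_suffixes(["sh", "*.platform.sh"]))
print(implied_suffixes(["sh", "*.platform.sh", "!platform.sh"]))
print(sorted(implied_suffixes(["jp", "*.kobe.jp", "*.yokohama.jp", "!city.yokohama.jp"])))
```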

Now, in turn, it is unclear why that would be desirable, as it removes the choice on implementation level. There may be use cases where the strict interpretation is preferable, and those would be thwarted by imposing the loose one.

This stems from the fact that if rules denote public-suffix policy not only for domain names with the same number of labels as the rule, but also for domains with fewer labels than the rule, the level of granularity is reduced. In the strict interpretation, granularity is higher. As a result, one can retrieve all "loose statements" from "strict rules" (you may have to check the *. child rule), while the inverse is not true.
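
The loss of granularity can be illustrated with two rule sets that differ only in whether platform.sh is listed explicitly. This is a sketch under assumptions: `strict_suffix` and `loose_suffix` are hypothetical helpers, and the loose reading is modeled simply as "strict lookup over the rules plus every wildcard rule's parent", ignoring exception rules.

```python
def strict_suffix(host, rules):
    """Literal reading: the longest fully matching rule wins (no exceptions here)."""
    labels = host.split(".")
    best = 1  # implicit "*" rule: fall back to the rightmost label
    for rule in rules:
        rule_labels = rule.split(".")
        if len(rule_labels) > len(labels):
            continue
        pairs = zip(reversed(rule_labels), reversed(labels))
        if all(r == "*" or r == h for r, h in pairs):
            best = max(best, len(rule_labels))
    return ".".join(labels[-best:])


def loose_suffix(host, rules):
    """Loose reading modeled as: every wildcard rule also implies its parent."""
    implied = [rule[2:] for rule in rules if rule.startswith("*.")]
    return strict_suffix(host, rules + implied)


WITHOUT_PARENT = ["sh", "*.platform.sh"]
WITH_PARENT = ["sh", "platform.sh", "*.platform.sh"]
HOSTS = ["sh", "platform.sh", "bar.platform.sh", "foo.bar.platform.sh"]

strict_differs = [h for h in HOSTS
                  if strict_suffix(h, WITHOUT_PARENT) != strict_suffix(h, WITH_PARENT)]
loose_differs = [h for h in HOSTS
                 if loose_suffix(h, WITHOUT_PARENT) != loose_suffix(h, WITH_PARENT)]

print("strict lookup tells the rule sets apart at:", strict_differs)
print("loose lookup tells the rule sets apart at:", loose_differs)
```

With a strict lookup the two rule sets give different answers for platform.sh; with the loose lookup they are indistinguishable for these hostnames.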

So, defining the algorithm by the loose interpretation has the following cons:

Breaks pre-existing implementations
Requires adding !platform.sh to the PSL
Removes the possibility of choosing the interpretation depending on the application's use case

On the other hand, the strict interpretation does not have these downsides, while allowing for either use case: With the strict interpretation, you actually get (the possibility to have) both.

> > The strict interpretation (= literal interpretation of the current algorithm) therefore gives greater flexibility to the client, without the list showing prejudice regarding what the use case will be. I think it's a good thing for the list to not make assumptions about the use case.
>
> I don’t agree with this being good. I think this would be very bad. Can you explain more why you think it would be good?

It would be good because of the above reasons. Why would it be bad?

ko-zu · 2024-06-20T01:27:23Z

As I commented in #1986, the conflicting rules between the wiki and the test case/linters should be resolved. I believe the test case and linter are correct, supported by implementations and intended use cases.

The existing implementations that do not follow the test case should not be a reason to leave the conflicting rules in this repository.
If someone needs to use a definition from a specific revision of the wiki, they can choose such an implementation regardless of what the current rule is. Clarifying which rule the PSL follows does not prevent users from using any rules or any implementations.

I believe it would be better to have one self-consistent rule and put a notice about the possible differences between implementations instead.

dnsguru · 2024-06-30T02:29:33Z

Couple things at play here. Sometimes both at once, but it is often one or the other.

1] Epochs / Legacy entries
We sometimes have something I call "standardization drift epochs", where specs get made more precise but legacy entries remain that were put in place before the precision was added.

2] What we proclaim vs implementation choices (aka "Browsers are gonna do what browsers are gonna do")
Essentially, we attempt to document what happens, but different parties who incorporate or use the file are making their own choices about what they will do.

In some applications, the loose interpretation is adequate. In others the strict is much wiser.

This is ultimately just a text file.

dnsguru mentioned this issue Apr 9, 2020: Could we use project board method to separate list update PRs from administrata #1008 (Closed)

sleevi mentioned this issue Apr 14, 2021: Clarify the use of a leading wildcard #1281 (Closed)

sleevi mentioned this issue Sep 2, 2021: add diher.solutions and rss.my.id to the list #1393 (Merged)

simon-friedberger mentioned this issue Jun 14, 2024: Incorrect PSL evaluation rules in the wiki regarding implicit wildcard rules #1989 (Closed)

jhnns mentioned this issue Aug 11, 2024: Incorrect parsing of wildcard rule peerigon/parse-domain#159 (Open)

elliotwutingfeng mentioned this issue Sep 7, 2024: Default wildcard rule missing from the PSL algorithm implementation john-kurkowski/tldextract#338 (Open)

sebastian-nagel mentioned this issue Oct 31, 2024: [Domains] EffectiveTldFinder to also take shorter suffix matches into account crawler-commons/crawler-commons#479 (Open)
