Clarification of uk.com test #864

Closed
Shardj opened this issue Aug 14, 2019 · 12 comments

Comments

@Shardj

Shardj commented Aug 14, 2019

The test checkPublicSuffix('uk.com', null); seems like an interesting one. If I follow the algorithm provided at https://publicsuffix.org/list/ (pasted below for convenience), I come to the conclusion that the suffix for uk.com is actually com and that it shouldn't be null.

  1. Match domain against all rules and take note of the matching ones.
    
  2. If no rules match, the prevailing rule is "*".
    
  3. If more than one rule matches, the prevailing rule is the one which is an exception rule.
    
  4. If there is no matching exception rule, the prevailing rule is the one with the most labels.
    
  5. If the prevailing rule is an exception rule, modify it by removing the leftmost label.
    
  6. The public suffix is the set of labels from the domain which match the labels of the prevailing rule, using the matching algorithm above.
    
  7. The registered or registrable domain is the public suffix plus one additional label.
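
The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the reference implementation: RULES is a tiny hand-picked rule set, and the names rule_matches, public_suffix and registrable_domain are hypothetical. It also assumes, as the published test data appears to do, that the second argument of checkPublicSuffix is the expected registrable domain rather than the suffix itself.

```python
# Tiny illustrative rule set (assumption); the real list is far larger.
RULES = ["com", "uk.com", "*.ck", "!www.ck"]


def rule_matches(rule_labels, domain_labels):
    """Compare labels right to left; '*' matches any single label."""
    if len(rule_labels) > len(domain_labels):
        return False
    return all(r == "*" or r == d
               for r, d in zip(reversed(rule_labels), reversed(domain_labels)))


def public_suffix(domain, rules=RULES):
    labels = domain.lower().split(".")
    matching = []
    for rule in rules:                                   # step 1
        is_exception = rule.startswith("!")
        rule_labels = rule.lstrip("!").split(".")
        if rule_matches(rule_labels, labels):
            matching.append((is_exception, rule_labels))
    exceptions = [r for is_exc, r in matching if is_exc]
    if not matching:
        prevailing = ["*"]                               # step 2
    elif exceptions:
        prevailing = exceptions[0][1:]                   # steps 3 and 5
    else:
        prevailing = max((r for _, r in matching), key=len)  # step 4
    return ".".join(labels[len(labels) - len(prevailing):])  # step 6


def registrable_domain(domain, rules=RULES):
    labels = domain.lower().split(".")
    suffix_len = len(public_suffix(domain, rules).split("."))
    if len(labels) <= suffix_len:
        return None  # the domain is itself a public suffix: no extra label exists
    return ".".join(labels[-(suffix_len + 1):])          # step 7


print(public_suffix("uk.com"))            # uk.com
print(registrable_domain("uk.com"))       # None
print(registrable_domain("shop.uk.com"))  # shop.uk.com
```

Under these assumptions the suffix of uk.com is uk.com itself, and it is the registrable domain (suffix plus one label) that comes back as null.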
    

It seems odd that the test says it should be null; after all, https://uk.com is a perfectly real and valid website where uk.com is a valid domain. There seems to be an unspoken rule that the TLD list considers any domain that exactly matches a listed suffix to be invalid; however, many real-world examples don't reflect this rule:

  • nhs.uk
  • platform.sh
  • s3.amazonaws.com

Just like uk.com, each of these is a valid URL in its own right, but each is also a valid suffix. I've been told by someone before that cases like this aren't valid because "A public suffix can not be resolved"; however, I've found no evidence for this on the publicsuffix.org website.

Thanks for any help

Edit: please ignore my past self mistakenly thinking the algorithm in its current state should match uk.com to com instead of null. The issue raised here was simply trying to understand why the given domain examples couldn't have a suffix determined and instead got null back.

@sleevi
Contributor

sleevi commented Aug 14, 2019 via email

@Shardj
Author

Shardj commented Aug 19, 2019

Sorry, but I'm not sure what rules you're referring to. The only thing I raised was that I'm trying to understand why uk.com doesn't have the suffix com in the provided test. By "valid suffix" I meant that they're in the public suffix list, but in this case they're also real URLs.

Could you link me to somewhere I can read about this distinction, or clarify what you meant? If we're making distinctions between ICANN and private suffixes, though, is it helpful to add that nhs.uk is an ICANN suffix while uk.com is in the PRIVATE section?

@sleevi
Contributor

sleevi commented Aug 19, 2019

See "Divisions" at https://publicsuffix.org/list/

@Shardj
Author

Shardj commented Aug 20, 2019

That makes sense. I'm still not sure what you meant earlier, though; could you reword it?

@Shardj
Author

Shardj commented Aug 6, 2020

@sleevi it's been a year, so I figured I'd come back and see if I can get any further help with this. You said something about prohibiting the ICANN list when allowing the private list. In that case, are you saying that the unit test is correct to come back with null when using both the ICANN and private sections at the same time, but that in an actual implementation where you want to find the TLD of uk.com, you should resolve against the private list only if you fail to get a match against both sections at the same time?

Don't you then have the opposite problem with the URL s3.amazonaws.com, which will resolve as null against both lists at the same time and also against the private section by itself? It will resolve correctly against the ICANN section, though.

So to get the following unit tests passing, would this code be correct?

Unit tests:

checkPublicSuffix('uk.com', 'com');
checkPublicSuffix('s3.amazonaws.com', 'com');
checkPublicSuffix('platform.sh', 'sh');

Pseudo Code:

result = attemptResolveAgainstBothSections
if (result == null) {
  result = attemptResolveAgainstPrivate
}
if (result == null) {
  result = attemptResolveAgainstICANN
}
return result
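
To make the effect of the two sections concrete, here is a rough Python sketch of resolving the same name against different rule sets. It is an assumption-laden simplification: ICANN_RULES and PRIVATE_RULES are tiny illustrative excerpts, longest_suffix and registrable_domain are hypothetical helpers, and wildcard and exception rules are ignored entirely.

```python
# Illustrative excerpts only (assumption); not the real section contents.
ICANN_RULES = {"com", "uk"}
PRIVATE_RULES = {"uk.com"}


def longest_suffix(domain, rules):
    """Longest rule that is a label-wise suffix of the domain.

    Simplified: no wildcard or exception handling.
    """
    labels = domain.lower().split(".")
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in rules:
            return candidate
    return labels[-1]  # default "*" rule: the last label


def registrable_domain(domain, rules):
    labels = domain.lower().split(".")
    n = len(longest_suffix(domain, rules).split("."))
    # A name with no label beyond the suffix has no registrable domain.
    return ".".join(labels[-(n + 1):]) if len(labels) > n else None


both = ICANN_RULES | PRIVATE_RULES
print(registrable_domain("uk.com", both))              # None
print(registrable_domain("uk.com", ICANN_RULES))       # uk.com
print(registrable_domain("shop.uk.com", both))         # shop.uk.com
print(registrable_domain("shop.uk.com", ICANN_RULES))  # uk.com
```

In this sketch the choice of sections changes the answer for every name under uk.com, not just for uk.com itself, so a chain of fallbacks would mix results computed under different assumptions.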

@dnsguru
Member

dnsguru commented Aug 6, 2020

@Shardj

I've been told by someone before that cases like this aren't valid because "A public suffix can not be resolved", however I've found no evidence for this on the publicsuffix.org website.

Are you trying to check if 'someone' is right or wrong about this? PSL has zero to do with resolution, DNS does that.

PSL would impact cookie horizons, or other domain name logic within the application layer post-resolution.

In reviewing the three submissions above, each of them appears to be working as submitted by its respective operator.

It is unclear what the objective of this request is, as it does not appear to be attempting to fix what is not broken.

We're volunteers under extra burdens with the global pandemic situation. Can you provide a suggested remedy for the wording that bridges this understanding gap and makes it clearer, or can we close this please?

@Shardj
Author

Shardj commented Aug 6, 2020

No, that's just a quote from someone who was trying to explain to me that a public suffix cannot also be a valid domain. They were very insistent, which confused me a fair bit, as the example URLs I provided all resolve just fine. That's the only reason I mentioned it.

My use case is that a bunch of URLs go through our system and we want to know the root domain of each. To find it, we determine the domain's suffix; the root domain is then the suffix plus one more label on the left. Simply put, when URLs such as nhs.uk, uk.com, platform.sh and s3.amazonaws.com get into our system, it messes things up a bit because our code can't figure out the suffix: it comes back as null. This results in our system assuming they're dodgy domains and incorrectly throwing them out. The code for determining the domain suffix follows the algorithm laid out at https://publicsuffix.org/list/

The objective I'm trying to reach is to fix the above issue: to stop the aforementioned domains from being considered invalid by our system due to the inability to determine their suffix. To do that, I'm trying to understand why uk.com is supposed to have the suffix determined as null instead of com. After all, com is the real TLD of the uk.com domain if you resolve the URL. It's either a misunderstanding on my part as to how this algorithm should be used, or it's a flaw in the algorithm. Most likely the former, but that's why I'm here.

If you're too busy to address this at the moment feel free to leave it alone for a while and I'll check back another time.

@sleevi
Contributor

sleevi commented Aug 8, 2020

@Shardj Ultimately, how you use the list is up to you. If you're trying to determine the "root URL", you're probably doing something wrong, honestly. Some of that is covered at https://github.com/sleevi/psl-problems

The algorithm is correct for what it's returning, relative to cookies, and even then there are edge cases. This sounds like you're running into the same situation discussed at #91

I'm not sure who said a public suffix cannot be a valid domain, because that's not the case either, obviously.

@Shardj
Author

Shardj commented Aug 11, 2020

@sleevi Well the algorithm is described as "an algorithm for determining the Public Suffix of a domain" which is exactly what I want to use it for, since knowing the suffix means I also know the root domain. So if I'm using it wrong then I'm not sure what the right way is. Thanks for the link by the way, that's a very good read.

Yeah, it does seem like a similar issue to #91. Doesn't this just indicate that the algorithm is flawed, though? After all it's coming out with null values when the suffix should be easy enough to determine. Then again, I suppose from a 'cookie' use case, a null value isn't an issue since it simply indicates that no other domains can ever access cookies for the given domain. So I suppose my issue comes down to using the algorithm for something other than its originally intended purpose.

Although if the algorithm were modified by simply adding this step, "ignore the leftmost label of the domain for the following steps when matching rules against the domain", then we'd never run into this problem, as we could never end up matching the whole domain as the suffix. uk.com would have its suffix determined as com instead of null without issue.
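
A rough Python sketch of that proposed modification follows. It is not part of the published algorithm: RULES is a tiny illustrative rule set, suffix_ignoring_leftmost is a hypothetical name, and wildcard and exception rules are left out to keep the idea visible.

```python
# Illustrative rule set (assumption); the real list is far larger.
RULES = {"com", "uk.com"}


def suffix_ignoring_leftmost(domain, rules=RULES):
    """Match rules against everything except the leftmost label.

    Because the leftmost label is never considered, the whole domain
    can never be returned as its own suffix.
    """
    labels = domain.lower().split(".")
    candidates = labels[1:]  # drop the leftmost label before matching
    for start in range(len(candidates)):
        candidate = ".".join(candidates[start:])
        if candidate in rules:
            return candidate
    # Default "*" rule on what is left; None for a single-label name.
    return candidates[-1] if candidates else None


print(suffix_ignoring_leftmost("uk.com"))       # com
print(suffix_ignoring_leftmost("shop.uk.com"))  # uk.com
print(suffix_ignoring_leftmost("com"))          # None
```

The trade-off in this sketch is that uk.com is then treated like any ordinary name registered under com, which suits grouping URLs by owner but is a different question from the cookie-scoping one the list was built for.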

I'm happy to close this as I can understand now that the algorithm described on the public suffix site never needed to find the suffix of uk.com as there are no possible subdomains that a cookie could be shared with anyway, so it doesn't matter if it comes back with null or com either way.

@sleevi
Contributor

sleevi commented Aug 11, 2020

since knowing the suffix means I also know the root domain.

I don't think this means what you want. In fact, I'm not sure what you want, but that sounds wrong for most usages.

After all it's coming out with null values when the suffix should be easy enough to determine.

A null value is expected.

So I suppose my issue comes down to using the algorithm for something other than its originally intended purpose.

Yes, that's roughly where I was going with this.

uk.com would have its suffix determined as uk instead of null without issue.

Did you typo that? I don't believe any possible interpretation would be correct there.

@Shardj
Author

Shardj commented Aug 17, 2020

I don't think this means what you want

Finding the root domain of a domain by determining its suffix and adding on the next label from the left seems to be the only way, unless you go and resolve the URL, but that would be too slow. It's even described in the algorithm's final step: "The registered or registrable domain is the public suffix plus one additional label."

A null value is expected

Yeah, but my point is that null isn't the suffix, and determining the suffix is what the algorithm is understood to do. But we just come back to the fact that the algorithm doesn't need to do that in this case, since cookies from the given domain can't ever be shared with other domains, so it just spits out null. I think we're on the same page here.

Did you typo that?

Yeah, typo'd that; it should've said suffix determined as com. I'll edit the original.

@Shardj Shardj closed this as completed Aug 17, 2020
@Shardj
Author

Shardj commented Aug 17, 2020

Closing it because my original question was solved
