-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make get_tld return only eTLDs in the publicsuffix list #19
base: develop
Are you sure you want to change the base?
Conversation
assert 'jp' == psl.get_tld('kobe.jp') | ||
assert 'com' == psl.get_tld('amazonaws.com') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neither kobe.jp
nor amazonaws.com
is in https://publicsuffix.org/list/public_suffix_list.dat .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reordered for clarity...
*.kobe.jp was in the psl at some point, as are several subdomains of amazon.aws.
so these are good examples of the behavior you want to fix.
the public suffix list does change regularly, so we have to take some care that we aren't dependent on creating tests that depend on something we don't control. the behavior should be independent of the actual PSL. Also, as we noted in Vadym's pull request, some of the unit tests on the mozilla site were faulty (didn't follow the stated algorithm).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our tests shouldn't be reliant on moving data. we should use a fixed test file.
otherwise, agree... unless the wildcards are turned off, and then these become entries in the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our tests shouldn't be reliant on moving data. we should use a fixed test file.
I perfectly agree with you. However, we should create another issue to solve that problem.
I found that we already use static data for testing.
tests.py
Outdated
assert 'com.pg' == psl.get_tld('telinet.com.pg', wildcard=True) | ||
assert 'pg' == psl.get_tld('telinet.com.pg', wildcard=False) | ||
assert None == psl.get_tld('telinet.com.pg', wildcard=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't use *.pg
when we specify wildcard=False
. Moreover, publicsuffix list doesn't contain pg
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure i agree here. certain companies remove the wildcards in the PSL so they can reduce the variety of output. if *.pg is in the list, which it usually is, then, when they remove the wildcard, pg will be in the list. It doesn't remove the entry itself, just the wildcard.
we could argue that users shouldn't use the list this way (but they do), and i don't think we want to have a situation where a user puts in a legitimate domain and gets None for the tld. In this particular case, e.g., 'pg' is a real TLD, so it should never return None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks to this, I now understand the semantics of wildcard=False
. I fix the behavior on c6575d5 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for looking through this. i made comments via the wildcard to help explain. if the README isn't clear enough, we can add a few more lines.
Primary concern is on wildcard use and that we should never return a None TLD when the domain contains a legit one.
tests.py
Outdated
@@ -212,8 +212,8 @@ def test_get_sld_exceptions(self): | |||
def test_get_sld_no_wildcard(self): | |||
psl = publicsuffix.PublicSuffixList() | |||
# test completion when no wildcards should be processed | |||
assert 'com.pg' == psl.get_sld('telinet.com.pg', wildcard=False) | |||
expected = 'ap-southeast-1.elb.amazonaws.com' | |||
assert 'pg' == psl.get_sld('telinet.com.pg', wildcard=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't agree with this one. when no wildcards are used (or totally regardless of the PSL at all), pg is a TLD. and so the sld is going to be one level up, i.e., 'com.pg'. It's never the case that 'pg' is an SLD.
tests.py
Outdated
assert 'com.pg' == psl.get_sld('telinet.com.pg', wildcard=False) | ||
expected = 'ap-southeast-1.elb.amazonaws.com' | ||
assert 'pg' == psl.get_sld('telinet.com.pg', wildcard=False) | ||
expected = 'amazonaws.com' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this test was made when '*.amazonaws.com' was in the PSL -- so like Vadym's tests we should probably use a set list that isn't changing. In that case, same logic would happen for this. the wildcard goes away, and the item is processed as a suffix. I guess the readme isn't clear enough on this?
anyhow, currently this is not wildcarded in the PSL so it looks wrong, but the test should be on a fixed example input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked and it seems that *amazonaws.com
has never been on the publicsuffix list before.
$ git remote -v
origin ssh://git@github.com/publicsuffix/list.git (fetch)
origin ssh://git@github.com/publicsuffix/list.git (push)
$ git log -S'*.amazonaws.com' -p
$
assert 'jp' == psl.get_tld('kobe.jp') | ||
assert 'com' == psl.get_tld('amazonaws.com') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our tests shouldn't be reliant on moving data. we should use a fixed test file.
otherwise, agree... unless the wildcards are turned off, and then these become entries in the list.
tests.py
Outdated
assert 'com.pg' == psl.get_tld('com.pg', wildcard=True) | ||
assert 'pg' == psl.get_tld('com.pg', wildcard=False) | ||
assert None == psl.get_tld('com.pg', wildcard=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should never return none for a valid TLD.
@hiratara I'm sorry somehow I lost track of this -- and a pandemic. I'm really sorry. Happy to check again. I checked in today because I realized that the code is not handling the private suffixes as expected... and we have tests for them so I'm super confused. Do you want to revisit and merge this PR? there is another one on Nov 3rd from someone else. The bug i see is that we should have
as the documentation and tests show, but instead we have
|
Hello @KnitCode,
Unfortunately, it has always behaved that way. I tried to check out https://github.com/nexB/python-publicsuffix2/releases/tag/release-2.2019-08-12 (before I contributed this repository) and found that I also tried 9cd5417, the commit on which the document was added, but I got the same result. I think the documentation is wrong. I understand it is a bug, but it is not related this PR. We should open a new issue. |
@hiratara oh interesting. I checked my old old versions as well. I think i wrote that part of the documentation, so it must have been a mistake in the documentation. I would like to see it return I think we need @pombredanne to merge anything. |
See: #18
To solve the problem, I initialize the trie with
-1
.-1
means that there are no definitions in the publicsuffix list.This is WIP PR because I don't really understand the exact semantics of
wildcard=False
.