Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make get_tld return only eTLDs in the publicsuffix list #19

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

hiratara
Copy link
Contributor

See: #18

To solve the problem, I initialize the trie with -1. -1 means that there are no definitions in the publicsuffix list.

This is WIP PR because I don't really understand the exact semantics of wildcard=False .

Comment on lines +231 to +232
assert 'jp' == psl.get_tld('kobe.jp')
assert 'com' == psl.get_tld('amazonaws.com')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither kobe.jp nor amazonaws.com is in https://publicsuffix.org/list/public_suffix_list.dat .

Copy link
Contributor

@KnitCode KnitCode Jul 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reordered for clarity...

*.kobe.jp was in the psl at some point, as are several subdomains of amazon.aws.
so these are good examples of the behavior you want to fix.

the public suffix list does change regularly, so we have to take some care that we aren't dependent on creating tests that depend on something we don't control. the behavior should be independent of the actual PSL. Also, as we noted in Vadym's pull request, some of the unit tests on the mozilla site were faulty (didn't follow the stated algorithm).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our tests shouldn't be reliant on moving data. we should use a fixed test file.
otherwise, agree... unless the wildcards are turned off, and then these become entries in the list.

Copy link
Contributor Author

@hiratara hiratara Jul 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our tests shouldn't be reliant on moving data. we should use a fixed test file.

I perfectly agree with you. However, we should create another issue to solve that problem.

I found that we already use static data for testing.

https://github.com/nexB/python-publicsuffix2/blob/develop/src/publicsuffix2/public_suffix_list.dat

tests.py Outdated
assert 'com.pg' == psl.get_tld('telinet.com.pg', wildcard=True)
assert 'pg' == psl.get_tld('telinet.com.pg', wildcard=False)
assert None == psl.get_tld('telinet.com.pg', wildcard=False)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't use *.pg when we specify wildcard=False . Moreover, publicsuffix list doesn't contain pg .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure i agree here. certain companies remove the wildcards in the PSL so they can reduce the variety of output. if *.pg is in the list, which it usually is, then, when they remove the wildcard, pg will be in the list. It doesn't remove the entry itself, just the wildcard.

we could argue that users shouldn't use the list this way (but they do), and i don't think we want to have a situation where a user puts in a legitimate domain and gets None for the tld. In this particular case, e.g., 'pg' is a real TLD, so it should never return None.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to this, I now understand the semantics of wildcard=False. I fix the behavior on c6575d5 .

Copy link
Contributor

@KnitCode KnitCode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for looking through this. i made comments via the wildcard to help explain. if the README isn't clear enough, we can add a few more lines.
Primary concern is on wildcard use and that we should never return a None TLD when the domain contains a legit one.

tests.py Outdated
@@ -212,8 +212,8 @@ def test_get_sld_exceptions(self):
def test_get_sld_no_wildcard(self):
psl = publicsuffix.PublicSuffixList()
# test completion when no wildcards should be processed
assert 'com.pg' == psl.get_sld('telinet.com.pg', wildcard=False)
expected = 'ap-southeast-1.elb.amazonaws.com'
assert 'pg' == psl.get_sld('telinet.com.pg', wildcard=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't agree with this one. when no wildcards are used (or totally regardless of the PSL at all), pg is a TLD. and so the sld is going to be one level up, i.e., 'com.pg'. It's never the case that 'pg' is an SLD.

tests.py Outdated
assert 'com.pg' == psl.get_sld('telinet.com.pg', wildcard=False)
expected = 'ap-southeast-1.elb.amazonaws.com'
assert 'pg' == psl.get_sld('telinet.com.pg', wildcard=False)
expected = 'amazonaws.com'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this test was made when '*.amazonaws.com' was in the PSL -- so like Vadym's tests we should probably use a set list that isn't changing. In that case, same logic would happen for this. the wildcard goes away, and the item is processed as a suffix. I guess the readme isn't clear enough on this?

anyhow, currently this is not wildcarded in the PSL so it looks wrong, but the test should be on a fixed example input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and it seems that *amazonaws.com has never been on the publicsuffix list before.

$ git remote -v
origin  ssh://git@github.com/publicsuffix/list.git (fetch)
origin  ssh://git@github.com/publicsuffix/list.git (push)
$ git log -S'*.amazonaws.com' -p
$

Comment on lines +231 to +232
assert 'jp' == psl.get_tld('kobe.jp')
assert 'com' == psl.get_tld('amazonaws.com')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our tests shouldn't be reliant on moving data. we should use a fixed test file.
otherwise, agree... unless the wildcards are turned off, and then these become entries in the list.

tests.py Outdated
assert 'com.pg' == psl.get_tld('com.pg', wildcard=True)
assert 'pg' == psl.get_tld('com.pg', wildcard=False)
assert None == psl.get_tld('com.pg', wildcard=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should never return none for a valid TLD.

@hiratara hiratara changed the title [WIP] make get_tld return only eTLDs in the publicsuffix list make get_tld return only eTLDs in the publicsuffix list Jul 24, 2020
@hiratara
Copy link
Contributor Author

@KnitCode I fix the behavior of wildcard=False by c6575d5 . What do you think?

@KnitCode
Copy link
Contributor

KnitCode commented Dec 3, 2021

@hiratara I'm sorry somehow I lost track of this -- and a pandemic. I'm really sorry. Happy to check again. I checked in today because I realized that the code is not handling the private suffixes as expected... and we have tests for them so I'm super confused. Do you want to revisit and merge this PR? there is another one on Nov 3rd from someone else.

The bug i see is that we should have

get_sld('this.that.internal') = 'that.internal'

as the documentation and tests show, but instead we have

get_sld('this.that.internal') = 'internal'

@hiratara
Copy link
Contributor Author

hiratara commented Jan 8, 2022

Hello @KnitCode,

as the documentation and tests show, but instead we have
get_sld('this.that.internal') = 'internal'

Unfortunately, it has always behaved that way. I tried to check out https://github.com/nexB/python-publicsuffix2/releases/tag/release-2.2019-08-12 (before I contributed this repository) and found that get_sld('this.that.internal') was already returning 'internal'. The document was also the same as now.

https://github.com/nexB/python-publicsuffix2/blob/b7fae069079a79cdb9ad7525c1fb0005f131faf4/src/publicsuffix2/__init__.py#L244-L246

I also tried 9cd5417, the commit on which the document was added, but I got the same result. I think the documentation is wrong.

I understand it is a bug, but it is not related this PR. We should open a new issue.

@KnitCode
Copy link
Contributor

@hiratara oh interesting. I checked my old old versions as well. I think i wrote that part of the documentation, so it must have been a mistake in the documentation. I would like to see it return this.local but in any case... where are we at on this PR? there is another one waiting behind it.

I think we need @pombredanne to merge anything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants