get_tld(), get_components(), and other #17

Open · wants to merge 6 commits into base: master

Conversation


@vadym-t vadym-t commented Apr 27, 2020

Reworked the code of get_tld() to reduce the number of temporary objects (and so the load on the GC in the case of bulk operations). The requirement for the input domain name is updated so that all methods behave consistently. The updated requirement: the input domain string is a FQDN without a trailing dot. Empty labels are not allowed.

  1. get_tld() is split into get_tld() and get_tld_unsafe(). The latter implements the entire logic; the former performs the lower() conversion and None checks.
  2. get_tld_unsafe() is rewritten to create a minimal number of temporary objects.
  3. get_sld() is split into get_sld() and get_sld_unsafe(). The latter is based on get_tld_unsafe(), following the same principles.
  4. get_components() and get_components_unsafe() are introduced. These methods split the domain string into a 3-tuple (prefix, SLL, TLD), where SLL is the second-level label and TLD is the TLD or eTLD.
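
The safe/unsafe split in items 1-3 can be sketched roughly as follows. This is a hypothetical simplification, not the PR's actual implementation: real PSL matching handles wildcard and exception rules, which are reduced here to a plain set lookup.

```python
# Rough sketch of the safe/unsafe split described above (hypothetical;
# rule matching is reduced to a plain set lookup for illustration).

def get_tld_unsafe(domain, psl_rules):
    """Core logic: assumes `domain` is already lowercased and not None."""
    rest = domain
    while True:
        if rest in psl_rules:       # longest suffix is checked first
            return rest
        dot = rest.find('.')
        if dot == -1:
            return None             # no suffix matched any rule
        rest = rest[dot + 1:]       # drop the leftmost label

def get_tld(domain, psl_rules):
    """Thin safe wrapper: only the None check and lower() conversion."""
    if domain is None:
        return None
    return get_tld_unsafe(domain.lower(), psl_rules)
```

In bulk processing, callers that have already normalized their data can invoke get_tld_unsafe() directly and skip the per-call checks.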

…nts. Input domain string is a FQDN without a trailing dot. Empty labels are not allowed. Explanatory comments precede modifications when necessary.
…l (hopefully all) possible configurations of the public-suffix list. Full, consistent validation of the results of the get_tld, get_sld and get_components methods is provided for each configuration. As the get_sld and get_components methods are based on get_tld, the get_tld test cases are more generic (they include more complex variants); the get_sld and get_components cases check only the difference in the configuration from the previous variant.
…es. Eventually it should cover the full set of test cases and be merged into the README.

@pombredanne pombredanne left a comment


@vadym-t Thank you ++
That's a big PR and I reckon this would change the API and behavior.
I would rather avoid doing so as it would break existing code. Would there be a way to achieve something similar without changing behavior? (e.g. adding new functions or optional args?)

Also, you have introduced Python 3-only idioms that make the tests fail. For now, I would like to keep things working on Python 2.7 too. Do you think this could be reworked?

@pombredanne

@hiratara @KnitCode ping... your review is much welcomed.

@KnitCode

@vadym-t Thank you ++
That's a big PR and I reckon this would change the API and behavior.
I would rather avoid doing so as it would break existing code. Would there be a way to achieve something similar without changing behavior? (e.g. adding new functions or optional args?)

Also, you have introduced Python 3-only idioms that make the tests fail. For now, I would like to keep things working on Python 2.7 too. Do you think this could be reworked?

Hi. I've worked with Vadym extensively over the last week+ on this logic. He is on our team and identified logical bugs and performance issues. I haven't yet code-reviewed the latest version, but I tested an earlier version extensively without issues. To your point, though, I didn't test Python 2.7 at all.

The primary changes are two-fold:

  • changes to the computation that reduce the creation of temporary objects. This has the same result as the previous code. In my testing, it was slightly faster than the old version on datasets that would fit in memory for get_tld() and get_sld(); for large distributed sets, the timing was the same but fewer objects were created. This helps for Dask/Spark usage of the library.
  • test changes. Vadym dug into the algorithm on the Mozilla site and compared it to the tests (theirs and ours). There were tests in their suite that are inconsistent and appear to be based on an old PSL list, but they passed for us because of default logic in our existing code, which itself was faulty on edge cases. A number of the issues occur at the edge that most users, e.g., web page usage, wouldn't see, but we see them in very large DNS systems.
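
The first bullet is about avoiding allocations in the hot path. As a tiny illustration of the idea (not code from this PR), compare extracting the last label of a domain by splitting versus by scanning:

```python
# Illustration of the temporary-object point above, not code from the PR:
# two ways to get the last label of a domain name.

def last_label_split(domain):
    # Allocates a list plus one string per label; all but one discarded.
    return domain.split('.')[-1]

def last_label_scan(domain):
    # One slice, no intermediate list. rfind returns -1 when there is no
    # dot, in which case the slice covers the whole string.
    return domain[domain.rfind('.') + 1:]
```

Both return the same result; the second shape is what "fewer temporary objects" looks like in practice, and the difference adds up over millions of calls in a Dask/Spark job.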

Vadym asked what to do -- and I suggested fixing the tests to what they really should be and doing a PR. In addition, he added a new suite which starts from the published algorithm and addresses all the combinations.

So overall, I think it should be merged and will be an upgrade for everyone. The functionality breaks will occur at the edge, for very large-scale DNS-type things where you'd see bad behavior and root queries. We plan to adopt it in Infoblox Data Science.

I will re-review the code and documentation.


@KnitCode KnitCode left a comment


my main comments are on:

  • documentation suggestions to add a few things,
  • the tests: where i'm expecting a strip to occur in the function get_sld(), it seems the tests do not show one. also, i don't see any tests for the unsafe versions -- that's where i'd expect to see the None behavior.

am i missing something?

where the backward compatibility breaks are:

  • edge cases that are rare, e.g., having root return None instead of an empty string -- i don't have a strong opinion on this.
  • cases where the tests didn't match the published algorithm.

This file describes the exact behavior of the methods for different edge cases
and explains the general logic. This description covers the behavior of get_tld,
get_tld_unsafe, get_sld, get_sld_unsafe, split_domain, split_domain_unsafe


@vadym-t suggest adding a "why" sentence here:
Unsafe versions of the methods can significantly save resources in large-scale applications of the library where the data has already been converted to lowercase and missing data has a None value. This can be done in Spark/Dask, for example, and results in a significant reduction in computational resources. For ad hoc usage, the original functions are sufficient.
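
The usage pattern this suggestion describes might look like the following sketch. The helper name and the idea of passing in the unsafe function are hypothetical; only the get_*_unsafe() naming comes from the PR.

```python
# Hypothetical bulk-processing sketch: hoist the None check out of the
# library call and lowercase each value exactly once, so the hot loop can
# call the unsafe variant directly.

def bulk_slds(domains, sld_unsafe):
    return [sld_unsafe(d.lower()) if d is not None else None
            for d in domains]
```

In a Spark/Dask pipeline the normalization step would typically already have happened upstream, making even the lower() call here unnecessary.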

Author


Added

as leading dots are not processed.
The split_domain method is based on the get_sld method - it returns everything
in front of the get_sld() result as the prefix.
Specifically, for the example above

@vadym-t you might also add a non-edge-case example here. the split_domain() method offers a new capability to the library -- one that folks might get from other libraries -- but your only example is the edge case. suggest something like:
split_domain allows you to recover the host, or prefix, of an SLD, for use in aggregation or analysis based on the labels, e.g., split_domain('www.google.com')
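
For illustration, a toy version of that 3-tuple split behaves as below. This is hypothetical code, not the library's implementation: it assumes a one-label TLD rather than doing a real public-suffix lookup.

```python
# Toy illustration of the (prefix, second-level label, TLD) split that
# get_components()/split_domain() is described as returning. Hypothetical:
# it assumes a one-label TLD instead of a real PSL lookup.

def split_components_toy(domain):
    labels = domain.split('.')
    tld = labels[-1]
    sll = labels[-2] if len(labels) > 1 else None
    prefix = '.'.join(labels[:-2]) if len(labels) > 2 else None
    return prefix, sll, tld

# split_components_toy('www.google.com') -> ('www', 'google', 'com')
```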

get_*_unsafe() does not check whether the input string is None and does not
transform it to lowercase.

2. The methods listed above work only with non-canonical FQDN strings -

does "listed above" mean all methods or just the unsafe ones?

Edge cases and expected behavior
The behavior of the library is best illustrated with small examples
(boolean arguments are omitted if they do not affect behavior):


Suggested addition:
The library allows you to create a public suffix list from any file-like object, including a list. In the examples below, we construct the PSL with different lists to demonstrate the functionality of the library under different conditions. In particular, the examples show an empty list, a list of two elements, and the use of negation in a list. Combinations of the method parameters are given with notes about the result. These results follow the logic of the Mozilla library as published on the publicsuffix.org website.
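
For reference, the Mozilla matching rules these examples exercise (normal entries, '*' wildcards, and '!' exceptions) can be sketched in a few lines. This toy function is independent of this library's actual API and is only meant to show the rule semantics:

```python
# Toy sketch of the published Public Suffix List matching rules (normal,
# wildcard '*', and exception '!' entries). Not this library's API; it
# returns the registrable domain (SLD) for a given rule list.

def get_sld_toy(domain, rules):
    labels = domain.split('.')
    match_len = 0
    for rule in rules:
        if rule.startswith('!'):
            r = rule[1:].split('.')
            if labels[-len(r):] == r:
                # An exception rule wins outright: the public suffix is
                # one label shorter than the rule itself.
                match_len = len(r) - 1
                break
        else:
            r = rule.split('.')
            if len(r) <= len(labels) and all(
                    a == '*' or a == b
                    for a, b in zip(r, labels[-len(r):])):
                match_len = max(match_len, len(r))
    if match_len == 0:
        match_len = 1   # the implicit default rule is '*'
    if len(labels) <= match_len:
        return None     # the whole name is itself a public suffix
    return '.'.join(labels[-(match_len + 1):])
```

With rules ['com', '*.tokyo.jp', '!metro.tokyo.jp'], for example, 'metro.tokyo.jp' is registrable because the exception rule shortens the '*.tokyo.jp' suffix back to 'tokyo.jp'.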

@@ -78,7 +81,8 @@ def test_get_sld_from_list_with_exception_rule(self):

def test_get_sld_from_list_with_fqdn(self):
psl = publicsuffix.PublicSuffixList(['com'])
assert 'example.com' == psl.get_sld('example.com.')
# 'example.com.' -> <example>.<com>.<empty> -> None, empty labels are not allowed
assert None == psl.get_sld('example.com.')

didn't follow this comment. is this failing because of the trailing dot? I read the code as saying the get_sld() version will remove the trailing dot ... so why doesn't this go to
example.com -> example, com -> example.com
??

@@ -284,13 +297,15 @@ def test_get_sld_backward_compatibility_sld_for_empty_string(self):

def test_get_sld_backward_compatibility_sld_for_fqdn(self):
psl = publicsuffix.PublicSuffixList()
assert 'foo.com' == psl.get_sld('www.foo.com.')
# 'www.foo.com.' -> <www>.<foo>.<com>.<empty> -> None, empty labels are not allowed
assert None == psl.get_sld('www.foo.com.')

same question about the strip

i don't see any tests for the unsafe methods. shouldn't there be a few? this is what the big change is.

assert None == psl.get_sld('.', strict=True)
assert '' == psl.get_sld('.', wildcard=False)
assert None == psl.get_sld('.', wildcard=False)

not backward compatible but logically more accurate

@@ -583,40 +583,49 @@ def test_get_sld_Mixed_case3(self):
assert 'example.com' == publicsuffix.get_sld('WwW.example.COM')

def test_get_sld_Leading_dot1(self):
assert 'com' == publicsuffix.get_sld('.com')
# '.com' -> <empty>.<com> -> None, empty labels are not allowed
assert None == publicsuffix.get_sld('.com')

i'm going to continue having all the strip questions - so i won't repeat them.


def test_get_sld_Leading_dot2(self):
assert 'example' == publicsuffix.get_sld('.example')
# '.example' -> <empty>.<example> -> None, empty labels are not allowed
assert None == publicsuffix.get_sld('.example')

the difference here is that example is not in the list. so in strict mode it should return None, and in non-strict mode it would return example -- with the strip

correct?


def test_get_sld_Unlisted_sld1(self):
assert 'example' == publicsuffix.get_sld('example')

def test_get_sld_Unlisted_sld2(self):
assert 'example' == publicsuffix.get_sld('example.example')
# non-strict: TLD:example, SLD:example.example
assert 'example.example' == publicsuffix.get_sld('example.example')

this is a change where the original logic was flawed -- it didn't follow the algorithm completely -- so the test itself was incorrect

@hiratara

_lookup_node doesn't seem to be used anymore.

https://github.com/nexB/python-publicsuffix2/blob/e95f43d26d57692fc1f153c8c9bbde62dabeaae7/src/publicsuffix2/__init__.py#L201-L240

I'm all for improving the execution speed. However, we are cautious about changing existing behavior. The reason the specs of this library change so often is that the specs are so vague. It's hard to determine if this specification change is reasonable (although I'm not actively opposed to it).

@KnitCode

KnitCode commented May 1, 2020

_lookup_node doesn't seem to be used anymore.

https://github.com/nexB/python-publicsuffix2/blob/e95f43d26d57692fc1f153c8c9bbde62dabeaae7/src/publicsuffix2/__init__.py#L201-L240

I'm all for improving the execution speed. However, we are cautious about changing existing behavior. The reason the specs of this library change so often is that the specs are so vague. It's hard to determine if this specification change is reasonable (although I'm not actively opposed to it).

agree, and my brain hadn't noticed that. @pombredanne I would think the priorities would be in rough order:

  • correcting faulty logic and unit tests based on the published algorithm; these corrections do a good job of documenting the algorithm and reveal faults in the original Mozilla tests that rely on certain things (e.g. example) being in the psl which aren't. i like the demonstrated combinations using a set list in the use_cases.md
  • ensuring backward compatibility, outside of being wrong :) ... so that the functions people are using today work as expected, and added functions give more controls, as needed,
  • py2.7 support for older use cases,
  • performance improvements.. that's always a fine thing.

agreed? given that, wondering if the merge can be split to get these. if the performance improvements suggested by @vadym-t only work with py3 idioms and don't use the trie, what about using an extra flag parameter to trigger usage of the new functions?

i can take a closer look on the weekend and rerun the experiments i did with an earlier version.

@pombredanne

@vadym-t ping?
