Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache entire public suffix list. Select at runtime. #144

Closed

Conversation

brycedrennan
Copy link
Collaborator

@brycedrennan brycedrennan commented Feb 21, 2018

Addresses #66 but is not backwards compatible.

Changes

  • Moves include_psl_private_domains to the __call__ method. This is now something you choose on a per-call basis.
  • The entire dataset from publicsuffix.org is saved to cache
  • Added 'source' attribute to named tuple which tells you which suffix list the url was matched against
  • Ensured no weird cache issues happen when using with different suffix_list_urls by using different filenames per suffix_list_urls
  • Updates the bundled snapshot

@floer32
Copy link
Collaborator

floer32 commented Mar 3, 2019

@brycedrennan this is great --

i do suspect it could be backwards compatible without that being a significant setback. it would be good to be able to proceed with this without it being a major version upgrade/ breaking people

do you have things fresh enough in your mind to take a whack at making it compatible (and ensuring TravisCI is passing)? if not, let me know, and i can take a turn

@floer32
Copy link
Collaborator

floer32 commented Mar 4, 2019

@brycedrennan also --

Updates the bundled snapshot

appreciate you trying to update the snapshot but it appears that you deleted and did not commit the new one?

@brycedrennan
Copy link
Collaborator Author

brycedrennan commented Mar 4, 2019

It just doesn’t have new lines anymore. It’s a different format.

@floer32
Copy link
Collaborator

floer32 commented Mar 4, 2019

@brycedrennan ohh right, now I see

What's the rationale for switching to JSON encoding? so JSON is because of the different batches ... i'm wondering if it would be better to stick with with newline-separated. as for how the list is separated into N categories, maybe the snapshots and cache files should be split up into N files. 'parsing' by doing splitlines is much faster than parsing JSON

@john-kurkowski thoughts?

@floer32 floer32 added the ⬆️ prioritized for one reason or another, it's prioritized :) label Mar 4, 2019
@floer32
Copy link
Collaborator

floer32 commented Mar 4, 2019

marked as prioritized because this could solve a couple of open issues at once

@brycedrennan
Copy link
Collaborator Author

@hangtwenty It's been a while and I'll dig into it later but my recollection is that this has to be a breaking change. If we increment the version properly to 3.0.0 that shouldn't cause anyone problems.

The reason it needs to be a "breaking" change is that the previous behavior was that the first call of tldextract per machine would forever specify whether or not private domains were included for all calls afterwords. This made it impossible to use tldextract in both modes simultaneously, but also would return unexepected results since choices made at runtime were ignored in favor of whatever was in the cache. For example:

extract_private = tldextract.TLDExtract(include_psl_private_domains=True)
extract_public = tldextract.TLDExtract(include_psl_private_domains=False)

# these shouldn't be the same but will be currently
extract_private('waiterrant.blogspot.com`) == extract_public('waiterrant.blogspot.com')

I'll also look at the CI build failures, but if my memory serves me, they were inconsistently passing. I'll revisit it though.

@floer32
Copy link
Collaborator

floer32 commented Mar 5, 2019

@brycedrennan I understand that this is a fix resulting in different (more correct) behavior. And, understood that if someone has been hitting undefined behavior, they could be subtly depending on undefined/unspecified behavior, and have a change now. sometimes in such a case it's good to force people to break to be aware of the update. i don't think it's the right way to do it here though. (can discuss further though... that's just my initial sense)

The nature of a cache file ... it does not have to be breaking. Old snapshot or cache can be ignored. Will be using new cache keys so new way will Just Work.

As for moving the argument from the TLDExtract constructor to the call, it does not need to be a breaking change either, both call signatures could be supported. That's not dissimilar from the requests library, where one can set up arguments in the Session constructor, or in individual calls (and the individual calls can override)

Am I making sense?

@floer32
Copy link
Collaborator

floer32 commented Mar 5, 2019

I do think the JSON aspect needs to be changed before merge as well (unless someone does some benchmarking to confirm it's OK, but TBH that seems like a yak-shave and it would be better to just make the change)

So the two criteria that are pending, before merge:

  1. change should not break callers at code level (unless it's shown that it really really has to)

2. persist to plaintext/ newline-separated, because the json overhead does not offer clear enough benefit, and we know for sure that people use this library in performance-sensitive contexts (massive scale projects doing many checks) 👌

@brycedrennan
Copy link
Collaborator Author

It's also breaking because ExtractResult has been changed.

@floer32
Copy link
Collaborator

floer32 commented Mar 6, 2019

Ah, right. Need to think about whether to keep that part ... I can see the pros but I can also see the cons... thinking about subclassing namedtuple and making source into optional metadata. it's not the same category of information as the other fields in ExtractResult

@floer32
Copy link
Collaborator

floer32 commented Mar 6, 2019

I can take a stab at making the changes non-breaking — LMK

@brycedrennan
Copy link
Collaborator Author

Sorry to not have more time to respond.

  1. Thats fine if we want to have the behavior still be configured at the object level like before. In my book its a breaking change no matter what because of the behavior change but I'm okay with whatever you decide.
  2. if we load the cache from plaintext then we have to do processing to get it into a usable data structure, json is saved in the format that lets it be used immediately. I think JSON decoding is plenty fast but I also don't see why it matters. If I'm processing millions of urls I'm only loading up the cache file into memory once per process. If someone is using some massively distributed system to do 1 url per process then they already have a lot of overhead per call and this will not matter.
  3. fine removing the source or moving it to an object attribute

@brycedrennan
Copy link
Collaborator Author

brycedrennan commented Mar 6, 2019

On my computer, loading the json data file takes less than 1ms.

import time

def test_load_cachefile():
    start = time.perf_counter()
    num = 1000
    for _ in range(num):
        data = TLDExtract._get_snapshot_tld_extractor()
    duration = time.perf_counter() - start
    print('')
    per_load = (duration*1000)/num
    print('%s ms per load' % per_load)

0.7280355759430677 ms per load

@floer32
Copy link
Collaborator

floer32 commented Mar 6, 2019

@brycedrennan thanks for doing that check! You raise a good point there in ((2)), callers can ensure they aren't doing redundant inits, so it does not need to be overengineered here.

@brycedrennan
Copy link
Collaborator Author

If you haven't already. Gonna give it a go at addressing your feedback.

@floer32
Copy link
Collaborator

floer32 commented Mar 7, 2019

@brycedrennan agreed and yeah I was thinking about that. It could still be technically a breaking change if downstream relied on isinstance etc ... I don't mind breaking those use-cases because they are un-Pythonic, but it still qualifies as major-release in terms of changelong

but! maybe we can do it with duck-typing and then also faking the type a bit. While faking the type seems like a smell (overriding multiple dunder methods (__ methods), namedtuple itself does that. And even though first reaction is 'that smells', on closer thought I don't think it's harmful; I don't see much that can go wrong practically.

Notes on what methods:

https://stackoverflow.com/questions/6803597/how-to-fake-type-with-python

@floer32
Copy link
Collaborator

floer32 commented Mar 7, 2019

@brycedrennan or maybe there's even a way to stick with 'real' named tuples and only the tiniest bit of type-manipulation.

  1. two types - ExtractResult that is the old way and totally normal namedtuple - and a new one ExtractResultWithMetadata (or some better name). the new one is ideally a namedtuple too, but could just quack like one, if necessary. the new one needs to have that type-faking to claim it is a subclass of ExtractResult, which allows isinstance(ExtractResult) ... to work still.
  2. add a method namedtuple_with_defaults or somesuch. just wrapping namedtuple with overload/dispatch based on whether default-kwargs were specific or left as None. this is pretty clean if one recalls that namedtuple is itself a function not a class :)
  3. that method returns either of the classes based on whether defaults are given... for our code it'll alway be giving source and always be the new class. but downstream should still be able to keep code that uses ExtractResult more closely/directly

@brycedrennan what do you think? if it seems basically sane, I'm happy to take a whack at it

@john-kurkowski
Copy link
Owner

@john-kurkowski what do you think about - if we cut a major version after this is merged? (sometime soon but with a bit of leeway to review open issues and see if anything else should go in)

Sounds good. Can either of you summarize the breaking changes & upgrade steps in CHANGELOG.md in this PR? Put it at the top in a new "Unreleased" section? I don't normally update CHANGELOG.md outside of master or prior to a release. But if it's going to be a breaking change, and there's been a lot of back and forth discussion, I think the summary would benefit everybody.

.gitignore Outdated
@@ -7,3 +7,9 @@ tldextract_app/tldextract
tldextract_app/web
tldextract.egg-info
.tox
.idea
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: can we keep this file focused on just what this project needs to ignore? Personal preferences e.g. editor or platform should be in a personal .gitignore.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know I could do that - thanks!

README.md Outdated
ExtractResult(subdomain='', domain='127.0.0.1', suffix='')
```

If you want to rejoin the whole namedtuple, regardless of whether a subdomain
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no way to do this now?

@brycedrennan
Copy link
Collaborator Author

I'd be okay dropping the whole idea of storing source and thus avoiding all the complications.

@floer32
Copy link
Collaborator

floer32 commented Mar 9, 2019

@brycedrennan Agree, I can't convince myself any of the complicated options are worth going magical/weird for, as I continue thinking about it. Glad we discussed them though, thank you!

I had another 💡 today --

Maybe we can get the best of both worlds

TLDExtract configuration could either have a flag about whether to add the trace-ability/metadata field(s). Default False ➡️ non-breaking, returns same exact ExtractResult as today. True ➡️ returns ExtractResultPlus (or some better name), and that has source. Or support composition allowing a function to be passed in for returning the result. (... YAGNI?)

Or another way to 'get the best of both worlds' would be to do some thoughtful logging support.

Either of those ways could be punted to another issue IMHO, we could drop source for now to get this merged.

Thoughts?

@brycedrennan
Copy link
Collaborator Author

brycedrennan commented Mar 12, 2019

So I’ve been working on a larger refactor and realized that having the plain text version of the list in the repo is super useful for testing. So im backtracking despite my previous arguments.

I also think we can make something that acts like a named tuple and have it not be horrible.

So current plan for this PR:

  • remove source from result
  • go back to plaintext
  • request your review again

Adding back source, and other improvements can be in a different PR.

@floer32
Copy link
Collaborator

floer32 commented Mar 14, 2019

@brycedrennan cheers, sounds good. Also nice bumping into you on the Poetry repo haha

@brycedrennan brycedrennan force-pushed the separate-icann-private branch 4 times, most recently from d9ca293 to afc4066 Compare March 18, 2019 04:13
@brycedrennan
Copy link
Collaborator Author

brycedrennan commented Mar 18, 2019

Okay made the changes.

I switched the snapshot to be just a copy of the raw text. I thought that this was the format before my changes, but looking at the history, it was not. I would argue that having it be the same content that we pull from the public server will allow us to simplify some code in the future and use the raw file in tests. (I've already made those changes in another, not yet pushed, branch)

If everyone is okay with these recent changes I'll squash the latest commit into the first one so we have a cleaner commit history.

@floer32
Copy link
Collaborator

floer32 commented Mar 18, 2019

@brycedrennan Last commit looks great — looking at the full diff I am wondering if there’s something else to change — but I’m multitasking right now and it could be PEBCAK on my end. I’ll poke around again tonight

@brycedrennan brycedrennan force-pushed the separate-icann-private branch 3 times, most recently from 1427048 to f946a74 Compare March 19, 2019 05:05
@john-kurkowski
Copy link
Owner

Holy bump from the past. Thanks for your patience on this one. 😓

I merged in latest master and took a stab at the CHANGELOG in my separate-icann-private branch (I don't think I have write access to this PR). Did I capture things correctly?

@brycedrennan
Copy link
Collaborator Author

Superceded by #207

john-kurkowski pushed a commit that referenced this pull request Oct 10, 2020
This is a second attempt at doing what was done in #144. Addresses #66.

- Add `include_psl_private_domains` to the `__call__` method.  This is now something you can choose on a per-call basis.  The object level argument now is only a default value for each call.
- The entire dataset from publicsuffix.org is saved to cache
- Ensure no weird cache issues happen when using with different `suffix_list_urls` by using different filenames per `suffix_list_urls`
- Use filelock to support multiprocessing and multithreading use cases
- Update bundled snapshot to be the raw publicsuffix data. Need to look at performance impact of this.
- Breaking change `cache_file` => `cache_dir`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
⬆️ prioritized for one reason or another, it's prioritized :)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants