Cache entire public suffix list. Select at runtime. #144

brycedrennan · 2018-02-21T03:46:23Z

Addresses #66 but is not backwards compatible.

Changes

Moves include_psl_private_domains to the __call__ method. This is now something you choose on a per-call basis.
The entire dataset from publicsuffix.org is saved to cache
Added 'source' attribute to named tuple which tells you which suffix list the url was matched against
Ensured no weird cache issues happen when using with different suffix_list_urls by using different filenames per suffix_list_urls
Updates the bundled snapshot

floer32 · 2019-03-03T23:59:02Z

i do suspect it could be backwards compatible without that being a significant setback. it would be good to be able to proceed with this without it being a major version upgrade/ breaking people

do you have things fresh enough in your mind to take a whack at making it compatible (and ensuring TravisCI is passing)? if not, let me know, and i can take a turn

floer32 · 2019-03-04T00:03:13Z

@brycedrennan also --

Updates the bundled snapshot

appreciate you trying to update the snapshot but it appears that you deleted and did not commit the new one?

brycedrennan · 2019-03-04T00:06:29Z

It just doesn’t have new lines anymore. It’s a different format.

floer32 · 2019-03-04T01:18:48Z

@brycedrennan ohh right, now I see

~~What's the rationale for switching to JSON encoding?~~ so JSON is because of the different batches ... i'm wondering if it would be better to stick with with newline-separated. as for how the list is separated into N categories, maybe the snapshots and cache files should be split up into N files. 'parsing' by doing splitlines is much faster than parsing JSON

@john-kurkowski thoughts?

floer32 · 2019-03-04T01:19:55Z

marked as prioritized because this could solve a couple of open issues at once

brycedrennan · 2019-03-04T03:02:34Z

@hangtwenty It's been a while and I'll dig into it later but my recollection is that this has to be a breaking change. If we increment the version properly to 3.0.0 that shouldn't cause anyone problems.

The reason it needs to be a "breaking" change is that the previous behavior was that the first call of tldextract per machine would forever specify whether or not private domains were included for all calls afterwords. This made it impossible to use tldextract in both modes simultaneously, but also would return unexepected results since choices made at runtime were ignored in favor of whatever was in the cache. For example:

extract_private = tldextract.TLDExtract(include_psl_private_domains=True)
extract_public = tldextract.TLDExtract(include_psl_private_domains=False)

# these shouldn't be the same but will be currently
extract_private('waiterrant.blogspot.com`) == extract_public('waiterrant.blogspot.com')

I'll also look at the CI build failures, but if my memory serves me, they were inconsistently passing. I'll revisit it though.

floer32 · 2019-03-05T18:50:46Z

@brycedrennan I understand that this is a fix resulting in different (more correct) behavior. And, understood that if someone has been hitting undefined behavior, they could be subtly depending on undefined/unspecified behavior, and have a change now. sometimes in such a case it's good to force people to break to be aware of the update. i don't think it's the right way to do it here though. (can discuss further though... that's just my initial sense)

The nature of a cache file ... it does not have to be breaking. Old snapshot or cache can be ignored. Will be using new cache keys so new way will Just Work.

As for moving the argument from the TLDExtract constructor to the call, it does not need to be a breaking change either, both call signatures could be supported. That's not dissimilar from the requests library, where one can set up arguments in the Session constructor, or in individual calls (and the individual calls can override)

Am I making sense?

floer32 · 2019-03-05T18:55:11Z

I do think the JSON aspect needs to be changed before merge as well (unless someone does some benchmarking to confirm it's OK, but TBH that seems like a yak-shave and it would be better to just make the change)

So the two criteria that are pending, before merge:

change should not break callers at code level (unless it's shown that it really really has to)

2. persist to plaintext/ newline-separated, because the json overhead does not offer clear enough benefit, and we know for sure that people use this library in performance-sensitive contexts (massive scale projects doing many checks) 👌

brycedrennan · 2019-03-06T02:05:01Z

It's also breaking because ExtractResult has been changed.

floer32 · 2019-03-06T05:36:33Z

Ah, right. Need to think about whether to keep that part ... I can see the pros but I can also see the cons... thinking about subclassing namedtuple and making source into optional metadata. it's not the same category of information as the other fields in ExtractResult

floer32 · 2019-03-06T05:37:09Z

I can take a stab at making the changes non-breaking — LMK

brycedrennan · 2019-03-06T06:09:42Z

Sorry to not have more time to respond.

Thats fine if we want to have the behavior still be configured at the object level like before. In my book its a breaking change no matter what because of the behavior change but I'm okay with whatever you decide.
if we load the cache from plaintext then we have to do processing to get it into a usable data structure, json is saved in the format that lets it be used immediately. I think JSON decoding is plenty fast but I also don't see why it matters. If I'm processing millions of urls I'm only loading up the cache file into memory once per process. If someone is using some massively distributed system to do 1 url per process then they already have a lot of overhead per call and this will not matter.
fine removing the source or moving it to an object attribute

brycedrennan · 2019-03-06T06:20:11Z

On my computer, loading the json data file takes less than 1ms.

import time

def test_load_cachefile():
    start = time.perf_counter()
    num = 1000
    for _ in range(num):
        data = TLDExtract._get_snapshot_tld_extractor()
    duration = time.perf_counter() - start
    print('')
    per_load = (duration*1000)/num
    print('%s ms per load' % per_load)

0.7280355759430677 ms per load

floer32 · 2019-03-06T22:25:39Z

@brycedrennan thanks for doing that check! You raise a good point there in ((2)), callers can ensure they aren't doing redundant inits, so it does not need to be overengineered here.

brycedrennan · 2019-03-07T05:03:54Z

If you haven't already. Gonna give it a go at addressing your feedback.

floer32 · 2019-03-07T18:32:45Z

@brycedrennan agreed and yeah I was thinking about that. It could still be technically a breaking change if downstream relied on isinstance etc ... I don't mind breaking those use-cases because they are un-Pythonic, but it still qualifies as major-release in terms of changelong

but! maybe we can do it with duck-typing and then also faking the type a bit. While faking the type seems like a smell (overriding multiple dunder methods (__ methods), namedtuple itself does that. And even though first reaction is 'that smells', on closer thought I don't think it's harmful; I don't see much that can go wrong practically.

Notes on what methods:

https://stackoverflow.com/questions/6803597/how-to-fake-type-with-python

floer32 · 2019-03-07T18:39:01Z

@brycedrennan or maybe there's even a way to stick with 'real' named tuples and only the tiniest bit of type-manipulation.

two types - ExtractResult that is the old way and totally normal namedtuple - and a new one ExtractResultWithMetadata (or some better name). the new one is ideally a namedtuple too, but could just quack like one, if necessary. the new one needs to have that type-faking to claim it is a subclass of ExtractResult, which allows isinstance(ExtractResult) ... to work still.
add a method namedtuple_with_defaults or somesuch. just wrapping namedtuple with overload/dispatch based on whether default-kwargs were specific or left as None. this is pretty clean if one recalls that namedtuple is itself a function not a class :)
that method returns either of the classes based on whether defaults are given... for our code it'll alway be giving source and always be the new class. but downstream should still be able to keep code that uses ExtractResult more closely/directly

@brycedrennan what do you think? if it seems basically sane, I'm happy to take a whack at it

john-kurkowski · 2019-03-07T21:02:44Z

@john-kurkowski what do you think about - if we cut a major version after this is merged? (sometime soon but with a bit of leeway to review open issues and see if anything else should go in)

Sounds good. Can either of you summarize the breaking changes & upgrade steps in CHANGELOG.md in this PR? Put it at the top in a new "Unreleased" section? I don't normally update CHANGELOG.md outside of master or prior to a release. But if it's going to be a breaking change, and there's been a lot of back and forth discussion, I think the summary would benefit everybody.

john-kurkowski · 2019-03-07T21:04:19Z

.gitignore

@@ -7,3 +7,9 @@ tldextract_app/tldextract
 tldextract_app/web
 tldextract.egg-info
 .tox
+.idea


Minor: can we keep this file focused on just what this project needs to ignore? Personal preferences e.g. editor or platform should be in a personal .gitignore.

I didn't know I could do that - thanks!

john-kurkowski · 2019-03-07T21:05:51Z

README.md

-ExtractResult(subdomain='', domain='127.0.0.1', suffix='')
-```
-
-If you want to rejoin the whole namedtuple, regardless of whether a subdomain


Is there no way to do this now?

brycedrennan · 2019-03-07T21:52:54Z

I'd be okay dropping the whole idea of storing source and thus avoiding all the complications.

floer32 · 2019-03-09T03:00:05Z

@brycedrennan Agree, I can't convince myself any of the complicated options are worth going magical/weird for, as I continue thinking about it. Glad we discussed them though, thank you!

I had another 💡 today --

Maybe we can get the best of both worlds

TLDExtract configuration could either have a flag about whether to add the trace-ability/metadata field(s). Default False ➡️ non-breaking, returns same exact ExtractResult as today. True ➡️ returns ExtractResultPlus (or some better name), and that has source. Or support composition allowing a function to be passed in for returning the result. (... YAGNI?)

Or another way to 'get the best of both worlds' would be to do some thoughtful logging support.

Either of those ways could be punted to another issue IMHO, we could drop source for now to get this merged.

Thoughts?

brycedrennan · 2019-03-12T16:57:31Z

So I’ve been working on a larger refactor and realized that having the plain text version of the list in the repo is super useful for testing. So im backtracking despite my previous arguments.

I also think we can make something that acts like a named tuple and have it not be horrible.

So current plan for this PR:

remove source from result
go back to plaintext
request your review again

Adding back source, and other improvements can be in a different PR.

floer32 · 2019-03-14T16:43:11Z

@brycedrennan cheers, sounds good. Also nice bumping into you on the Poetry repo haha

brycedrennan · 2019-03-18T05:24:49Z

Okay made the changes.

I switched the snapshot to be just a copy of the raw text. I thought that this was the format before my changes, but looking at the history, it was not. I would argue that having it be the same content that we pull from the public server will allow us to simplify some code in the future and use the raw file in tests. (I've already made those changes in another, not yet pushed, branch)

If everyone is okay with these recent changes I'll squash the latest commit into the first one so we have a cleaner commit history.

floer32 · 2019-03-18T16:00:59Z

@brycedrennan Last commit looks great — looking at the full diff I am wondering if there’s something else to change — but I’m multitasking right now and it could be PEBCAK on my end. I’ll poke around again tonight

Addresses john-kurkowski#66 but is not backwards compatible.

Handles concurrency issues.

john-kurkowski · 2020-08-03T19:22:19Z

Holy bump from the past. Thanks for your patience on this one. 😓

I merged in latest master and took a stab at the CHANGELOG in my separate-icann-private branch (I don't think I have write access to this PR). Did I capture things correctly?

brycedrennan · 2020-10-09T05:19:48Z

Superceded by #207

This is a second attempt at doing what was done in #144. Addresses #66. - Add `include_psl_private_domains` to the `__call__` method. This is now something you can choose on a per-call basis. The object level argument now is only a default value for each call. - The entire dataset from publicsuffix.org is saved to cache - Ensure no weird cache issues happen when using with different `suffix_list_urls` by using different filenames per `suffix_list_urls` - Use filelock to support multiprocessing and multithreading use cases - Update bundled snapshot to be the raw publicsuffix data. Need to look at performance impact of this. - Breaking change `cache_file` => `cache_dir`

brycedrennan force-pushed the separate-icann-private branch 3 times, most recently from d308b92 to e31272a Compare February 21, 2018 04:08

This was referenced Sep 10, 2018

No more weird caching issues. CircleUp/tldextract#2

Merged

Public/private results depend on how cache file was persisted #66

Closed

Fixed concurrent access to cache file #146

Merged

floer32 self-assigned this Mar 3, 2019

floer32 self-requested a review March 3, 2019 23:58

floer32 assigned brycedrennan Mar 3, 2019

This was referenced Mar 4, 2019

Could you protect cache file against lib updates ? #153

Open

Weird bug with dashes and .de #157

Closed

floer32 requested a review from john-kurkowski March 4, 2019 01:18

floer32 added the ⬆️ prioritized for one reason or another, it's prioritized :) label Mar 4, 2019

brycedrennan force-pushed the separate-icann-private branch from 78a546e to 3f8c901 Compare March 4, 2019 03:50

floer32 added the hold: performance concerns until changed label Mar 5, 2019

john-kurkowski reviewed Mar 7, 2019

View reviewed changes

brycedrennan force-pushed the separate-icann-private branch 4 times, most recently from d9ca293 to afc4066 Compare March 18, 2019 04:13

brycedrennan force-pushed the separate-icann-private branch 3 times, most recently from 1427048 to f946a74 Compare March 19, 2019 05:05

brycedrennan added 2 commits March 18, 2019 22:15

feat: cache entire public suffix list. Select at runtime.

07260a9

Addresses john-kurkowski#66 but is not backwards compatible.

feat: add lockfile to cache.

010a3af

Handles concurrency issues.

brycedrennan force-pushed the separate-icann-private branch from f946a74 to 010a3af Compare March 19, 2019 05:16

chrisbraucker mentioned this pull request Nov 10, 2019

TLDExtract Private Domains for Dynamic DNS Providers AnalogJ/lexicon#461

Closed

john-kurkowski removed the hold: performance concerns until changed label Aug 3, 2020

This was referenced Aug 4, 2020

Import errors occur when used inside spark #118

Closed

Allow cache_file argument to also accept list #204

Closed

brycedrennan mentioned this pull request Oct 8, 2020

feature: private tlds can be used at call-time #207

Merged

brycedrennan closed this Oct 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cache entire public suffix list. Select at runtime. #144

Cache entire public suffix list. Select at runtime. #144

brycedrennan commented Feb 21, 2018 •

edited

Loading

floer32 commented Mar 3, 2019 •

edited

Loading

floer32 commented Mar 4, 2019

brycedrennan commented Mar 4, 2019 •

edited

Loading

floer32 commented Mar 4, 2019

floer32 commented Mar 4, 2019 •

edited

Loading

brycedrennan commented Mar 4, 2019

floer32 commented Mar 5, 2019 •

edited

Loading

floer32 commented Mar 5, 2019 •

edited

Loading

brycedrennan commented Mar 6, 2019

floer32 commented Mar 6, 2019

floer32 commented Mar 6, 2019

brycedrennan commented Mar 6, 2019

brycedrennan commented Mar 6, 2019 •

edited

Loading

floer32 commented Mar 6, 2019 •

edited

Loading

brycedrennan commented Mar 7, 2019

floer32 commented Mar 7, 2019

floer32 commented Mar 7, 2019 •

edited

Loading

john-kurkowski commented Mar 7, 2019

john-kurkowski Mar 7, 2019

brycedrennan Mar 7, 2019

john-kurkowski Mar 7, 2019

brycedrennan commented Mar 7, 2019

floer32 commented Mar 9, 2019 •

edited

Loading

brycedrennan commented Mar 12, 2019 •

edited

Loading

floer32 commented Mar 14, 2019

brycedrennan commented Mar 18, 2019 •

edited

Loading

floer32 commented Mar 18, 2019

john-kurkowski commented Aug 3, 2020

brycedrennan commented Oct 9, 2020

Cache entire public suffix list. Select at runtime. #144

Cache entire public suffix list. Select at runtime. #144

Conversation

brycedrennan commented Feb 21, 2018 • edited Loading

floer32 commented Mar 3, 2019 • edited Loading

floer32 commented Mar 4, 2019

brycedrennan commented Mar 4, 2019 • edited Loading

floer32 commented Mar 4, 2019

floer32 commented Mar 4, 2019 • edited Loading

brycedrennan commented Mar 4, 2019

floer32 commented Mar 5, 2019 • edited Loading

floer32 commented Mar 5, 2019 • edited Loading

brycedrennan commented Mar 6, 2019

floer32 commented Mar 6, 2019

floer32 commented Mar 6, 2019

brycedrennan commented Mar 6, 2019

brycedrennan commented Mar 6, 2019 • edited Loading

floer32 commented Mar 6, 2019 • edited Loading

brycedrennan commented Mar 7, 2019

floer32 commented Mar 7, 2019

floer32 commented Mar 7, 2019 • edited Loading

john-kurkowski commented Mar 7, 2019

john-kurkowski Mar 7, 2019

Choose a reason for hiding this comment

brycedrennan Mar 7, 2019

Choose a reason for hiding this comment

john-kurkowski Mar 7, 2019

Choose a reason for hiding this comment

brycedrennan commented Mar 7, 2019

floer32 commented Mar 9, 2019 • edited Loading

brycedrennan commented Mar 12, 2019 • edited Loading

floer32 commented Mar 14, 2019

brycedrennan commented Mar 18, 2019 • edited Loading

floer32 commented Mar 18, 2019

john-kurkowski commented Aug 3, 2020

brycedrennan commented Oct 9, 2020

brycedrennan commented Feb 21, 2018 •

edited

Loading

floer32 commented Mar 3, 2019 •

edited

Loading

brycedrennan commented Mar 4, 2019 •

edited

Loading

floer32 commented Mar 4, 2019 •

edited

Loading

floer32 commented Mar 5, 2019 •

edited

Loading

floer32 commented Mar 5, 2019 •

edited

Loading

brycedrennan commented Mar 6, 2019 •

edited

Loading

floer32 commented Mar 6, 2019 •

edited

Loading

floer32 commented Mar 7, 2019 •

edited

Loading

floer32 commented Mar 9, 2019 •

edited

Loading

brycedrennan commented Mar 12, 2019 •

edited

Loading

brycedrennan commented Mar 18, 2019 •

edited

Loading