Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

It should be possible to prevent the use of the fallback snapshot when using the CLI #259

Closed
nicholas-plutoflume opened this issue Mar 17, 2022 · 4 comments · Fixed by #260

Comments

@nicholas-plutoflume
Copy link
Contributor

nicholas-plutoflume commented Mar 17, 2022

When you initialise the cache, either through the CLI or through update(fetch_now=True), it calls it on a TLDExtract object that has the default suffix list and fallback_to_snapshot set to True. This causes problems if you later try to use a TLDExtract object that points to the same cache directory, but with fallback_to_snapshot set to False.
Because fallback_to_snapshot is one of the hashed arguments, it will generate a totally different cache path, and miss the cache set previously through update.

I believe it would make sense to make one of two modifications:

  1. Allow the user to explicitly pass through fallback_to_snapshot to the update function / CLI.
  2. Or, ignore the fallback_to_snapshot parameter when constructing the cache path.

I believe option 2 is better for the following reasons:

  • It doesn't make sense to call update in a scenario where the intent is to use fallback_to_snapshot, because you can use tldextract without pre-seeding the cache, and just use the snapshot instead.
  • The pivotal element should only be the set of public suffix urls. If those change, it makes sense to use a different cache file for lookups.

To summarise (assuming the prefix list urls are the same):

  • If fallback_to_snapshot is turned on or off after you cache items it won't matter because you've already cached the suffix lists, and that's where it should get the data from.
  • I can't think of a scenario where you would have the cache off and have two different TLDExtract objects (with the same prefix list urls) where for one the fallback is on, for the other it is off.

I'll make a PR to fix this using method 2. But I'm curious to hear if you agree with my analysis.
~

@nicholas-plutoflume
Copy link
Contributor Author

nicholas-plutoflume commented Mar 17, 2022

Actually, I'm not sure I agree with what I just wrote.

If you run something with the fallback turned on, it makes sense that if you later run an app with the fallback turned off, you wouldn't want it grabbing items from the cache which have previously had the fallback turned on.

Instead I propose passing the fallback option through the CLI. This allows you to explicitly state when populating the cache whether it should be allowed to use the fallback as a source of truth.

@nicholas-plutoflume nicholas-plutoflume changed the title fallback_to_snapshot should not influence the cache file name It should be possible to prevent the use of the fallback snapshot when using the CLI Mar 17, 2022
@john-kurkowski
Copy link
Owner

Related: #251

@nicholas-plutoflume
Copy link
Contributor Author

nicholas-plutoflume commented Mar 21, 2022

Interesting, hadn't read that issue.
I'd argue it's worth adding that CLI option, because with the change in caching behaviour previously working code no longer works.

Previously, it was totally valid to run tldextract --update --private_domains, then in your code create the object with TLDExtract(suffix_list_urls=None, fallback_to_snapshot=False).
The intention there was to prevent the code from ever using the internet or snapshot if the cache isn't populated.

But under version 3.0, because it considers the fallback_to_snapshot as True when running --update, you then have to initialise your object with fallback_to_snapshot=True, even if you really, really don't want it to fallback to the snapshot. With the PR I linked this should be fixed, because you can now populate your cache with the same options as the object that you later create, preemptively preventing you from ever falling back on the snapshot.

But then again I don't shoulder the burden of maintaining any extra command-line options, so I can't exactly judge 😅

@john-kurkowski
Copy link
Owner

Wow, thank you for thinking through that! I'm sold.

I did previously consider maintaining additional CLI args a burden. Perhaps it's actually a fun challenge. Like automatically generating the interface, if it became truly burdensome. But that's way in the distance. Your fix today is simple.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants