Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consider that TLDExtract cache will be used by default when evaluating WPAD #74

Closed
KarelChanivecky opened this issue May 9, 2023 · 6 comments

Comments

@KarelChanivecky
Copy link
Contributor

TLDExtract performs an HTTP query to fetch valid top level domains. This is fine, except that this library will be mostly run within the context of a domain where proxy is enforced.

Enterprises that enforce proxying, are also likely to block requests that are not dispatched per policy. For this reason, it doesn't make sense to dispatch an HTTP request with the purpose of evaluating the proxy, as the proxy URL is more likely than not to be needed to dispatch such request.

For this reason, it should be considered that the base case for the library is that TLDExtract will not be able to dispatch this request and that it will fallback to the file with the TLDs.

Hence this library should:

  • not use TLD extract
  • already include a copy of the TLD file that TLDExtract is fetching and direct the cache_dir to the containing directory
  • Contain a function for setting the cache_dir. One can just set an env var for this, but for the sake of convenience and to provide a point where this is guaranteed to be set before using(basically initializing the lib).
  • Make it explicit in the documentation that TLDEXTRACT_CACHE, XDG_CACHE_HOME, HOME, or python-tldextract environment variables will be used to evaluate the cache dir, and falling back to the directory containing the TLDExtract package itself

Some of the options used by TLDExtract are not bad at all, however, they are not able to accommodate all cases. For example, within a pyinstaller executable, in which the package directory itself will be the location where the executable is located. In such cases where the application is being distributed on scale, the application may choose to contain a specific directory for such uses. Thus, applying one of the recommendations would be meaningful, and avoid the implementer a deep-dive into foreign code.

@KarelChanivecky
Copy link
Contributor Author

I have resolved the issue in my project by overriding the TLDExtract DiskCache class with my own cache implementation. My implementation reads from a user defined file and keeps the data. It also allows for setting the file contents. I also added a feature to check if the file contents have changed every certain time interval.

In my project's case, this will allow us to specify the filename where we want the TLD data stored. The stored data can now be accessed with a privileged writer, unprivileged reader pattern, as the readers will never try to write to the file. Fetching and maintaining the data can then be outsourced to a different module.

I will try to contribute to tldextract with such an implementation, then I can try to contribute here based off of that.

@carsonyl
Copy link
Owner

Does this relate to #64? The intent was for PyPAC to use tldextract such that it never goes online and never updates its TLD list.

@KarelChanivecky
Copy link
Contributor Author

We got a crash in our project because we are packaging everything with pyinstaller and the directory structure is different. Hence, why we had to come up with a hack.
Just now I realize there was no actual request dispatched although the stack trace suggests it. There may be some other scenarios when this comes into play.

@carsonyl carsonyl closed this as completed Aug 5, 2023
@Guts
Copy link

Guts commented Oct 4, 2024

We got a crash in our project because we are packaging everything with pyinstaller and the directory structure is different. Hence, why we had to come up with a hack. Just now I realize there was no actual request dispatched although the stack trace suggests it. There may be some other scenarios when this comes into play.

Hi @KarelChanivecky, we're facing the exact same issue but within a packaged application with PyInstaller. Could you share your workaround/hack please?

@KarelChanivecky
Copy link
Contributor Author

@Guts First try setting TLDEXTRACT_CACHE env var to a dir with a list of top level domains. before any pypac imports. In that dir you should have a json file with a list of top level domains. You can find reputable lists online.

If that did not work, you can always create your own cache implementation and set it to:

from pypac.wpad import no_fetch_extract
no_fetch_extract.suffix_list_urls = ()
no_fetch_extract._cache = MyCacheImplementation()

You cache implementation should match the interface of DiskCache defined here:
https://github.com/john-kurkowski/tldextract/blob/master/tldextract/cache.py

@Guts
Copy link

Guts commented Oct 4, 2024

@KarelChanivecky thanks for your quick reply and hint.

Actually, I'm trying something else: embedding tld files into the final package and set the path using the env var.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants