No more weird caching issues. #2

Merged: 6 commits, Oct 26, 2018

29 changes: 28 additions & 1 deletion README.md
@@ -1,6 +1,33 @@
# tldextract

## Python Module [![PyPI version](https://badge.fury.io/py/tldextract.svg)](https://badge.fury.io/py/tldextract) [![Build Status](https://travis-ci.org/john-kurkowski/tldextract.svg?branch=master)](https://travis-ci.org/john-kurkowski/tldextract)
## CircleUp Fork

This fork fixes an issue with how the original version caches the Public Suffix List (see [PR](https://github.com/john-kurkowski/tldextract/pull/144)).

The upstream version caches one particular version of the list (either with or without private suffixes),
depending on the settings you use the first time you call it. All subsequent calls return results from that originally cached list,
so any call with settings different from the original silently returns incorrect results.
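
To illustrate the failure mode described above, here is a minimal sketch against the upstream API, where `include_psl_private_domains` was a constructor argument; the domain and the expected results are illustrative examples, not taken from this PR.

```python
import tldextract

# First use: an extractor that includes PSL private domains. Upstream fetches
# the private-domain variant of the list and writes it to the shared cache.
extract_private = tldextract.TLDExtract(include_psl_private_domains=True)
print(extract_private('http://waiterrant.blogspot.com'))
# e.g. ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com')

# Second use: an extractor configured *without* private domains still reads
# the previously cached private-domain list, so it silently returns the
# private-suffix answer instead of suffix='com'.
extract_public = tldextract.TLDExtract(include_psl_private_domains=False)
print(extract_public('http://waiterrant.blogspot.com'))
```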

This fork caches the entire list and filters at query time based on the specified arguments.

This change is not backwards compatible.

**Changes**
- Moves `include_psl_private_domains` to the `__call__` method, so it is now chosen on a per-call basis.
- Caches the entire dataset from publicsuffix.org.
- Adds a `source` attribute to the result named tuple, which tells you which suffix list the URL was matched against.
- Avoids cache collisions between different `suffix_list_urls` settings by using a different cache filename per `suffix_list_urls` value (see the sketch after this list).
- Deletes the bundled snapshot.
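
The per-`suffix_list_urls` cache filenames could, for example, be derived by hashing the configured URLs. The helper below is a hypothetical sketch of that idea, not this fork's actual implementation (its key scheme lives in `tldextract/utils.py` and is only partially shown in the diff further down).

```python
from hashlib import md5

def cache_filename(cache_dir, suffix_list_urls):
    # Hypothetical scheme: hash the configured suffix-list URLs so that
    # different suffix_list_urls settings never share a cache file.
    key = md5('|'.join(sorted(suffix_list_urls)).encode('utf-8')).hexdigest()
    return cache_dir.rstrip('/') + '/' + key + '.json'

print(cache_filename('/tmp/tldextract', ('https://publicsuffix.org/list/public_suffix_list.dat',)))
```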

**How to use**
```python
import tldextract
tldextract.extract('http://forums.news.cnn.com/', include_psl_private_domains=True)
```
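
To make the per-call flag and the new `source` attribute concrete, here is a hedged sketch; the attribute's exact values are assumptions, since the notes above only say it names the suffix list that was matched.

```python
import tldextract

# Private suffixes are a per-call decision in this fork.
private = tldextract.extract('http://foo.blogspot.com', include_psl_private_domains=True)
public = tldextract.extract('http://foo.blogspot.com', include_psl_private_domains=False)

# `source` reports which suffix list the URL matched; the values shown in the
# comments are illustrative guesses, not confirmed by this PR.
print(private.suffix, private.source)  # e.g. 'blogspot.com' 'private'
print(public.suffix, public.source)    # e.g. 'com' 'icann'
```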

The documentation below is unchanged from upstream.

## Python Module

`tldextract` accurately separates the gTLD or ccTLD (generic or country code
top-level domain) from the registered domain and subdomains of a URL. For
4 changes: 4 additions & 0 deletions pypi_deploy.sh
@@ -0,0 +1,4 @@
devpi use http://pypi.cu
devpi login root --password=
devpi use root/circleup
devpi upload
4 changes: 2 additions & 2 deletions setup.py
@@ -34,13 +34,13 @@
 LONG_DESCRIPTION_MD = __doc__
 LONG_DESCRIPTION = re.sub(r'(?s)\[(.*?)\]\((http.*?)\)', r'\1', LONG_DESCRIPTION_MD)
 
-INSTALL_REQUIRES = ["setuptools", "idna", "requests>=2.1.0", "requests-file>=1.4"]
+INSTALL_REQUIRES = ["setuptools", "idna", "requests>=2.1.0", "requests-file>=1.4", 'filelock>=3.0.8']
 if (2, 7) > sys.version_info:
     INSTALL_REQUIRES.append("argparse>=1.2.1")
 
 setup(
     name="tldextract",
-    version="2.2.0",
+    version="3.0.1.circleup",
     author="John Kurkowski",
     author_email="john.kurkowski@gmail.com",
     description=("Accurately separate the TLD from the registered domain and "
19 changes: 12 additions & 7 deletions tldextract/utils.py
@@ -4,6 +4,7 @@
 import os.path
 from hashlib import md5
 
+from filelock import FileLock
 from six import wraps


@@ -22,15 +23,19 @@ def return_cache(*args, **kwargs):

         cache_path = path + '/' + key + '.json'
         cache_path = cache_path.replace('//', '/')
+        lock_path = cache_path + '.lock'
 
-        if not os.path.isfile(cache_path):
-            result = func(*args, **kwargs)
-            make_dir(cache_path)
-            with open(cache_path, 'w') as cache_file:
-                json.dump(result, cache_file)
+        make_dir(cache_path)
 
-        with open(cache_path) as cache_file:
-            return json.load(cache_file)
+        # without locking, concurrency could lead to multiple writers, or readers of a partial write
+        with FileLock(lock_path, timeout=20):
+            if not os.path.isfile(cache_path):
+                result = func(*args, **kwargs)
+                with open(cache_path, 'w') as cache_file:
+                    json.dump(result, cache_file)
+
+            with open(cache_path) as cache_file:
+                return json.load(cache_file)
 
     return return_cache
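
For context, the locking change above boils down to a read-through JSON cache guarded by a file lock. Below is a simplified, standalone sketch of that pattern using the `filelock` dependency the PR adds; the function name and example data are illustrative, not the fork's actual helpers.

```python
import json
import os

from filelock import FileLock


def cached_fetch(cache_path, fetch):
    """Return cached JSON, computing and writing it once under a file lock.

    The lock keeps two processes from writing the cache at the same time and
    keeps readers from ever seeing a partially written file.
    """
    os.makedirs(os.path.dirname(cache_path) or '.', exist_ok=True)
    lock_path = cache_path + '.lock'

    with FileLock(lock_path, timeout=20):
        if not os.path.isfile(cache_path):
            with open(cache_path, 'w') as cache_file:
                json.dump(fetch(), cache_file)
        with open(cache_path) as cache_file:
            return json.load(cache_file)


# Example: cache a (pretend) expensive fetch of the suffix list.
print(cached_fetch('/tmp/tldextract-demo/suffixes.json',
                   lambda: {'icann': ['com'], 'private': ['blogspot.com']}))
```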
