None of the cache directory is writable #61

ishanbhagawati · 2020-02-06T08:51:21Z

We are using URLExtract in one of our python projects. The python script works fine locally but when uploaded as AWS lambda the script fails as none of the cache directories is writable.

Ideally, there should have been a way to provide the cache directory path as a URL parameter in the URLExtract constructor itself and the default should be whatever is currently

lipoja · 2020-02-06T08:58:23Z

Hi, thanks for reporting this issue. OK there is option for setting up this directory when you are using the class it self within your python code. I missed this and forget to add this feature as parameter to the urlextract command.

Thanks!

ishanbhagawati · 2020-02-06T09:21:14Z

@lipoja do you mind sharing how to do it in python code as I saw in code URLExtract takes a CacheFile type argument

lipoja · 2020-02-06T09:27:32Z

It can be set during object initialization: extractor = URLExtract(cache_dir='/path/to/cache')

B16f00t · 2020-02-18T20:36:03Z

I am developing a tool, I use pyinstaller, with the noconsole option and it gives cache problems.
I use "extractor = URLExtract(cache_dir=''D:\temp')"
but it seems like no such path is established

lipoja · 2020-02-20T14:53:57Z

@B16f00t Does it crash? Is there any error message shown? Or it just does not work at all?

B16f00t · 2020-02-21T18:07:56Z

My code:

    from urlextract import URLExtract
    try:
             extractor = URLExtract(cache_dir="D:\\")
             urls = extractor.find_urls(text)
             print(urls)
    except Exception as e:
        print(e)

ERROR:
Default cache file does not exist 'C:\Users\B16f00t\AppData\Local\Temp_MEI102122\urlextract\data\tlds-alpha-by-domain.txt'!

lipoja · 2020-03-24T21:21:42Z

@B16f00t Turn on logging and see if there are any other messages. It looks like the path "D:\" does not exist, or is not writable for the user.

Let me know if it logs some other errors (or info messages), thanks.

beanbean0113 · 2020-07-14T02:50:57Z

Hi @lipoja, I have the same issue. The script when uploaded as AWS lambda, the script fails.
I init the UrlExtract with a cache_dir. However, we get the error log "None of cache directories is writable."

It seems we set the "cache_dir" in the initialization step. But the "cache_dir" wasn't given in _get_cache_lock_file_path().
For example,

def _load_cached_tlds(self):
"""
Loads TLDs from cached file to set.
:return: Set of current TLDs
:rtype: set
"""

        # check if cached file is readable
        if not os.access(self._tld_list_path, os.R_OK):
            self._logger.error("Cached file is not readable for current "
                               "user. ({})".format(self._tld_list_path))
            raise CacheFileError(
                "Cached file is not readable for current user."
            )

        set_of_tlds = set()

        with filelock.FileLock(self._get_cache_lock_file_path()):

with filelock.FileLock(self._get_cache_lock_file_path()): should be with filelock.FileLock(self._get_cache_lock_file_path(cache_dir)):.

line 204 may has the same problem as well with filelock.FileLock(self._get_cache_lock_file_path()):.

Thank you.

RickMeasham · 2021-03-09T05:05:51Z

Hi @lipoja, is there a fix for this yet? There's a fix_cache branch that appears to be aimed at fixing this issue.

Below are my observations before I saw that branch ...

There appear to be two intermingled issues here:

Setting cache_dir is only used for the cache file and not the lock file
When using the default list provided with the distribution, we still attempt to lock the file. And because the cache_dir isn't used for the lock file, we cannot get that lock

I was going to suggest that we shouldn't bother getting a lock on the distributed file, but I see that you allow the module to update the distributed file. I feel like that's an anti-pattern and we should have a cache file somewhere in user-space (or even /tmp if we can't write somewhere more permanent). The default file should be a read-only failover.

Unless that is changed, then the only solution is to create a lock file somewhere we are (reasonably) always allowed to write. Given it's a lock file and thus very ephemeral we could just write that to /tmp or the OS equivalent. Or we could allow the user to specify a lock_file in the constructor.

RickMeasham · 2021-03-09T05:18:51Z

Case in point why we should not be modifying a distributed file:

  Switching clone ./src/urlextract to git://github.com/rickmeasham/URLExtract.git (to revision master)
  Running command git config remote.origin.url git://github.com/rickmeasham/URLExtract.git
  Running command git checkout -q master
  error: Your local changes to the following files would be overwritten by checkout:
        urlextract/data/tlds-alpha-by-domain.txt
  Please commit your changes or stash them before you switch branches.

RickMeasham · 2021-03-09T05:58:33Z

I've grabbed the fix_cache branch, but when the cache file is not available in the cache_dir, calling update() attempts to update the default file. To fix this for now, I'm generating an empty file before initialising the extractor, then calling update if the file is empty. This hack works even when the module is in a Lambda layer.

    cache_file = Path('/tmp/tlds-alpha-by-domain.txt')
    if not cache_file.exists():
        cache_file.touch()

    extractor = URLExtract(cache_dir='/tmp')

    if cache_file.stat().st_size == 0:
        extractor.update()

paulfdietrich · 2021-05-09T23:24:38Z

Any update on this bug? Would love the use this in AWS. Are folks just using a package from the fix_cache branch with Ricks work-around? Is there a way to PIP install that branch?

RickMeasham · 2021-05-10T00:44:24Z

@paulfdietrich said:

Is there a way to PIP install that branch?

You can use this, but I'm not going to be tracking any changes to the official repo. Caveat emptor.

pip install https://github.com/rickmeasham/URLExtract/archive/master.zip#egg=urlextract

lipoja · 2021-05-10T20:45:53Z

Hi, thank you both for your patience and also for time spend on reporting this issue.
I've modified the code in fix_cache branch, could you check if it is working for you? It is getting late and do not have "the non working" environment handy.

I kept the solution that was in the fix_cache branch and I've tried to fix the issue with non existing file.

Thank you!

lipoja · 2021-05-10T20:48:22Z

Case in point why we should not be modifying a distributed file:

  Switching clone ./src/urlextract to git://github.com/rickmeasham/URLExtract.git (to revision master)
  Running command git config remote.origin.url git://github.com/rickmeasham/URLExtract.git
  Running command git checkout -q master
  error: Your local changes to the following files would be overwritten by checkout:
        urlextract/data/tlds-alpha-by-domain.txt
  Please commit your changes or stash them before you switch branches.

I think this can be solved by adding tlds-alpha-by-domain.txt to .gitignore. I did that already so it should be fine once your .gitignore is updated from latest master branch.

lipoja · 2021-06-12T21:00:27Z

This fix is released on pypi. If somebody has a chance to run it on aws it would be graet.
Thanks

… used in lambda, so change the logic to regular expressions Please refer lipoja/URLExtract#61

konono pushed a commit to konono/holoscope that referenced this issue Feb 8, 2022

[Mod] URLextract requires a cache file to be created, which cannot be…

4f68f9b

… used in lambda, so change the logic to regular expressions Please refer lipoja/URLExtract#61

xhagrg mentioned this issue Apr 9, 2024

Refactor URL extraction logic in url_validator.py NASA-IMPACT/pyQuARC#278

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

None of the cache directory is writable #61

None of the cache directory is writable #61

ishanbhagawati commented Feb 6, 2020

lipoja commented Feb 6, 2020

ishanbhagawati commented Feb 6, 2020

lipoja commented Feb 6, 2020

B16f00t commented Feb 18, 2020

lipoja commented Feb 20, 2020

B16f00t commented Feb 21, 2020 •

edited

Loading

lipoja commented Mar 24, 2020

beanbean0113 commented Jul 14, 2020

RickMeasham commented Mar 9, 2021

RickMeasham commented Mar 9, 2021

RickMeasham commented Mar 9, 2021

paulfdietrich commented May 9, 2021

RickMeasham commented May 10, 2021

lipoja commented May 10, 2021

lipoja commented May 10, 2021

lipoja commented Jun 12, 2021

None of the cache directory is writable #61

None of the cache directory is writable #61

Comments

ishanbhagawati commented Feb 6, 2020

lipoja commented Feb 6, 2020

ishanbhagawati commented Feb 6, 2020

lipoja commented Feb 6, 2020

B16f00t commented Feb 18, 2020

lipoja commented Feb 20, 2020

B16f00t commented Feb 21, 2020 • edited Loading

lipoja commented Mar 24, 2020

beanbean0113 commented Jul 14, 2020

RickMeasham commented Mar 9, 2021

RickMeasham commented Mar 9, 2021

RickMeasham commented Mar 9, 2021

paulfdietrich commented May 9, 2021

RickMeasham commented May 10, 2021

lipoja commented May 10, 2021

lipoja commented May 10, 2021

lipoja commented Jun 12, 2021

B16f00t commented Feb 21, 2020 •

edited

Loading