Put spaCy data in a shared path #868
Comments
Hi, I do understand the pain on this, because it's sometimes inconvenient to have all my development copies of spaCy. I know it's becoming common for libraries to drop files in the home directory now, but I don't think it's a great pattern. I think it makes it much harder to reason about how the file-system state affects what's being executed. I use virtualenv etc. because I want to isolate my projects from each other, so I'm often unimpressed when libraries go behind my back to set up shared state. Some suggestions for your use case: if you're happy to have another install step, you could make a command that replaces the data directory with a symbolic link to your shared location. If you can't find a way to make this nice, or you want a really "1 click" procedure that just uses pip, you could make a library that does this, and put it on PyPI. I hadn't thought of this before, but I think it might be useful to others too. I could see myself using it in some situations, for instance. Matt
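The symlink suggestion above could be sketched roughly as follows. This is only an illustration, not part of spaCy: `link_data_dir` and both path arguments are hypothetical names, and the sketch assumes a platform where the current user is allowed to create symbolic links.

```python
import os
import shutil


def link_data_dir(package_data_dir, shared_dir):
    """Replace package_data_dir with a symlink that points at shared_dir.

    Hypothetical helper: neither the name nor the arguments come from spaCy.
    """
    if not os.path.isdir(shared_dir):
        raise ValueError("shared directory does not exist: %s" % shared_dir)
    if os.path.islink(package_data_dir):
        # A link from an earlier run: remove only the link, not its target.
        os.unlink(package_data_dir)
    elif os.path.isdir(package_data_dir):
        # A real per-install copy of the data: delete it to free the space.
        shutil.rmtree(package_data_dir)
    os.symlink(shared_dir, package_data_dir, target_is_directory=True)
    return package_data_dir
```

Note that this deletes any existing per-install data directory, so it would only be safe to run once the shared copy is known to be complete.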
I'm actually going to be creating a I completely agree though that the method of hiding data in a
That starts to introduce permissions problems --- most systems don't give user accounts write access to those directories. And then if you require If you're creating the
The problem with the symlink is that I don't necessarily know where a user will have anaconda installed nor would I know what their environment is called. If they create a new environment after the data package is installed, they'd have to download the data to their environment because the symlink wouldn't exist there. I don't really see any issue with permissions; just have it use the first directory that exists and is writable. If a user runs I guess the main thing I'd want is automatic checking in
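The "first directory that exists and is writable" rule proposed above might look something like this. It is a minimal sketch: `first_writable_dir` is a hypothetical name and the candidate list is left to the caller; none of it is spaCy API.

```python
import os


def first_writable_dir(candidates):
    """Return the first candidate that is an existing, writable directory.

    Returns None when no candidate qualifies, so the caller can decide
    whether to fall back to a per-user location or raise an error.
    """
    for path in candidates:
        if os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    return None
```

`os.access` only reports what the current user may do at the time of the call, so a later write can still fail; a real implementation would need to handle that.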
I guess this comes down to a matter of taste. I find that sort of behaviour really unappealing. I'm not sure what the best solution is for you, given all your constraints. But I do think it'll be quite easy to point spaCy to save and load data to some path by default. You'll just need to decide which path to use.
The XDG standard has a directory for data (default: ~/.local/share/application-name). This is used by many apps including pip.
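For reference, resolving that XDG data directory can be sketched as below. `xdg_data_dir` and its `environ` parameter are hypothetical names chosen for illustration; the fallback to `~/.local/share` and the rule that relative values are ignored follow the XDG Base Directory Specification.

```python
import os


def xdg_data_dir(app_name, environ=None):
    """Return the per-user XDG data directory for app_name.

    environ defaults to os.environ; passing a dict makes the function
    easy to try out without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    base = environ.get("XDG_DATA_HOME")
    # Unset, empty, or relative values fall back to ~/.local/share.
    if not base or not os.path.isabs(base):
        base = os.path.join(os.path.expanduser("~"), ".local", "share")
    return os.path.join(base, app_name)
```

This directory is per user, so on its own it would not remove the per-user download that the original request is trying to avoid.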
Just to give you a heads-up – this will be fixed in v1.7! You'll be able to store your data wherever you want, and download and install models directly, or using the new We're just in the process of reuploading all the models (taking a bit longer than expected, because we've trained new models and decided to provide different options, i.e. with GloVe vectors and without). But as soon as they're up, we'll push the new release and docs 🎉
@ines: will there be a specific set of search paths built-in that wouldn't require the models to be manually loaded or linked?
The data path So
The new models are:
Update: Found a better solution to the way symbols are added to the vocab, so that vocabularies remain compatible across spaCy versions. This means the current models can still be used with the new code. We're also releasing a new smaller English model with vectors (~50MB, 2% less accurate than larger model). New larger models will then follow in v2.0.
Just pushed v1.7.0! 🎉
Would it be possible to have spaCy data work similarly to NLTK_data, where it goes to a shared path, i.e., `C:\nltk_data` for Windows, `/usr/local/share/nltk_data` for macOS, or `/usr/share/nltk_data` for Unix (obviously substituting `spacy_data` for `nltk_data`)?

I understand that I can have it download to a custom location, but it would be nice to have it look for the data automatically rather than having to call `spacy.util.set_data_path()` before calling `spacy.load()`, or pass a `path` argument to `spacy.en.English`.

My use case for this is deploying it in computer labs, where it'd be preferable for me to be able to package and deploy the data without each user having to download it individually. Especially in cases where each user has an `~/anaconda` folder, since the data downloads to `~/anaconda/lib/python3.5/site-packages/spacy/en/data` for each user. It'd be (selfishly) easier for a user to be able to use spaCy without me telling them where the data is and without them filling up the HD.

If there's a reason that it's done the way it currently is, that's fine :)
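To make the request concrete, the platform-dependent default described above could be sketched like this. The function name `default_shared_data_path` is hypothetical, and the three paths are simply the NLTK-style locations from this issue with `spacy_data` substituted; nothing here is something spaCy actually ships.

```python
import sys


def default_shared_data_path(platform=None):
    """Return the proposed shared data path for the given platform string.

    platform defaults to sys.platform ("win32", "darwin", "linux", ...).
    """
    platform = sys.platform if platform is None else platform
    if platform.startswith("win"):
        return "C:\\spacy_data"
    if platform == "darwin":
        return "/usr/local/share/spacy_data"
    # Every other platform is treated as generic Unix in this sketch.
    return "/usr/share/spacy_data"
```

With the current release, a deployment script could pass the result to `spacy.util.set_data_path()` before calling `spacy.load()`, which is the manual step this issue asks to make automatic.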