Make spaCy-nlp functions faster #65

Open
jbesomi opened this issue Jul 11, 2020 · 4 comments
Labels: enhancement, help wanted

Comments

@jbesomi
Owner

jbesomi commented Jul 11, 2020

Almost all functions in the nlp module use spaCy under the hood.

In general, spaCy is quite fast as it uses Cython.

The core code looks like this:

new_data = []
for row in nlp.pipe(s.values, batch_size=32):
    new_data.append( ... row ...)

spaCy's pipe was initially chosen because it is multi-threaded. An alternative might be to use pandas apply (which is probably slower).

The pipe function accepts, among others, the n_threads and batch_size arguments. Tuning these values might be very important.
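To make the tuning concrete, here is a minimal sketch of the core loop with the pipe arguments passed through as keyword arguments. The helper name `pipe_rows` and the `extract` callback are hypothetical and not part of the current code; `nlp` is assumed to be any object with a spaCy-style `pipe` method. Depending on the spaCy version, `n_threads` may be deprecated in favour of `n_process`, so which keyword is accepted should be checked against the installed release.

```python
from typing import Any, Callable, Iterable, List


def pipe_rows(
    nlp: Any,
    texts: Iterable[str],
    extract: Callable[[Any], Any],
    batch_size: int = 32,
    **pipe_kwargs: Any,
) -> List[Any]:
    """Run texts through nlp.pipe and collect extract(doc) for every doc.

    Hypothetical helper: `nlp` is assumed to expose a spaCy-style
    pipe(texts, batch_size=..., **kwargs) method yielding one doc per text.
    """
    new_data = []
    # Extra keyword arguments (e.g. n_threads or n_process, depending on the
    # spaCy version) are forwarded unchanged so the caller can tune them.
    for doc in nlp.pipe(texts, batch_size=batch_size, **pipe_kwargs):
        new_data.append(extract(doc))
    return new_data
```

With a pandas Series `s`, this could be called as `pipe_rows(nlp, s.values, extract, batch_size=64)`, and the returned list wrapped back into a Series using `s.index`.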

This task consists of:

  1. Understand spaCy pipe
  2. Test different combinations of n_threads and batch_size values on a large dataset
  3. (If it makes sense) Compare these results with the pandas apply approach
  4. Pick the best solution and implement it in all NLP functions that use spaCy under the hood
  • We might find that the optimal values of n_threads and batch_size are not always the same. In that case, we will need to add them as arguments to the NLP functions and update the docstrings.
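Steps 2 and 3 could be approached with a small timing harness like the sketch below. It uses only the standard library and takes the pipe function as an argument, so nothing here is specific to spaCy. The name `benchmark_pipe` and its parameters are assumptions made for illustration, not existing code.

```python
import time
from itertools import product
from typing import Any, Callable, Dict, Iterable, List, Tuple


def benchmark_pipe(
    pipe_fn: Callable[..., Iterable],
    texts: List[str],
    param_grid: Dict[str, List[Any]],
    repeats: int = 3,
) -> List[Tuple[Dict[str, Any], float]]:
    """Time pipe_fn(texts, **params) for every combination in param_grid.

    Returns (params, best_seconds) pairs sorted from fastest to slowest.
    """
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            # Pipe-style functions are usually lazy generators, so the
            # output has to be consumed for the work to actually happen.
            for _doc in pipe_fn(texts, **params):
                pass
            timings.append(time.perf_counter() - start)
        # Keep the best run to reduce noise from other processes.
        results.append((params, min(timings)))
    return sorted(results, key=lambda item: item[1])
```

For example, passing `nlp.pipe` as `pipe_fn` with a grid such as `{"batch_size": [32, 128, 512]}` would time each batch size on the same texts. An apply-style baseline can be timed the same way by passing a function that calls `nlp` on each text one by one, together with an empty grid.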

Useful resources:

Turbo-charge your spaCy NLP pipeline

@jbesomi added the enhancement and help wanted labels Jul 11, 2020
@mk2510
Collaborator

mk2510 commented Aug 7, 2020

Hi Jonny,

We will now start on this issue and implement multiprocessing wherever it is useful. 🚀

@jbesomi
Owner Author

jbesomi commented Aug 7, 2020

Hi Max,

I guess this task might take quite a long time. What if we prioritize completely finishing part 2 of the "API next checklist" and then move on to part 4?

@henrifroese
Collaborator

Will do 🥈 :neckbeard:

@jbesomi jbesomi mentioned this issue Aug 13, 2020
@jbesomi
Owner Author

jbesomi commented Aug 19, 2020

Dask vs. spaCy

Is it faster to use pipe from spaCy or to use Dask (Dask DataFrame) directly?

Dask might be the solution we were looking for ...
