Make spaCy-nlp functions faster #65

jbesomi · 2020-07-11T09:02:57Z


Almost all functions of the nlp module use spaCy under the hood.

In general, spaCy is quite fast as it uses Cython.

The core code looks like this:

new_data = []
for row in nlp.pipe(s.values, batch_size=32):
    new_data.append( ... row ...)

spaCy's pipe was initially chosen because it is multi-threaded. An alternative might be to use apply (which is probably slower).
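To make the two options concrete, here is a minimal sketch of both call patterns. It assumes `nlp` is a loaded spaCy `Language` object (for example `spacy.blank("en")`). The function names `with_pipe` and `with_apply` and the `extract` callback are hypothetical, and the per-row loop only stands in for what `s.apply(nlp)` would do on a pandas Series.

```python
def with_pipe(nlp, texts, extract, batch_size=32):
    """Process all texts through nlp.pipe, which batches them internally."""
    return [extract(doc) for doc in nlp.pipe(texts, batch_size=batch_size)]


def with_apply(nlp, texts, extract):
    """Call nlp once per text, mimicking a row-by-row s.apply(nlp)."""
    return [extract(nlp(text)) for text in texts]
```

Both functions should return the same values for the same input, so they can be timed against each other on the same dataset.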

Among other arguments, pipe accepts n_threads and batch_size. Tuning these values might be very important.

This task consists of:

1. Understand spaCy pipe
2. Test different combinations of n_threads and batch_size values on a large dataset
3. (If it makes sense) Compare these results with the pandas apply approach
4. Pick the best solution and implement it in all NLP functions that use spaCy under the hood
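The testing step could be sketched roughly as below. This is only a starting point under some assumptions: as far as I know, n_threads has been deprecated since spaCy 2.1 and newer versions expose n_process instead, so the sketch passes `n_process`. The helper names `benchmark_pipe` and `best_setting` are hypothetical, and `nlp` is expected to be a loaded spaCy `Language` object.

```python
import time
from itertools import product


def benchmark_pipe(nlp, texts, batch_sizes, n_processes=(1,)):
    """Return {(batch_size, n_process): seconds} for one pass over texts."""
    texts = list(texts)  # materialize once so every run sees the same input
    timings = {}
    for batch_size, n_process in product(batch_sizes, n_processes):
        start = time.perf_counter()
        # nlp.pipe is lazy: consume the generator so the work really happens.
        for _ in nlp.pipe(texts, batch_size=batch_size, n_process=n_process):
            pass
        timings[(batch_size, n_process)] = time.perf_counter() - start
    return timings


def best_setting(timings):
    """Return the (batch_size, n_process) pair with the lowest time."""
    return min(timings, key=timings.get)
```

A call might look like `benchmark_pipe(nlp, s.values, batch_sizes=[32, 128, 512], n_processes=[1, 2, 4])`. A single pass is noisy, so repeating each combination a few times and keeping the median would probably give more reliable numbers.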

We might find that the optimal values of n_threads and batch_size are not always the same. In that case, we will need to add them as arguments to the NLP functions and update the docstrings.
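If the values do end up being exposed, a function signature and docstring could look something like the sketch below. The function name `tokens_with_spacy` is hypothetical and not part of the current API. It takes the `nlp` object as a parameter to stay self-contained, uses `n_process` on the assumption of a recent spaCy version, and returns a plain list where the real function would presumably wrap the result in a pandas Series.

```python
def tokens_with_spacy(texts, nlp, batch_size=32, n_process=1):
    """
    Return the token texts of every document, computed with spaCy.

    Parameters
    ----------
    texts : iterable of str
        The documents to process, for example the values of a pandas Series.
    nlp : spacy.language.Language
        A loaded spaCy pipeline.
    batch_size : int, default 32
        Number of texts spaCy buffers per batch.
    n_process : int, default 1
        Number of worker processes used by nlp.pipe.

    Returns
    -------
    list of list of str
        One list of token texts per input document, in input order.
    """
    docs = nlp.pipe(texts, batch_size=batch_size, n_process=n_process)
    # Hypothetical: the real function would wrap this in a Series
    # that reuses the index of the input.
    return [[token.text for token in doc] for doc in docs]
```

Keeping the defaults equal to the values used today (batch_size=32, a single process) would leave existing calls unchanged.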

Useful resources:

Turbo-charge your spaCy NLP pipeline


mk2510 · 2020-08-07T16:55:42Z

Hi Jonny,

we will now start with this issue and implement multiprocessing wherever it is useful. 🚀

jbesomi · 2020-08-07T16:59:12Z

Hi Max,

I guess this task might take quite a long time. What if we prioritize completely finishing part 2 of the "API next checklist" and then move on to part 4?

henrifroese · 2020-08-07T18:36:39Z

Will do 🥈

jbesomi · 2020-08-19T16:36:49Z

Dask vs. spaCy

Is it faster to use pipe from spaCy or to use Dask (Dask DataFrame) directly?

Dask might be the solution we were looking for ...

jbesomi added the enhancement and help wanted labels Jul 11, 2020

This was referenced Jul 14, 2020

👩‍💻 API next steps: checklist #85

Open

count(s) and term_frequency(s) #92

Merged

Added the function to POS tag #106

Merged

jbesomi mentioned this issue Aug 13, 2020

Add POS tagging #36

Closed
