Speed-Up Preprocessing + NLP #162

Open · wants to merge 26 commits into master
Conversation

@henrifroese (Collaborator) commented Aug 23, 2020

We spent a lot of time looking at every possible option to speed up the library. Main findings:

  1. str.replace and s.apply(re.sub) are equally fast (see Speed up preprocessing module #124) -> no need to change this
  2. dask requires users to switch to different data structures themselves, which adds significant complexity for users. We believe only a few users with datasets that do not fit into RAM would profit from this, so it does not make sense to change all of texthero for it.
  3. modin as a drop-in replacement for pandas performed very badly for small to medium-sized DataFrames
  4. pandarallel did not work for pd.apply calls inside the library
  5. Implementing multiprocessing through a decorator does not work in Python, mainly because decorated functions cannot be pickled; see any result when googling "python multiprocessing decorator"

The only way we see to parallelize things while keeping a great user experience and not slowing down users with small and medium-sized datasets is the following:

Implement a function helper.parallel that handles parallelization for Pandas Series and use that to wrap our functions. Example:

# Example old approach:
@InputSeries(TextSeries)
def remove_round_brackets(s: TextSeries) -> TextSeries:
    """
    some nice docstring
    """
    return s.str.replace(r"\([^()]*\)", "")     

# Example new approach: (same for user from the outside, 
# a little more complicated (but fully parallelized!) on the inside)

def _remove_round_brackets(s: TextSeries) -> TextSeries:
    return s.str.replace(r"\([^()]*\)", "")


@InputSeries(TextSeries)
def remove_round_brackets(s: TextSeries) -> TextSeries:
    """
    some nice docstring
    """
    return parallel(s, _remove_round_brackets)

So we keep all functionality that is not parallelizable (like initializing patterns, downloading a spaCy model, ...) in a function f with a nice docstring that the user can see. At the end, this function calls helper.parallel(s, _f, other arguments), where _f houses all the parallelizable functionality.
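For a function that also has non-parallelizable setup, the split looks roughly like the following sketch. This is illustrative only: remove_stopwords and the default stopword set here are hypothetical stand-ins, not the actual texthero code.

import pandas as pd

# One-time, non-parallelizable setup: build the lookup set once.
_DEFAULT_STOPWORDS = {"a", "an", "and", "in", "of", "or", "the", "to"}


def _remove_stopwords(s: pd.Series, words) -> pd.Series:
    # Purely row-wise work, so it can safely run on each chunk independently.
    return s.apply(
        lambda text: " ".join(w for w in text.split() if w not in words)
    )


def remove_stopwords(s: pd.Series, words=None) -> pd.Series:
    """
    Remove stopwords from each document in the Series.
    """
    # Non-parallelizable part: resolve the default argument once.
    if words is None:
        words = _DEFAULT_STOPWORDS
    # All parallelizable work is delegated to helper.parallel.
    return parallel(s, _remove_stopwords, words)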

Here's the helper.parallel implementation; it should be rather self-explanatory:

import functools
import multiprocessing as mp

import numpy as np
import pandas as pd

cores = mp.cpu_count()
partitions = cores

# Min. number of lines in a Series to parallelize -> no slowdown for small
# datasets due to parallelization overhead.
MIN_LINES_FOR_PARALLELIZATION = 10000
PARALLELIZE = True  # Allows users to fully turn off parallelization for all dataset sizes.


def _run_on_chunk(chunk, func, args, kwargs):
    # Module-level so it can be pickled and sent to worker processes;
    # puts the chunk back in front of any extra arguments.
    return func(chunk, *args, **kwargs)


def parallel(s, func, *args, **kwargs):

    if len(s) < MIN_LINES_FOR_PARALLELIZATION or not PARALLELIZE:
        # Execute as usual.
        return func(s, *args, **kwargs)

    else:
        # Execute in parallel.

        # Split the data up into batches.
        s_split = np.array_split(s, partitions)

        # Open process pool.
        pool = mp.Pool(cores)
        # Execute in parallel and concat results (order is kept).
        s_result = pd.concat(
            pool.map(
                functools.partial(_run_on_chunk, func=func, args=args, kwargs=kwargs),
                s_split,
            )
        )

        pool.close()
        pool.join()

        return s_result
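A quick usage sketch (illustrative only) showing both execution paths; note that on platforms that spawn worker processes (e.g. Windows, macOS), multiprocessing code has to run under an if __name__ == "__main__": guard.

import pandas as pd


def _upper(s: pd.Series) -> pd.Series:
    return s.str.upper()


small = pd.Series(["hello world"] * 100)
large = pd.Series(["hello world"] * 50_000)

parallel(small, _upper)  # below MIN_LINES_FOR_PARALLELIZATION -> sequential
parallel(large, _upper)  # above the threshold -> split across all cores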

We know that this introduces some added complexity for developers in the nlp and preprocessing modules, but we believe it's the best solution for users (automatic parallelization and the same user experience as before), and with some more developer documentation, other contributors should not have any problems.

MultiProcessing Implementation in Texthero.pdf

Of course, lots of tests are added to test_helper.py.
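For example, the core property such a test can check (a sketch, not necessarily the actual test_helper.py contents) is that the parallel path returns exactly what the sequential path returns:

import pandas as pd
from pandas.testing import assert_series_equal


def _append_token(s: pd.Series) -> pd.Series:
    return s + " token"


def test_parallel_matches_sequential():
    # Long enough to trigger the parallel code path.
    s = pd.Series(["doc"] * (MIN_LINES_FOR_PARALLELIZATION + 1))
    # Values, index, and order must match plain sequential execution.
    assert_series_equal(parallel(s, _append_token), _append_token(s))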

Note: so many lines changed only because this builds upon #157

henrifroese marked this pull request as ready for review August 24, 2020 16:28
mk2510 mentioned this pull request Aug 25, 2020
@henrifroese (Collaborator, Author) commented:

@jbesomi, we have now done more testing with Modin and thought about different implementations. The problem is that there is no modin.apply or similar that we can just use. If we have a modin Series s, we profit from the speedup of doing s.apply(...). So we would need users to do import modin.pandas as pd, and we would also need to change our library everywhere to return modin-pandas objects. We see the following options:

  1. Use our implementation from above for parallelization, which is extremely simple for users and relatively simple for developers. Describe in a tutorial that users who want to work with huge datasets that don't fit into RAM need a different solution (e.g. modin) that does much more than just parallelizing (partitioning DataFrames etc.).

  2. Change our library to always check at the beginning whether the input is a modin object. If it is, only use code that's optimized with modin (that's not too restrictive, as modin covers a lot of the pandas API) and return modin objects; else, just work as usual. Then tell users "if you work with big datasets, just import modin.pandas as pd and we take care of the rest".

  3. Fully switch to modin. That's possible and the easiest way, but very slow for everything below ~ a few 100k rows.

  4. Change our library to check if the input is over MIN_LINES_FOR_PARALLELIZATION. If it is, transform the input to a modin object, profit from the modin parallelization, and re-transform the output from modin to pandas. This adds a big overhead for the conversions, so we're not a fan.

Options 3+4 seem like a bad idea. Option 1 is the easiest, as it's already ready to merge and gives users a great speed boost for data that fits into RAM (probably the case for most users). Option 2 is probably the best for users with big datasets, as they will profit from the whole modin framework (partitioning etc.) and can use our library fully parallelized; of course, it's a little more for us to code, but not that much 🥴 🥈. A rough sketch of the option 2 dispatch is below.
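For concreteness, the dispatch in option 2 could look roughly like this; an illustrative sketch only, where clean, _clean and _is_modin_series are hypothetical names rather than existing texthero code:

def _is_modin_series(s):
    # Cheap check on the type's module that avoids importing modin
    # when it isn't installed.
    return type(s).__module__.startswith("modin")


def clean(s):
    if _is_modin_series(s):
        # modin partitions and parallelizes the pandas API itself,
        # so we skip our own process pool and return a modin object.
        return _clean(s)
    return parallel(s, _clean)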

Interested in what you think :octocat:

@jbesomi (Owner) commented Sep 1, 2020

Thank you for the detailed comments!

  1. I'm a fan of this approach! 🎉
  2. That's not for now. We can start with 1, see how it goes, and eventually do 2.
  3. Not a big fan.
  4. It would probably be the best approach if the conversion weren't expensive, but that's not the case, right? Can you quantify it?

@mk2510 (Collaborator) commented Sep 1, 2020

We have now created a notebook that compares the two implementations (1 and 4). The conversion, as you can see at the bottom of the notebook, is quite cheap, but the computation part takes about twice as long as what we achieve with our implementation. 🥇
See this pdf as an example. :octocat:

Zipped Notebook

jbesomi marked this pull request as draft September 14, 2020 15:45
mk2510 marked this pull request as ready for review September 22, 2020 08:52
@mk2510 (Collaborator) commented Sep 22, 2020

We have now merged #157, i.e. the current master, into this PR, and it is ready for review/merge 🐤 🙏

@henrifroese (Collaborator, Author) commented:

TODO:

  • resolve conflict
  • refine explanation of parallelization for developers, with examples
  • make sure parallelization is explained in getting-started
