
Speed up preprocessing module #124

Closed

Conversation

@mk2510 (Collaborator) commented Jul 26, 2020

We looked at every function in the preprocessing module 😑 The only function where a speed-up made sense was clean 🧹 with the default pipeline, which is probably the most common call. The default pipeline loops through every function; we combined these steps into one pass in _optimised_default_clean_single_cell(text: str), which is 30% faster than the default pipeline. Now when users call clean with the default pipeline, this function is used.
We understand that this introduces code duplication and more maintenance when bugs are fixed or the default pipeline changes, but we believe it makes sense, as the function is called so often. 🦝
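The single-pass idea can be sketched roughly like this (illustrative names and steps only — the real texthero function covers the full default pipeline):

```python
import re

# Illustrative sketch of the single-pass clean: hypothetical patterns are
# compiled once at module level and applied to each string in one call,
# instead of mapping each pipeline step over the whole Series separately.
DIGITS_BLOCK = re.compile(r"\b\d+\b")  # blocks of digits
PUNCTUATION = re.compile(r"[^\w\s]")   # anything that is not a word char or space
WHITESPACE = re.compile(r"\s+")        # runs of whitespace

def optimised_default_clean_single_cell(text: str) -> str:
    text = text.lower()
    text = DIGITS_BLOCK.sub(" ", text)
    text = PUNCTUATION.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()
```

Applied via s.apply(optimised_default_clean_single_cell), each string is touched once rather than once per pipeline step.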

sped up the default clean by combining its steps into one function that operates on strings


Co-authored-by: Henri Froese <hf2000510@gmail.com>
@jbesomi (Owner) commented Jul 27, 2020

Nice, thank you! 🎉

Review:

  1. What about get_default_pipeline? We can either keep it (but why?), remove it, or rename _optimised_default_clean -> get_default_pipeline?
  2. We need to extensively unit-test it, as this will be one of the most used functions.
  3. ... what if we have global regex constants?
  4. What if we update the docstring of clean to explain the new changes?

removed the regex patterns from the functions and placed them in constants above
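Hoisting regexes into module-level constants looks roughly like this (a sketch; only the raw PATTERN_REMOVE_DIGITS_BLOCK pattern appears in the diff, and compiling it is an assumption):

```python
import re

# Compiled once at import time instead of on every function call.
PATTERN_REMOVE_DIGITS_BLOCK = re.compile(r"\b\d+\b")  # pattern from the diff

def remove_digits_block(text: str) -> str:
    # Hypothetical helper: reuses the precompiled module-level constant.
    return PATTERN_REMOVE_DIGITS_BLOCK.sub(" ", text)
```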



Co-authored-by: Henri Froese <hf2000510@gmail.com>
changed Docstring



Co-authored-by: Henri Froese <hf2000510@gmail.com>
@mk2510 (Collaborator, Author) commented Jul 27, 2020

Review:

  1. What about get_default_pipeline? We can either keep it (but why?), remove it or rename _optimised_default_clean -> get_default_pipeline?
  2. We need to extensively unit-test it as this will be one of the most used function
  3. what if we have global constants regex?
  4. What if we update the docstring of clean explaining the new changes?

I have just updated the files to include those changes. 😅

@@ -17,6 +17,34 @@

from typing import List, Callable

# REGEX pattern constants
PATTERN_REMOVE_DIGITS_BLOCK = r"\b\d+\b"
@jbesomi (Owner) commented:
Probably, the words REMOVE, REPLACE and PATTERN_ are not necessary:

PATTERN_REMOVE_DIGITS_BLOCK -> DIGITS_BLOCK
PATTERN_REMOVE_CURLY_BRACKETS -> CURLY_BRACKETS
...

"""


def GET_PATTERN_TOKENIZATION(punct: str) -> str:
@jbesomi (Owner) commented:

Global functions are always lowercased
"Returns the standart tokenisation pattern": not particularly meaningful
"Returns" -> "Return"

@mk2510 (Collaborator, Author) replied:

Ah, thanks 🥇 I was unsure about the Python style guide. It is now a private function.

@mk2510 mk2510 marked this pull request as ready for review July 29, 2020 06:37
@mk2510 (Collaborator, Author) commented Jul 29, 2020

I have just implemented those changes. Thanks for the comments on the Python style guide 👍
Also fixed some bigger merge problems in the preprocessing module.

@mk2510 mk2510 requested a review from jbesomi July 29, 2020 12:12
@mk2510 mk2510 added the enhancement New feature or request label Aug 3, 2020
@jbesomi (Owner) commented Aug 5, 2020

The new clean solution makes use of apply. But in some cases (for instance with remove punctuation) we might not need apply and could prefer s.str.replace(REGEX), no?

I'm OK with improving the clean function, but first I think we need a better understanding of which solution is actually the fastest, i.e. we need to benchmark a bit.

It would be great if you could run some benchmarks and prepare some reports/insights, maybe as a single Jupyter notebook that you can attach here.

A possible benchmark pipeline might be:

  1. apply(re.sub) vs str.replace
  2. your version vs old version
  3. go even further and try to parallelize more (with flashtext, see flashtext integration #87, for instance?)

We can also split this PR in two:

  1. Add REGEX constants
  2. Benchmark clean + add clean + add unit tests
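A minimal version of step 1 of that benchmark pipeline could look like this (a sketch with an illustrative corpus; timings vary by machine and data):

```python
import re
import timeit

import pandas as pd

# Illustrative corpus, not from the PR.
s = pd.Series(["Text with 123 numbers and 456 more."] * 1_000)
pattern = r"\b\d+\b"
compiled = re.compile(pattern)

# Variant 1: element-wise apply with a precompiled re.sub
t_apply = timeit.timeit(lambda: s.apply(lambda x: compiled.sub(" ", x)), number=10)

# Variant 2: pandas' vectorized str.replace with regex=True
t_str = timeit.timeit(lambda: s.str.replace(pattern, " ", regex=True), number=10)

print(f"apply(re.sub): {t_apply:.3f}s  str.replace: {t_str:.3f}s")
```

Both variants produce identical Series, so only the timing differs.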

texthero/preprocessing.py

def _get_pattern_for_tokenisation(punct: str) -> str:
"""
Return the standart tokenisation pattern
@jbesomi (Owner) commented:

the word "standart" does not mean anything for most of the users ...
we should come up with a better explanation, i.e. mentioning both what this function does and why, which problem it solves

@mk2510 (Collaborator, Author) replied:

I have updated the comment. I hope it will be clearer now 🐰

Test clean
"""

def _get_default_clean_pipeline(self):
@jbesomi (Owner) commented:

Not sure we really need so many tests for this part ...?

@mk2510 (Collaborator, Author) replied:

I think these tests will help us cover all the different sections of the pipeline individually, so if something gets changed, we know which part is broken.

@mk2510 (Collaborator, Author) commented Aug 6, 2020

apply(re.sub) vs str.replace
your version vs old version

@jbesomi I have created here a notebook https://colab.research.google.com/drive/1HeVzomOS2F962qxTxKu_W__e1HaIfvZq?usp=sharing which presents the time differences.

go even further and try to parallelize more (with flashtext #87 for instance?)

As far as I understood, the linked pull request brings flashtext to our library. In my experience, flashtext is faster with huge regex patterns, not with the small ones we still have. So I think this integration might be more part of the other issue 🏎️

@henrifroese and I will start on the multithreading task today, which will give general improvements. This pull request should just improve the functions with more efficient code 🤓

@mk2510 (Collaborator, Author) commented Aug 6, 2020

apply(re.sub) vs str.replace

we need re.sub, because str.replace on a plain Python string does not accept regex patterns.
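The distinction in a nutshell (plain-Python example, not from the PR):

```python
import re

text = "call 123 or 456"

# str.replace on a plain str does literal substring replacement,
# so the regex below is never interpreted as a pattern:
assert text.replace(r"\b\d+\b", "#") == text  # nothing matched

# re.sub treats the same string as a regex and replaces both digit blocks:
assert re.sub(r"\b\d+\b", "#", text) == "call # or #"
```

pandas' Series.str.replace does accept regexes (with regex=True), which is why this trade-off only shows up in the single-cell string version.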

@henrifroese (Collaborator) commented Aug 23, 2020

After doing more speed comparisons here, we have noticed that

  • the new clean function is actually not faster after all -> don't change it
  • str.replace is as fast as apply(re.sub) -> don't change it

So we're closing 🚪 this 🦀 🌵 😿
