Speed up preprocessing module #124
Conversation
Sped up the default function by writing it as a single function that operates on strings. Co-authored-by: Henri Froese <hf2000510@gmail.com>
Nice, thank you! 🎉 Review:
Removed the regex patterns from the functions and placed them in constants above. Co-authored-by: Henri Froese <hf2000510@gmail.com>
Changed the docstring. Co-authored-by: Henri Froese <hf2000510@gmail.com>
I have just updated the files to include those changes. 😅
texthero/preprocessing.py (Outdated)
@@ -17,6 +17,34 @@
from typing import List, Callable

# REGEX pattern constants
PATTERN_REMOVE_DIGITS_BLOCK = r"\b\d+\b"
Probably, the words REMOVE, REPLACE and the PATTERN_ prefix are not necessary:
PATTERN_REMOVE_DIGITS_BLOCK -> DIGITS_BLOCK
PATTERN_REMOVE_CURLY_BRACKETS -> CURLY_BRACKETS
...
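For illustration, a minimal sketch of how such a renamed module-level constant could be used (the helper name and Series handling here are assumptions, not code from the PR):

```python
import pandas as pd

# Module-level constant, following the suggested naming (no PATTERN_REMOVE_ prefix).
DIGITS_BLOCK = r"\b\d+\b"

def replace_digits_block(s: pd.Series, symbol: str = " ") -> pd.Series:
    # Hypothetical helper: substitute standalone digit blocks in every cell.
    return s.str.replace(DIGITS_BLOCK, symbol, regex=True)

print(replace_digits_block(pd.Series(["room 7 and 12b"]))[0])
# "room   and 12b" -- only the standalone "7" is replaced
```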
texthero/preprocessing.py (Outdated)
"""

def GET_PATTERN_TOKENIZATION(punct: str) -> str:
Global functions are always lowercased
"Returns the standart tokenisation pattern": not particularly meaningful
"Returns" -> "Return"
Ah, thanks 🥇 I was unsure about the Python style guide. But now it is a private function.
I have just implemented those changes. Thanks for the comments on the Python style guide 👍
I'm OK with improving the clean function, but first I think we need a better understanding of which solution is actually the fastest, i.e. we need to benchmark a bit. It would be great if you could do some benchmarks and prepare some reports/insights, maybe as a single Jupyter notebook that you can attach here. A possible benchmark pipeline might be (a rough timing sketch is included below):
We can also split this PR into two:
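As a rough illustration of such a timing comparison (not from the PR; the sample corpus is invented, and the single-pass helper is assumed to be importable from this branch):

```python
import timeit

import pandas as pd
from texthero import preprocessing

# Synthetic corpus; size and content are made up for illustration.
s = pd.Series(["Hello   WORLD 123 {curly} <html>tags</html> :) !"] * 10_000)

# Current behaviour: clean() applies every function of the default pipeline in turn.
t_default = timeit.timeit(lambda: preprocessing.clean(s), number=5)

# Proposed behaviour: apply the combined single-pass function to every cell once.
t_single_pass = timeit.timeit(
    lambda: s.apply(preprocessing._optimised_default_clean_single_cell), number=5
)

print(f"default pipeline : {t_default:.2f}s")
print(f"single-pass clean: {t_single_pass:.2f}s")
```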
texthero/preprocessing.py (Outdated)

def _get_pattern_for_tokenisation(punct: str) -> str:
    """
    Return the standart tokenisation pattern
The word "standart" does not mean anything to most users ... we should come up with a better explanation, i.e. mention both what this function does and why, and which problem it solves.
I have updated the comment. I hope it will be clearer now 🐰
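Purely as an illustration of the kind of docstring the reviewer is asking for, here is a sketch of the private helper (the regex shown is an assumption, not the PR's actual pattern):

```python
import re
import string

def _get_pattern_for_tokenisation(punct: str = string.punctuation) -> str:
    """
    Return the regex pattern used to tokenise a document.

    The pattern matches the (zero-width) boundary between a word character
    and a punctuation character from `punct`, so the caller can insert a
    space there and then split on whitespace to obtain clean tokens
    ("Hello!" -> "Hello", "!") instead of punctuation glued to words.
    """
    # Hypothetical pattern: boundaries between word characters and punctuation.
    punct_class = f"[{re.escape(punct)}]"
    return rf"(?<=\w)(?={punct_class})|(?<={punct_class})(?=\w)"

text = "Hello,world!"
print(re.sub(_get_pattern_for_tokenisation(), " ", text).split())
# ['Hello', ',', 'world', '!']
```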
Test clean
"""

def _get_default_clean_pipeline(self):
Not sure we really need so many tests for this part ...?
I think those tests will help us cover all the different sections of the pipeline individually, so if something gets changed, we know which part is broken. A rough sketch of what such per-section tests could look like is below.
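This is only an illustration of the idea, not the PR's actual test code; the test names and input strings are assumptions:

```python
import unittest

import pandas as pd
from texthero import preprocessing

class TestCleanDefaultPipeline(unittest.TestCase):
    # Each test exercises one step of the default pipeline, so a regression
    # points directly at the step that broke.
    def test_lowercases(self):
        s = pd.Series(["HELLO"])
        self.assertEqual(preprocessing.clean(s)[0], "hello")

    def test_removes_digit_blocks(self):
        s = pd.Series(["room 7 cleaned"])
        self.assertNotIn("7", preprocessing.clean(s)[0])

if __name__ == "__main__":
    unittest.main()
```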
@jbesomi I have created a notebook here https://colab.research.google.com/drive/1HeVzomOS2F962qxTxKu_W__e1HaIfvZq?usp=sharing which presents the time differences.
As far as I understood, the linked pull request brings fasttext to our library. In my experience, fasttext is faster with huge regex patterns, not with the small ones we still have. So I think this integration might be more a part of the other issue 🏎️ @henrifroese and I will start with the multithreading task today, which will give general improvements. This pull request should just improve the functions with more efficient code 🤓
We need to use re.sub, as str.replace does not accept regex expressions when called on a string.
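A tiny illustration of the point (the example string is made up):

```python
import re

text = "price: 42 apples"

# str.replace looks for the literal substring "\b\d+\b", so nothing matches.
print(text.replace(r"\b\d+\b", ""))   # price: 42 apples

# re.sub interprets the pattern as a regex and removes the digit block.
print(re.sub(r"\b\d+\b", "", text))   # price:  apples
```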
After doing more speed comparisons here, we have noticed that
So we're closing 🚪 this 🦀 🌵 😿
Looked at every function in the preprocessing module 😑 The only function where we found it makes sense to speed things up was the clean function 🧹 with the default pipeline, since clean is probably called with the default pipeline very often. The default pipeline loops through every function; we combined these into one pass in _optimised_default_clean_single_cell(text: str), which is 30% faster than the default pipeline. Now when users call clean with the default pipeline, this function will be used. We understand that this introduces code duplication and more maintenance if bugs are fixed or changes to the default pipeline are introduced, but we believe it makes sense, as the function is called so often. 🦝
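For readers who want a feel for what a single-pass variant looks like, here is a minimal sketch (the steps and regexes shown are illustrative assumptions, not the PR's exact code):

```python
import re

import pandas as pd

DIGITS_BLOCK = r"\b\d+\b"   # from the diff above
PUNCTUATION = r"[^\w\s]"    # assumed
WHITESPACE = r"\s+"         # assumed

def _optimised_default_clean_single_cell(text: str) -> str:
    # One pass over a single cell: apply everything the default pipeline does,
    # but on a plain string instead of traversing the whole Series once per step.
    text = text.lower()
    text = re.sub(DIGITS_BLOCK, " ", text)
    text = re.sub(PUNCTUATION, " ", text)
    text = re.sub(WHITESPACE, " ", text).strip()
    return text

def clean_default(s: pd.Series) -> pd.Series:
    # clean() can then dispatch to the single-pass function for the default pipeline.
    return s.apply(_optimised_default_clean_single_cell)
```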
, which is 30% faster than the default pipeline. Now when users call clean with the default pipeline, this function will be used.We understand, that this introduces code duality and more maintenance if bugs were fixed and changes to the default pipeline introduced, but we believe it makes sense, as the function is called so often. 🦝