
Speed up preprocessing module #124

Closed

Conversation

@mk2510 (Collaborator) commented Jul 26, 2020

We looked at every function in the preprocessing module 😑 The only function where a speed-up made sense was clean 🧹 with the default pipeline, which is probably the most common call. The default pipeline loops through every function; we combined these steps into one pass in _optimised_default_clean_single_cell(text: str), which is 30% faster than the default pipeline. Now when users call clean with the default pipeline, this function is used.
We understand that this introduces code duplication and more maintenance when bugs are fixed or the default pipeline changes, but we believe it makes sense, as the function is called so often. 🦝
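The single-pass idea can be sketched roughly like this (illustrative names and steps only — the real texthero function covers the full default pipeline):

```python
import re

# Illustrative sketch of the single-pass clean: hypothetical patterns are
# compiled once at module level and applied to each string in one call,
# instead of mapping each pipeline step over the whole Series separately.
DIGITS_BLOCK = re.compile(r"\b\d+\b")  # blocks of digits
PUNCTUATION = re.compile(r"[^\w\s]")   # anything that is not a word char or space
WHITESPACE = re.compile(r"\s+")        # runs of whitespace

def optimised_default_clean_single_cell(text: str) -> str:
    text = text.lower()
    text = DIGITS_BLOCK.sub(" ", text)
    text = PUNCTUATION.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()
```

Applied via s.apply(optimised_default_clean_single_cell), each string is touched once rather than once per pipeline step.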

sped up the default clean by combining its steps into one function that operates on strings


Co-authored-by: Henri Froese <hf2000510@gmail.com>
@jbesomi (Owner) commented Jul 27, 2020

Nice, thank you! 🎉

Review:

  1. What about get_default_pipeline? We can either keep it (but why?), remove it, or rename _optimised_default_clean -> get_default_pipeline?
  2. We need to extensively unit-test it, as this will be one of the most used functions.
  3. ... what if we have global regex constants?
  4. What if we update the docstring of clean to explain the new changes?

removed the regex patterns from the functions and placed them in constants above
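Hoisting regexes into module-level constants looks roughly like this (a sketch; only the raw PATTERN_REMOVE_DIGITS_BLOCK pattern appears in the diff, and compiling it is an assumption):

```python
import re

# Compiled once at import time instead of on every function call.
PATTERN_REMOVE_DIGITS_BLOCK = re.compile(r"\b\d+\b")  # pattern from the diff

def remove_digits_block(text: str) -> str:
    # Hypothetical helper: reuses the precompiled module-level constant.
    return PATTERN_REMOVE_DIGITS_BLOCK.sub(" ", text)
```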



Co-authored-by: Henri Froese <hf2000510@gmail.com>
changed Docstring



Co-authored-by: Henri Froese <hf2000510@gmail.com>
@mk2510 (Collaborator, Author) commented Jul 27, 2020

Review:

  1. What about get_default_pipeline? We can either keep it (but why?), remove it or rename _optimised_default_clean -> get_default_pipeline?
  2. We need to extensively unit-test it as this will be one of the most used function
  3. what if we have global constants regex?
  4. What if we update the docstring of clean explaining the new changes?

I have just updated the files to include those changes. 😅

@@ -17,6 +17,34 @@

from typing import List, Callable

# REGEX pattern constants
PATTERN_REMOVE_DIGITS_BLOCK = r"\b\d+\b"
@jbesomi (Owner) commented:
Probably, the words REMOVE, REPLACE and PATTERN_ are not necessary:

PATTERN_REMOVE_DIGITS_BLOCK -> DIGITS_BLOCK
PATTERN_REMOVE_CURLY_BRACKETS -> CURLY_BRACKETS
...

"""


def GET_PATTERN_TOKENIZATION(punct: str) -> str:
@jbesomi (Owner) commented:

Global functions are always lowercased
"Returns the standart tokenisation pattern": not particularly meaningful
"Returns" -> "Return"

@mk2510 (Collaborator, Author) replied:

Ah, thanks 🥇 I was unsure about the Python style guide. It is now a private function.

@mk2510 mk2510 marked this pull request as ready for review July 29, 2020 06:37
@mk2510 (Collaborator, Author) commented Jul 29, 2020

I have just implemented those changes. Thanks for the comments on the Python style guide 👍
Also fixed some bigger merge problems in the preprocessing module.

@mk2510 mk2510 requested a review from jbesomi July 29, 2020 12:12
@mk2510 mk2510 added the enhancement New feature or request label Aug 3, 2020
@jbesomi (Owner) commented Aug 5, 2020

The new clean solution makes use of apply. But in some cases (for instance with remove punctuation) we might not need apply and could prefer s.str.replace(REGEX), no?

I'm OK with improving the clean function, but first I think we need a better understanding of which solution is actually the fastest, i.e. we need to benchmark a bit.

It would be great if you could run some benchmarks and prepare some reports/insights, maybe as a single Jupyter notebook that you can attach here.

A possible benchmark pipeline might be:

  1. apply(re.sub) vs str.replace
  2. your version vs old version
  3. go even further and try to parallelize more (with flashtext, see flashtext integration #87, for instance?)

We can also split this PR in two:

  1. Add REGEX constants
  2. Benchmark clean + add clean + add unit tests
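A minimal version of step 1 of that benchmark pipeline could look like this (a sketch with an illustrative corpus; timings vary by machine and data):

```python
import re
import timeit

import pandas as pd

# Illustrative corpus, not from the PR.
s = pd.Series(["Text with 123 numbers and 456 more."] * 1_000)
pattern = r"\b\d+\b"
compiled = re.compile(pattern)

# Variant 1: element-wise apply with a precompiled re.sub
t_apply = timeit.timeit(lambda: s.apply(lambda x: compiled.sub(" ", x)), number=10)

# Variant 2: pandas' vectorized str.replace with regex=True
t_str = timeit.timeit(lambda: s.str.replace(pattern, " ", regex=True), number=10)

print(f"apply(re.sub): {t_apply:.3f}s  str.replace: {t_str:.3f}s")
```

Both variants produce identical Series, so only the timing differs.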

texthero/preprocessing.py

def _get_pattern_for_tokenisation(punct: str) -> str:
"""
Return the standart tokenisation pattern
@jbesomi (Owner) commented:

the word "standart" does not mean anything for most of the users ...
we should come up with a better explanation, i.e. mentioning both what this function does and why, which problem it solves

@mk2510 (Collaborator, Author) replied:

I have updated the comment. I hope it will be clearer now 🐰

Test clean
"""

def _get_default_clean_pipeline(self):
@jbesomi (Owner) commented:

Not sure we really need so many tests for this part ...?

@mk2510 (Collaborator, Author) replied:

I think these tests will help us cover all the different sections of the pipeline individually, so if something gets changed, we know which part is broken.

@mk2510 (Collaborator, Author) commented Aug 6, 2020

apply(re.sub) vs str.replace
your version vs old version

@jbesomi I have created here a notebook https://colab.research.google.com/drive/1HeVzomOS2F962qxTxKu_W__e1HaIfvZq?usp=sharing which presents the time differences.

go even further and try to parallelize more (with flashtext #87 for instance?)

As far as I understood, the linked pull request brings flashtext to our library. In my experience, flashtext is faster with huge regex patterns, not with the small ones we still have. So I think this integration might be more part of the other issue 🏎️

@henrifroese and I will start on the multithreading task today, which will give general improvements. This pull request should just improve the functions with more efficient code 🤓

@mk2510 (Collaborator, Author) commented Aug 6, 2020

apply(re.sub) vs str.replace

we need re.sub, because str.replace on a plain Python string does not accept regex patterns.
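The distinction in a nutshell (plain-Python example, not from the PR):

```python
import re

text = "call 123 or 456"

# str.replace on a plain str does literal substring replacement,
# so the regex below is never interpreted as a pattern:
assert text.replace(r"\b\d+\b", "#") == text  # nothing matched

# re.sub treats the same string as a regex and replaces both digit blocks:
assert re.sub(r"\b\d+\b", "#", text) == "call # or #"
```

pandas' Series.str.replace does accept regexes (with regex=True), which is why this trade-off only shows up in the single-cell string version.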

@henrifroese (Collaborator) commented Aug 23, 2020

After doing more speed comparisons here, we have noticed that

  • the new clean function is actually not faster after all -> don't change it
  • str.replace is as fast as apply(re.sub) -> don't change it

So we're closing 🚪 this 🦀 🌵 😿
