-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update rules for removing table markup. Fix #1710 #1954
Conversation
gensim/corpora/wikicorpus.py
Outdated
@@ -185,10 +186,14 @@ def remove_markup(text, promote_remaining=True, simplify_links=True): | |||
if simplify_links: | |||
text = re.sub(RE_P6, '\\2', text) # simplify links, keep description only | |||
# remove table markup | |||
|
|||
text = text.replace('||', '\n|') # each table cell on a separate line | |||
text = text.replace("!!", "\n|") # leave only cell content(in table head) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's a performance difference after this changes?
@@ -645,6 +645,18 @@ def test_max_token_len_set(self): | |||
corpus = self.corpus_class(self.enwiki, processes=1, token_max_len=16, lemmatize=False) | |||
self.assertTrue(u'collectivization' in next(corpus.get_texts())) | |||
|
|||
def test_removed_table_markup(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like CI stuck on this test (and this failed), please debug it.
ping @chaitaliSaini, how is going? |
Time taken before changes: 2hrs, after changes: 2hrs 22mins. |
gensim/corpora/wikicorpus.py
Outdated
"""Table cell formatting.""" | ||
RE_P14 = re.compile(r'\[\[Category:[^][]*\]\]', re.UNICODE) | ||
"""Categories.""" | ||
RE_P15 = re.compile(r'\[\[([fF]ile:|[iI]mage)[^]]*(\]\])', re.UNICODE) | ||
"""Remove File and Image templates.""" | ||
RE_P16 = re.compile(r'\[{2}(.*?)\]{2}', re.UNICODE) | ||
"""Capture interlinks text and article linked""" | ||
|
||
RE_P17 = re.compile(r'(\n.{0,4}((bgcolor)|(\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=)|(scope=))(.*))|(^.{0,2}((bgcolor)|(\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=))(.*))', re.UNICODE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: too long line https://travis-ci.org/RaRe-Technologies/gensim/jobs/351361148#L510, please fix this (we using 120 characters limit)
@piskvorky this change ~20% slowdown |
Of course. Correctness comes before speed. @chaitaliSaini is this fix complete, or are there still markup situations where the parsing fails? |
@piskvorky 👍 |
I compared results before and after this change
LGTM, good work @chaitaliSaini 👍 |
Updated the regex rules for removing table markup.