Fix segment-wiki script #1694
Conversation
…enization, more descriptive field names, stdout support (as default option)
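With stdout as the default output, the script can be piped straight into a compressor. A hypothetical invocation (the -f flag matches the example later in this thread; the rest is illustrative, not a command from the PR):

    python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 | gzip > enwiki-latest-pages-articles.json.gz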
gensim/scripts/segment_wiki.py
Outdated
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump and extract sections of pages from it
and save to json-line format.
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump (typical filename
is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2),
I'd add a link to the actual place to download those, because it's not obvious.
For example, the English Wiki dump is here: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
gensim/scripts/segment_wiki.py
Outdated
tokenizer). The package is available at https://github.com/clips/pattern .
    'title' (str) - title of article,
    'section_titles' (list) - list of titles of sections,
    'section_texts' (list) - list of content from sections.
I'd prefer to include a concrete hands-on example, something like this:
Process a raw Wikipedia dump (XML.bz2 format, for example https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 for the English Wikipedia) and extract all articles and their sections as plain text::
python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o enwiki-20171001-pages-articles.json.gz
The output format of the parsed plain text Wikipedia is json-lines = one article per line, serialized into JSON. Here's an example of how to work with it from Python::

    import json
    from smart_open import smart_open

    # iterate over the plain text file we just created
    for line in smart_open('enwiki-20171001-pages-articles.json.gz'):
        # decode JSON into a Python object
        article = json.loads(line)
        # each article has "title", "section_titles" and "section_texts" fields
        print("Article title: %s" % article['title'])
        for section_title, section_text in zip(article['section_titles'], article['section_texts']):
            print("Section title: %s" % section_title)
            print("Section text: %s" % section_text)
            num_total_tokens += len(utils.lemmatize(section_content))
        else:
            num_total_tokens += len(tokenize(section_content))
    if num_total_tokens < ARTICLE_MIN_WORDS or \
Including redirects and stubs is a bad idea. That's typically not (ever?) what people want out of Wikipedia dumps.
We want to keep only meaningful articles, such as those with at least 500 plain text characters (~1 paragraph) or something.
I don't think so (about short articles): we provide the parsed Wikipedia dump "as-is", and short articles can be useful for users in special cases (and are easy to filter out later if needed). For this reason, I removed this part.
But we do need to filter out trash (like redirects); I'll add a fix for this.
Sounds good 👍
Stubs are not really articles though; most of their text is boilerplate like "this article is a stub, help Wikipedia by expanding it". Not terribly useful, and potentially messing up corpus statistics for people who are unaware of this.
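To make the discussion concrete, here is a minimal sketch of the kind of filter being debated, assuming each article arrives as a list of (section_title, section_text) pairs; the helper name and the 500-character threshold are illustrative, not code from this PR:

    def is_meaningful_article(sections, min_article_character=500):
        """Return False for redirects, empty pages and very short stubs."""
        if not sections:
            return False
        # redirect pages consist of a single section starting with "#REDIRECT"
        if sections[0][1].lstrip().lower().startswith("#redirect"):
            return False
        # require a minimum amount of plain text (roughly one paragraph)
        return sum(len(body.strip()) for _, body in sections) >= min_article_character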
gensim/scripts/segment_wiki.py
Outdated
@@ -226,7 +249,9 @@ def get_texts_with_sections(self):
     for group in utils.chunkize(page_xmls, chunksize=10 * self.processes, maxsize=1):
         for article_title, sections in pool.imap(segment, group):  # chunksize=10):
             # article redirects are pruned here
-            if any(article_title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
+            if any(article_title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES) \
+                    or len(sections) == 0 \
`not sections` is more Pythonic.
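For illustration, both checks below behave the same for a list of sections; the second form is the idiomatic one (a standalone sketch, not code from the PR):

    sections = []

    # explicit length comparison
    if len(sections) == 0:
        print("no sections")

    # truthiness check: empty sequences are falsy in Python
    if not sections:
        print("no sections")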
gensim/scripts/segment_wiki.py
Outdated
                continue
            if len(sections) == 0 or sections[0][1].lstrip().lower().startswith("#redirect"):  # filter redirect
                continue
            if sum(len(body.strip()) for (_, body) in sections) < 250:  # filter very short articles (thrash)
thrash => trash; but it's more stubs than trash.
The constant (250) should be configurable (min_article_characters?), not hardwired like this.
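A possible way to expose the constant, sketched with argparse; the parameter name follows the reviewer's min_article_characters suggestion, and the -f/-o flags mirror the example earlier in this thread (the default value is just the current hardwired number):

    import argparse

    parser = argparse.ArgumentParser(description="Segment a Wikipedia dump into plain text articles.")
    parser.add_argument('-f', '--file', required=True, help='path to the MediaWiki database dump')
    parser.add_argument('-o', '--output', help='output file path; defaults to stdout')
    parser.add_argument('--min-article-characters', type=int, default=250,
                        help='skip articles with fewer plain text characters than this')
    args = parser.parse_args()

    # downstream, the hardwired 250 would become:
    # if sum(len(body.strip()) for (_, body) in sections) < args.min_article_characters: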
What's done: more descriptive field names, stdout support as the default output, and a -h option.