Fix segment-wiki script #1694

Merged
merged 8 commits from fix-segment-wiki into develop on Nov 6, 2017

Conversation

menshikh-iv
Contributor

What's done:

  • Updated field names (now more descriptive, instead of ts, sc);
  • Updated documentation about the output format (available with the -h option);
  • Added stdout support (as the default option; see the sketch after this list);
  • Removed pruning through tokenization (and the comments about it).
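
A minimal sketch of consuming the new default output (assuming the script's stdout emits plain, uncompressed json-lines; the dump filename is just a placeholder), using the new field names 'title', 'section_titles' and 'section_texts':

import json
import subprocess

# Run the script without -o, so it writes one JSON-serialized article per line to stdout.
proc = subprocess.Popen(
    ["python", "-m", "gensim.scripts.segment_wiki", "-f", "enwiki-latest-pages-articles.xml.bz2"],
    stdout=subprocess.PIPE,
)
for line in proc.stdout:
    article = json.loads(line)  # decode one article
    print(article["title"], len(article["section_titles"]), len(article["section_texts"]))
proc.wait()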

…enization, more descriptive field names, stdout support (as default option)
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump and extract sections of pages from it
and save to json-line format.
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump (typical filename
is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2),
Owner

@piskvorky piskvorky Nov 5, 2017


I'd add a link to the actual place to download those, because it's not obvious.

For example, the English Wiki dump is here: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

tokenizer). The package is available at https://github.com/clips/pattern .
'title' (str) - title of article,
'section_titles' (list) - list of titles of sections,
'section_texts' (list) - list of content from sections.
Owner

@piskvorky piskvorky Nov 5, 2017


I'd prefer to include a concrete hands-on example, something like this:


Process a raw Wikipedia dump (XML.bz2 format, for example https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 for the English Wikipedia) and extract all articles and their sections as plain text::

python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o enwiki-20171001-pages-articles.json.gz

The output format of the parsed plain text Wikipedia is json-lines: one article per line, serialized into JSON. Here's an example of how to work with it from Python::

import json
from smart_open import smart_open

# iterate over the plain text file we just created
for line in smart_open('enwiki-20171001-pages-articles.json.gz'):
    # decode JSON into a Python object
    article = json.loads(line)

    # each article has "title", "section_titles" and "section_texts" fields
    print("Article title: %s" % article['title'])
    for section_title, section_text in zip(article['section_titles'], article['section_texts']):
        print("Section title: %s" % section_title)
        print("Section text: %s" % section_text)

    num_total_tokens += len(utils.lemmatize(section_content))
else:
    num_total_tokens += len(tokenize(section_content))
if num_total_tokens < ARTICLE_MIN_WORDS or \
Owner


Including redirects and stubs is a bad idea. That's typically not (never?) what people want out of Wikipedia dumps.

We want to keep only meaningful articles, such as those with at least 500 plain text characters (~1 paragraph) or something.

Contributor Author


I don't think so (about short articles): we provide the parsed Wikipedia dump "as-is", and short articles can be useful for special cases (and are easy to filter later if needed). For this reason, I removed this part.

Contributor Author


But we do need to filter out trash (like redirects); I'll add a fix for this.

Owner

@piskvorky piskvorky Nov 5, 2017


Sounds good 👍

Stubs are not really articles though; most of the text is something like "this article is a stub, help Wikipedia by expanding it". Not terribly useful, and it can mess up corpus statistics for people who are unaware of this.

@@ -226,7 +249,9 @@ def get_texts_with_sections(self):
     for group in utils.chunkize(page_xmls, chunksize=10 * self.processes, maxsize=1):
         for article_title, sections in pool.imap(segment, group):  # chunksize=10):
             # article redirects are pruned here
-            if any(article_title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
+            if any(article_title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES) \
+                    or len(sections) == 0 \
Owner


`not sections` is more Pythonic
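
For illustration, the combined check written the way this suggests; a sketch of the idea only, not the exact diff that was merged:

# Sketch: rely on the truthiness of the list instead of len(sections) == 0.
if any(article_title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES) \
        or not sections \
        or sections[0][1].lstrip().lower().startswith("#redirect"):
    continue  # skip namespaced pages, empty articles and redirects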

    continue
if len(sections) == 0 or sections[0][1].lstrip().lower().startswith("#redirect"):  # filter redirect
    continue
if sum(len(body.strip()) for (_, body) in sections) < 250:  # filter very short articles (thrash)
Owner

@piskvorky piskvorky Nov 6, 2017


thrash => trash; but these are more stubs than trash.

The constant (250) should be configurable (min_article_characters?), not hardwired like this.
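
As a rough sketch of how that threshold could be exposed on the command line (the long option names below are assumptions for illustration; -f, -o and the 250 default come from the script and diff above):

import argparse

parser = argparse.ArgumentParser(description="segment_wiki (sketch)")
parser.add_argument('-f', '--file', required=True, help='path to the MediaWiki database dump')
parser.add_argument('-o', '--output', default=None, help='output file; stdout if omitted')
# Option name suggested in the review (min_article_characters); hypothetical here.
parser.add_argument('--min-article-characters', type=int, default=250,
                    help='skip articles whose sections contain fewer characters in total')
args = parser.parse_args()

# Inside the segmentation loop, the length filter would then read:
# if sum(len(body.strip()) for (_, body) in sections) < args.min_article_characters:
#     continue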

@menshikh-iv menshikh-iv merged commit 64f9a92 into develop Nov 6, 2017
@menshikh-iv menshikh-iv deleted the fix-segment-wiki branch November 6, 2017 10:02