Fix segment-wiki script #1694
Conversation
…enization, more descriptive field names, stdout support (as default option)
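With stdout as the default output, the script can be piped straight into a compressor. A hypothetical invocation (the -f flag matches the example later in this thread; the rest is illustrative, not a command from the PR):

    python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 | gzip > enwiki-latest-pages-articles.json.gz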
gensim/scripts/segment_wiki.py
Outdated
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump and extract sections of pages from it
and save to json-line format.
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump (typical filename
is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2),
I'd add a link to the actual place to download those, because it's not obvious.
For example, the English Wiki dump is here: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
gensim/scripts/segment_wiki.py
Outdated
tokenizer). The package is available at https://github.com/clips/pattern .
    'title' (str) - title of article,
    'section_titles' (list) - list of titles of sections,
    'section_texts' (list) - list of content from sections.
I'd prefer to include a concrete hands-on example, something like this:
Process a raw Wikipedia dump (XML.bz2 format, for example https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 for the English Wikipedia) and extract all articles and their sections as plain text::
python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o enwiki-20171001-pages-articles.json.gz
The output format of the parsed plain text Wikipedia is json-lines = one article per line, serialized into JSON. Here's an example of how to work with it from Python::

    import json
    from smart_open import smart_open

    # iterate over the plain text file we just created
    for line in smart_open('enwiki-20171001-pages-articles.json.gz'):
        # decode JSON into a Python object
        article = json.loads(line)
        # each article has "title", "section_titles" and "section_texts" fields
        print("Article title: %s" % article['title'])
        for section_title, section_text in zip(article['section_titles'], article['section_texts']):
            print("Section title: %s" % section_title)
            print("Section text: %s" % section_text)
            num_total_tokens += len(utils.lemmatize(section_content))
        else:
            num_total_tokens += len(tokenize(section_content))
    if num_total_tokens < ARTICLE_MIN_WORDS or \
Including redirects and stubs is a bad idea. That's typically not (ever?) what people want out of Wikipedia dumps.
We want to keep only meaningful articles, such as those with at least 500 plain text characters (~1 paragraph) or something.
I don't think so (about short articles): we provide the parsed Wikipedia dump "as-is", and short articles can be useful for users in special cases (and are easy to filter out later if needed). For this reason, I removed this part.
But we do need to filter out trash (like redirects); I'll add a fix for this.
Sounds good 👍
Stubs are not really articles though; most of their text is boilerplate like "this article is a stub, help Wikipedia by expanding it". Not terribly useful, and potentially messing up corpus statistics for people who are unaware of this.
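To make the discussion concrete, here is a minimal sketch of the kind of filter being debated, assuming each article arrives as a list of (section_title, section_text) pairs; the helper name and the 500-character threshold are illustrative, not code from this PR:

    def is_meaningful_article(sections, min_article_character=500):
        """Return False for redirects, empty pages and very short stubs."""
        if not sections:
            return False
        # redirect pages consist of a single section starting with "#REDIRECT"
        if sections[0][1].lstrip().lower().startswith("#redirect"):
            return False
        # require a minimum amount of plain text (roughly one paragraph)
        return sum(len(body.strip()) for _, body in sections) >= min_article_character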
gensim/scripts/segment_wiki.py
Outdated
@@ -226,7 +249,9 @@ def get_texts_with_sections(self):
     for group in utils.chunkize(page_xmls, chunksize=10 * self.processes, maxsize=1):
         for article_title, sections in pool.imap(segment, group):  # chunksize=10):
             # article redirects are pruned here
-            if any(article_title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
+            if any(article_title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES) \
+                    or len(sections) == 0 \
`not sections` is more Pythonic.
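For illustration, both checks below behave the same for a list of sections; the second form is the idiomatic one (a standalone sketch, not code from the PR):

    sections = []

    # explicit length comparison
    if len(sections) == 0:
        print("no sections")

    # truthiness check: empty sequences are falsy in Python
    if not sections:
        print("no sections")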
gensim/scripts/segment_wiki.py
Outdated
                continue
            if len(sections) == 0 or sections[0][1].lstrip().lower().startswith("#redirect"):  # filter redirect
                continue
            if sum(len(body.strip()) for (_, body) in sections) < 250:  # filter very short articles (thrash)
thrash => trash; but it's more stubs than trash.
The constant (250) should be configurable (min_article_characters?), not hardwired like this.
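A possible way to expose the constant, sketched with argparse; the parameter name follows the reviewer's min_article_characters suggestion, and the -f/-o flags mirror the example earlier in this thread (the default value is just the current hardwired number):

    import argparse

    parser = argparse.ArgumentParser(description="Segment a Wikipedia dump into plain text articles.")
    parser.add_argument('-f', '--file', required=True, help='path to the MediaWiki database dump')
    parser.add_argument('-o', '--output', help='output file path; defaults to stdout')
    parser.add_argument('--min-article-characters', type=int, default=250,
                        help='skip articles with fewer plain text characters than this')
    args = parser.parse_args()

    # downstream, the hardwired 250 would become:
    # if sum(len(body.strip()) for (_, body) in sections) < args.min_article_characters: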
What's done: more descriptive field names, stdout support as the default output, and a -h option.