Add interlinks to segment_wiki #1712
Comments
Thanks for creating this issue. Another suggestion would be to include the span of the matched text, i.e. its begin and end offsets. That would give a segmented corpus for free, which could later be used to tokenize the section text with each link text treated as a single unit. So the final format may look like:
The corpora and code included with gensim are restricted to topic modelling and unsupervised text processing. We're not aiming to be "everything for everybody". Including other types of information (supervised labels, graph structure) is possible but needs to be clearly motivated. @napsternxg how would you use this extra information? What is the intended application? |
@piskvorky I understand the requirement for gensim to stay focused on topic modelling and unsupervised text processing. The major application is using multi-word units in Wikipedia, which are usually linked to other wiki pages, as components of topic models and other text processing. E.g. simple tokenization will split phrases like "Barack Obama" or "Natural Language Processing". Although there is support for extracting n-grams using the Phrases module, a more principled approach when processing wiki pages would be to identify these phrases as a single concept (which is very easy to do for Wikipedia). Mapping the wiki link text to the wiki page would allow normalizing these phrases to a common concept in Wikipedia. E.g. LDA is both "Latent Dirichlet allocation" and "Linear Discriminant Analysis". This will help in reducing the vocabulary size. Finally, the motivation for allowing begin and end offsets in the JSON data was to help override tokenization flaws, especially with biomedical and chemical names. These were the use cases I had in mind. I would be happy to see this feature, since I have been quite impressed with the processing speed of the algorithms in gensim, and the Wikipedia dump parser appears to be very fast. Another alternative would be to use the |
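To make the offset idea concrete, here is a minimal sketch of how begin/end spans could keep a linked phrase together during tokenization. The function name `tokenize_with_spans` and the span format (character offsets, end exclusive, non-overlapping) are assumptions for illustration only; nothing like this is part of gensim.

```python
def tokenize_with_spans(text, spans):
    """Whitespace-tokenize `text`, keeping each (begin, end) span as one token.

    `spans` is a hypothetical format: non-overlapping character offsets,
    end exclusive, e.g. the positions of link texts inside a section.
    """
    tokens = []
    cursor = 0
    for begin, end in sorted(spans):
        # Plain text before the link is split on whitespace as usual.
        tokens.extend(text[cursor:begin].split())
        # The linked text is emitted as a single multi-word token.
        tokens.append(text[begin:end])
        cursor = end
    # Remaining text after the last link.
    tokens.extend(text[cursor:].split())
    return tokens


text = "Barack Obama studied Natural Language Processing"
phrases = ["Barack Obama", "Natural Language Processing"]
# Derive the offsets from the text itself instead of hard-coding them.
spans = [(text.index(p), text.index(p) + len(p)) for p in phrases]
tokens = tokenize_with_spans(text, spans)
print(tokens)
```

With spans available, the tokenizer never has to guess phrase boundaries, which is the advantage over a statistical n-gram detector in this setting.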
Thank you @napsternxg. Maybe you could try implementing this feature? That would be great! |
I can have a look at it after December 15th. Will send a PR then. |
Hey @napsternxg @piskvorky any opinions? |
Thank you for the explanation, that makes sense. I don't think identifying the interlink location down to a section is critical. But the voice of people who actually use this feature is more important than mine -- do you think the section is important? What are the pros/cons? |
@steremma this is great, thanks for adding this in. My use case was being able to identify the multi-word unit in the text along with the wiki page it points to. But I don't think the current approach can handle this, as it removes that information and retains only the link to the wiki page. If we could also have the interlink text and identify which page it points to, that would help in training multi-word vectors more effectively. But this approach is also quite useful, as we can just include the interwiki links as document tags and train the document embeddings with that information. |
@steremma is it possible to do what @napsternxg suggested? |
I am manually checking sample wiki pages in our test set and it appears that in most cases the text link is exactly the same as the title it points to. There are a few cases where the text is altered a little bit. So adding this map would show an output mostly like this:
EDIT: It can be easily done by adding another boolean argument to |
Done, please check updated PR |
…Fix piskvorky#1712 (piskvorky#1839)
* promoting the markup gives up information needed to find the interlinks
* Add interlinks to the output of `segment_wiki`
* New output format is (str, list of (str, str), list of str), reflecting structure (title, [(section_heading, section_content), ...], [interlink, ...])
* `filter_wiki` in WikiCorpus will not promote uncaught markup to plain text as this will give up valuable information for the interlink discovery
* Fixed PEP 8
* Refactoring indentation and variable names
* Removed debugging code from script
* Fixed a bug where interlinks with a description or multiple names were disregarded
* Due to preprocessing in `filter_wiki` interlinks containing alternative names had one of the 2 `[` and `]` characters removed. The regex now takes that into account.
* Now stripping whitespace off section titles
* Unit test `gensim.scripts.segment_wiki`
* Initiate unit testing for all scripts.
* Check for expected len given article filtering (namespace, size in characters and redirections).
* Check for yielded title, section headings and texts as well as interlinks yielded from generator.
* Check that the same is correctly persisted in JSON.
* Fix PEP 8
* Fix Python 3.5 compatibility
* Section text now completely clean from wiki markup
* Refactored filtering functions in `wikicorpus.py` so that uncaught markup can be optionally promoted to plain text
* Interlink extraction logic moved to `wikicorpus.py`
* Unit tests modified accordingly
* Added extra logging info to troubleshoot weird Travis behavior
* Fix PEP 8
* pin workers for segment_and_write_all_articles
* Get rid of debugging stuff
* Get rid of global logger
* Interlinks are now mapping from the linked article's title to the actual interlink text
* Used boolean argument with default argument in `filter_wiki`. The default value keeps the old functionality so that existing code does not break
* Overriding the default argument causes interlinks to not be simplified and lets `find_interlinks` create the mappings
* Moved regex outside function
* Interlink extraction is now optional and controlled with the `-i` command line argument
* PEP 8 long lines
* made scripts tests aware of the optional interlinks argument
* Updated script help output for interlinks
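As a rough illustration of the title-to-text mapping described in the commit message, the sketch below pulls `[[Title|text]]` links out of raw wiki markup with a regular expression. It is not the code from the PR: the names `INTERLINK_RE` and `extract_interlinks` are hypothetical, and the pattern assumes unprocessed markup with both brackets intact, ignoring namespaces such as `File:` and nested links.

```python
import re

# Hypothetical pattern: [[Title]] or [[Title|displayed text]].
INTERLINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_interlinks(markup):
    """Map each linked article title to the text shown for the link."""
    links = {}
    for match in INTERLINK_RE.finditer(markup):
        title = match.group(1).strip()
        # Without a "|text" part, the link text is the title itself.
        text = (match.group(2) or title).strip()
        links[title] = text
    return links


sample = "[[Latent Dirichlet allocation|LDA]] is a [[topic model]]."
links = extract_interlinks(sample)
print(links)
```

A dict keeps only the last link text seen for a given title; a list of (title, text) pairs would be needed to preserve every occurrence.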
Idea
Users have asked for this feature. Having interlinks in the dump is useful for constructing the graph of articles or for using the relations between articles in other ways.
What needs to be implemented
Add field
"section_interlinks" (list of str)
that contains a list of article titles referenced by this section.
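For the article-graph use case, a minimal sketch of how such a field could be consumed is shown below. The record layout (a `title` key plus an `interlinks` collection of target titles) is an assumption for illustration and may not match the actual key names in the `segment_wiki` output; `build_link_graph` is a hypothetical helper.

```python
from collections import defaultdict


def build_link_graph(articles):
    """Build a directed graph: article title -> set of linked article titles.

    Each record is assumed to carry "title" and, optionally, "interlinks"
    (a list of titles, or a dict keyed by title).
    """
    graph = defaultdict(set)
    for article in articles:
        # update() also registers articles that have no outgoing links.
        graph[article["title"]].update(article.get("interlinks", []))
    return graph


# Synthetic records standing in for parsed lines of the JSON dump.
articles = [
    {"title": "Topic model", "interlinks": ["Latent Dirichlet allocation", "Text mining"]},
    {"title": "Latent Dirichlet allocation", "interlinks": ["Topic model"]},
    {"title": "Text mining"},
]
graph = build_link_graph(articles)
for source, targets in graph.items():
    print(source, "->", sorted(targets))
```

Because `set.update` iterates over dict keys, the same helper works whether interlinks are stored as a list of titles or as a title-to-text mapping.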