forked from piskvorky/gensim
-
Notifications
You must be signed in to change notification settings - Fork 4
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Change find_interlinks return type to list of tuples (piskvorky#2636)
* Change interlinks format to list of tuples. Fixes piskvorky#2635 This commit fixes the issue in piskvorky#2635 This commit changes the interlinks storage in the `segment_wiki` script from dictionary to a list of tuples. We can process the test wikidata used in the test suite of gensim to inspect the new behavior. ``` python gensim/scripts/segment_wiki.py -i \ -f ~/Downloads/enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2 \ -o ~/Downloads/enwiki-latest.json.gz ``` We get the following output: ``` $ cat ~/Downloads/enwiki-latest.json.gz | zcat | head -1 | jq -r '.interlinks[] | [.[0], .[1]] | @TSV' | sort | head -ism -ism 1848 Revolution 1848 Revolution 1917 October Revolution 1917 October Revolution 6 February 1934 crisis February 1934 riots A. S. Neill A. S. Neill AK Press AK Press Abu Hanifa Abu Hanifa Adolf Brand Adolf Brand Adolf Brand Adolf Brand Adolf Hitler Hitler ``` All tests pass for the related test file. ``` python -m unittest gensim.test.test_scripts /Users/smishra/miniconda3/envs/TwitterNER/lib/python3.7/bz2.py:131: ResourceWarning: unclosed file <_io.BufferedReader name='/Users/smishra/workspace/codes/python/gensim/gensim/test/test_data/enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2'> self._buffer = None ResourceWarning: Enable tracemalloc to get the object allocation traceback ..... ---------------------------------------------------------------------- Ran 5 tests in 6.298s OK ``` * Updated docstrings * Fixed flake8 issue of long line in docsrtring * Fixed comments and replaces assertTrue with assertEqual * Fixed unittest comment and checks for wikicorpus
- Loading branch information
1 parent
3e027c2
commit e102574
Showing
3 changed files
with
26 additions
and
19 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters