Fix "generator" language in word2vec docs (#2935)
* Fix docs about Word2Vec (fix #2934)

Docs say you can use a generator as the first argument, but you can't.

The tempfile path was also unused, so that's been removed.

* Fix language to make it clear streaming is supported

Technically a generator is a kind of iterator, so this clarifies that a
restartable iterator (as opposed to a consumable generator) is
necessary.

* Update gensim/models/word2vec.py

* Update CHANGELOG.md

Co-authored-by: Michael Penkov <m@penkov.dev>
polm and mpenkov authored Sep 16, 2020
1 parent 09b7e94 commit cddf3c1
Showing 2 changed files with 11 additions and 6 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -17,6 +17,7 @@ This release contains a major refactoring.
 ### :books: Tutorial and doc improvements

 * Clear up LdaModel documentation - remove claim that it accepts CSC matrix as input (PR [#2832](https://github.com/RaRe-Technologies/gensim/pull/2832), [@FyzHsn](https://github.com/FyzHsn))
+* Fix "generator" language in word2vec docs (PR [#2935](https://github.com/RaRe-Technologies/gensim/pull/2935), __[@polm](https://github.com/polm)__)

 ## :warning: 3.8.x will be the last gensim version to support Py2.7. Starting with 4.0.0, gensim will only support Py3.5 and above

16 changes: 10 additions & 6 deletions gensim/models/word2vec.py
@@ -39,18 +39,22 @@
 .. sourcecode:: pycon
->>> from gensim.test.utils import common_texts, get_tmpfile
+>>> from gensim.test.utils import common_texts
 >>> from gensim.models import Word2Vec
 >>>
->>> path = get_tmpfile("word2vec.model")
->>>
 >>> model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
 >>> model.save("word2vec.model")
-The training is streamed, meaning `sentences` can be a generator, reading input data
-from disk on-the-fly, without loading the entire corpus into RAM.
-It also means you can continue training the model later:
+The training is streamed, so ``sentences`` can be an iterable, reading input data
+from disk on-the-fly. This lets you avoid loading the entire corpus into RAM.
+However, note that because the iterable must be re-startable, `sentences` must
+not be a generator. For an example of an appropriate iterator see
+:class:`~gensim.models.word2vec.BrownCorpus`,
+:class:`~gensim.models.word2vec.Text8Corpus` or
+:class:`~gensim.models.word2vec.LineSentence`.
+If you save the model you can continue training it later:
 .. sourcecode:: pycon
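
The distinction this commit draws between a restartable iterable and a consumable generator can be illustrated without gensim. The following is a minimal sketch, not code from the PR: `RestartableCorpus` and `sentence_generator` are hypothetical names, and the one-sentence-per-line file format is an assumption made only for the demonstration. It shows that an object whose `__iter__` reopens the file can be traversed repeatedly, while a generator object yields nothing on a second pass.

```python
import os
import tempfile


class RestartableCorpus:
    """Hypothetical helper: streams one tokenized sentence per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopening the file on every call is what makes each pass start from the top.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


def sentence_generator(path):
    """Plain generator function: the object it returns can be consumed only once."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.split()


# Write a tiny two-sentence corpus to a temporary file for the demonstration.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("human interface computer\nsurvey user computer system\n")
    corpus_path = tmp.name

restartable = RestartableCorpus(corpus_path)
first_pass = list(restartable)
second_pass = list(restartable)   # the same sentences again

one_shot = sentence_generator(corpus_path)
gen_first_pass = list(one_shot)
gen_second_pass = list(one_shot)  # empty: the generator is already exhausted

os.remove(corpus_path)
```

An object like `restartable` is the kind of input the updated docstring asks for, since it still reads from disk lazily but can be iterated more than once; `one_shot` is the kind it rules out.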
