Implement `shrink_windows` argument for Word2Vec. #3169

M-Demay · 2021-06-09T15:22:35Z

Fixes #3159.
The idea is to allow the user to use a window with fixed-size, contrary to the current default behaviour which is to pick the window size in [1, window] uniformly at random. We want this alternative behaviour for a project.
We did it by adding a boolean parameter reduced_windows. For compatibility, we put it at the end of the arguments list, default to False.
The user can set it directly in the constructor of the class Word2Vec, or at any further step (for example, to retrain the model in a second step), since we propagated it to the downstream calls.

Co-Authored-By: Mathis Demay <mathis.demay.md@gmail.com>

gensim/models/word2vec_inner.pyx

gensim/models/word2vec_corpusfile.pyx

gensim/models/word2vec.py

gensim/models/word2vec_corpusfile.pyx

pandrey-fr · 2021-06-14T15:58:51Z

Tests are failing on old models' reloading due to the added shrink_windows attribute. I will try to understand where we need to edit the code to add the attribute, although I do not quite understand how old models' reloading work - apparently it does not run things through the current version's object instantiation process? Otherwise the newly-introduced parameter should be added to the loaded models (with a default value of True that implements the same behaviour as the legacy one).

gojomo · 2021-06-15T01:48:22Z

In general the right place to path older models, upon load, to ensure they have all currently-required properties is _load_specials() - and a new parameter should be assigned the value that would preserve the prior behavior. See for example all the current check-and-add-if-missing logic:

https://github.com/RaRe-Technologies/gensim/blob/7b6e73a7dd2bf6e74c6b7099c02fe49d282156ea/gensim/models/word2vec.py#L1955-L1960

Separately: it occurs to me that it should be checked that the new option will work in Word2Vec's closely-related subclasses, Doc2Vec and FastText, as well. (It will likely require similar changes in their .pyx files, and perhaps some attention to their __init__() methods.)

pandrey-fr · 2021-06-15T21:39:25Z

Thank you for clarifying things about old models' reloading; it is now fixed.

I also deployed the parameter to Doc2Vec and FastText (both exposing it as part of their Python API and implementing similar tweaks to the backend pyx files as for word2vec). Otherwise the behaviour would have stayed coherent with the legacy shrink_windows=True, but it does feel right to have it properly implemented by children models!

On my Linux machine all tests passed, although I am wondering whether it would be appropriate to tweak those as well, so that some test models may also run with the non-default shrink_windows=False parameter? Apart from the code compiling I did check manually that it worked for Word2Vec but not for the other models. Not that I have actual doubts at this point about the implementation working (although I may of course be mistaken), but perhaps adding tests would be suited for future maintainability.
Do you have any opinion on that matter, and if you do want tests to be added, could you please tell me what you would expect? The simplest seems to be to add specific tests just to check that instantiation/training runs with the non-default parameter value, but I can't think of a simple way to additionally assert that the expected backend behaviour is indeed implemented.

gojomo · 2021-06-16T17:49:02Z

Yes, it'd be good to ensure at least one test method, for each class, tries the non-default shrink_windows=False code path.

If there were an easy way to test that it's having the desired effect, that'd be great - but it's a subtle thing to test. Its effects on final vectors aren't easily characterizable. The tangible external effect will be: training runs about twice as long. But, running some training twice, once each way, timing it, and asserting one was "around" twice as long seems a bit flaky & low-value to me, compared to the amount of runtime it'd add to the automated testing.

Because of that, I think careful review of the source for intent, along with the old path definitely still working as before, and the new path not erroring, is enough for the unit-testing of this.

It might make sense to mark the doc-comment for such a new, niche feature as: "New in 4.1. Experimental."

* `c.reduced_windows[:] = 0` syntax is not supported; as a consequence, all zero-value assignments due to `shrink_windows=False` have been rewritten as `for` loops. * Those changes have now been tested (cf. next commit). * NOTE: another way to proceed could be to initialize the `reduced_windows` array with zeros as values; it would then only be altered when `shrink_windows=True`.

pandrey-fr · 2021-06-17T09:28:10Z

Alright, I just pushed three additional commits:

(1) Backend changes; as it turns out, the previously-suggested use of c.reduced_windows[:] = 0 instead of a for loop is not working (it compiled but failed); I also tried using c.reduced_windows.fill(0) but it turns out that in the pyx files, the arrays are treated as lists instead of numpy arrays, so either approach failed. I switched back to for loops; it adds a few lines of (readable) code but it does work.

(2) Additional tests, following @gojomo's recommendations. I actually added 4 tests to both Word2Vec, Doc2Vec and FastText so as to test all underlying code branches (cbow and skipgram, on an iterable corpus or on a corpus file). If you would rather remove some of these to limit the extra testing time, feel free to trim some off.

(3) As you suggested, I added the "experimental" status line to the parameter's docstring in all three models.

On the side, I note that:

Most Doc2vec tests labeled to be running from a corpus file look like they actually use list_corpus instead. I might be wrong, but it might be worth taking a look and fixing things (in a separate PR) if needed.
FastText tests could use some refactoring. To avoid copy/pasting long blocks of redundant code, I tweaked some of the existing ones to simply call them a second time passing shrink_window=False in the ones that this PR adds.

piskvorky · 2021-06-17T11:20:35Z

I switched back to for loops

This is desirable, thanks. The for-loop is clearer and I see you also made it more efficient (less looping).

gensim/models/doc2vec_inner.pyx

mpenkov

Looking good! I'm late to the party, so I see most of the issues have been ironed out. I did manage to find a few nitpicks and left you some comments. Please have a look.

mpenkov · 2021-06-22T01:29:27Z

gensim/models/doc2vec_inner.pyx

@@ -365,8 +365,12 @@ def train_document_dbow(model, doc_words, doctag_indexes, alpha, work=None,

    if c.train_words:
        # single randint() call avoids a big thread-synchronization slowdown
-        for i, item in enumerate(model.random.randint(0, c.window, c.document_len)):
-            c.reduced_windows[i] = item
+        if model.shrink_windows:


This is a bit repetitive.

Wouldn't it be better to define a model.initialize_reduced_windows method that does this, and call it when needed?

def initialize_reduced_windows(self, ...): if model.shrink_windows: for i, item in ...: ... else: ...

That way, if we find a logic error in the initialization code, we don't have to remember to fix it in a half a dozen other places (e.g. https://github.com/RaRe-Technologies/gensim/pull/3169/files#r649243075).

We should probably also do a bounds check on c.reduced_windows to ensure that it's at least document_len elements long.

I agree with you on the potential for refactored code that would ease future maintenance and updates. That being said, I actually built on the existing, repetitive code, so in my humble opinion this could be handled as a distinct PR (this one would add a feature, while a second one would enhance the validated, working code).

I also think that your suggestion of a model.initialize_reduced_windows is not entirely suitable here:
If I am not mistaken, in this code, c is not model: it is a special C structure (of a different class depending on the trained model class: word2vec, doc2vec or fasttext), and the reduced_windows PyArray is initialized to have the proper size, so there should not be a need to do a bounds check.
I do not know how we could implement a common method (or function) that would also fill the array with either deterministic or random values; but I do agree with you that it could be worth it.

To be honest I lack experience with cython, so that I do not feel entirely at ease with diving into this refactoring. I am willing to give it a try at some point if you want, but would prefer it to be a distinct effort (and PR) from the current one - partly because Mathis and I are pushing the shrink_windows feature due to our needing it for a current research project at work, while contributing to refactoring the code base would be something I would (willingly) do on my time off.

Alas, there's a lot of pre-existing cut & paste duplication between all these related algorithms & modes that could be refactored. I think that'd be OK to note-and-defer-for-later, as either an issue or a FIXME comment in the code.

OK, we can deal with this separately.

gensim/models/word2vec.py

gensim/test/test_fasttext.py

mpenkov · 2021-06-22T01:38:54Z

gensim/test/test_fasttext.py

@@ -497,12 +497,12 @@ def test_sg_hs_training(self):
            overlap_count, 2,
            "only %i overlap in expected %s & actual %s" % (overlap_count, expected_sims_words, sims_gensim_words))

-    def test_sg_hs_training_fromfile(self):
+    def test_sg_hs_training_fromfile(self, shrink_windows=True):


https://docs.pytest.org/en/6.2.x/example/parametrize.html

Suggested change

def test_sg_hs_training_fromfile(self, shrink_windows=True):

@pytest.mark.parametrize('shrink_windows', [True, False])

def test_sg_hs_training_fromfile(self, shrink_windows):

I pushed two commits making use of that feature. The first one only applied to the FastText tests, as they were the one you attached this review to. The second one applies to the other two test sets (wor2vec and doc2vec); if you feel like the latter should not have been changed, we can discard this commit.

Update: actually, pytest.mark.parametrize does not work as expected on methods (although it does on functions). I will try to find a way to make it work, otherwise I will undo the last two commits.

Oh, in fact it probably will not work due to the test classes being unittest.TestCase subclasses.

https://docs.pytest.org/en/6.2.x/unittest.html

The following pytest features do not work [in unittest.TestCase subclasses], and probably never will due to different design philosophies: Fixtures (except for autouse fixtures, see below); Parametrization; Custom hooks;

I am revoking the last two commits as a consequence.

Thanks for pointing that out. I didn't know pytest parameterization is incompatible with unittest.TestCase.

Luckily, in this particular case, the unittest.TestCase class isn't doing anything. I'll get parameterization working in a separate commit.

Well, we both learned something :)

mpenkov · 2021-06-29T06:09:54Z

Merged. Thank you @M-Demay !

The repo wasn't accepting maintainer commits, so I'm taking care of this here.

Implemented reduced_windows argument for Word2Vec.

6afb3ff

Co-Authored-By: Mathis Demay <mathis.demay.md@gmail.com>

piskvorky requested changes Jun 9, 2021

View reviewed changes

gensim/models/word2vec_inner.pyx Outdated Show resolved Hide resolved

gensim/models/word2vec_inner.pyx Outdated Show resolved Hide resolved

gensim/models/word2vec_corpusfile.pyx Show resolved Hide resolved

piskvorky requested changes Jun 9, 2021

View reviewed changes

gensim/models/word2vec.py Outdated Show resolved Hide resolved

gojomo reviewed Jun 9, 2021

View reviewed changes

gensim/models/word2vec.py Outdated Show resolved Hide resolved

gojomo reviewed Jun 9, 2021

View reviewed changes

gensim/models/word2vec.py Outdated Show resolved Hide resolved

gojomo reviewed Jun 9, 2021

View reviewed changes

gensim/models/word2vec.py Outdated Show resolved Hide resolved

gojomo reviewed Jun 9, 2021

View reviewed changes

gensim/models/word2vec_corpusfile.pyx Outdated Show resolved Hide resolved

pandrey-fr force-pushed the fixed-winsize branch from ba747a7 to f323b68 Compare June 10, 2021 15:15

pandrey-fr added 2 commits June 10, 2021 17:42

Improve the way reduced_windows is passed around and used.

9bdb5a1

Renamed reduced_windows to shrink_windows.

77698cd

pandrey-fr force-pushed the fixed-winsize branch from f323b68 to 77698cd Compare June 10, 2021 15:42

pandrey-fr and others added 2 commits June 11, 2021 18:24

Removed shrink_windows argument from Word2Vec.train.

11a464f

Merge branch 'develop' into fixed-winsize

7b6e73a

pandrey-fr added 5 commits June 15, 2021 22:47

Aesthetic fix.

a657518

Fixed old word2vec models' reloading.

ff0d30e

Fixed undue docstring.

f98eb43

Added shrink_windows argument to Doc2Vec.

3b27f1c

Added shrink_windows argument to FastText.

27a49d7

gojomo changed the title ~~Implemented reduced_windows argument for Word2Vec.~~ Implement shrink_windows argument for Word2Vec. Jun 16, 2021

pandrey-fr added 3 commits June 17, 2021 10:45

Added tests for shrink_windows=False in Word2Vec-based models.

486b3f5

Added docstring mentions of shrink_window being experimental.

f0af84e

gojomo reviewed Jun 17, 2021

View reviewed changes

gensim/models/doc2vec_inner.pyx Outdated Show resolved Hide resolved

mpenkov requested changes Jun 22, 2021

View reviewed changes

Rolled back some purely aesthetic changes.

49cf50e

pandrey-fr force-pushed the fixed-winsize branch from 3481607 to 49cf50e Compare June 22, 2021 15:56

mpenkov mentioned this pull request Jun 29, 2021

Reduce duplication in word2vec.pyx source code #3184

Open

mpenkov merged commit 2b9b1b3 into piskvorky:develop Jun 29, 2021

mpenkov added a commit that referenced this pull request Jun 29, 2021

polishing up after #3169

2a41200

The repo wasn't accepting maintainer commits, so I'm taking care of this here.

gojomo mentioned this pull request Nov 15, 2021

Is this a bug in the CBOW code or my misunderstanding? #1873

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement `shrink_windows` argument for Word2Vec. #3169

Implement `shrink_windows` argument for Word2Vec. #3169

M-Demay commented Jun 9, 2021

pandrey-fr commented Jun 14, 2021

gojomo commented Jun 15, 2021 •

edited

Loading

pandrey-fr commented Jun 15, 2021

gojomo commented Jun 16, 2021 •

edited

Loading

pandrey-fr commented Jun 17, 2021

piskvorky commented Jun 17, 2021

mpenkov left a comment

mpenkov Jun 22, 2021

pandrey-fr Jun 22, 2021

gojomo Jun 22, 2021

mpenkov Jun 29, 2021

mpenkov Jun 22, 2021

pandrey-fr Jun 22, 2021

pandrey-fr Jun 22, 2021

pandrey-fr Jun 22, 2021

pandrey-fr Jun 22, 2021

mpenkov Jun 29, 2021

pandrey-fr Jun 30, 2021

mpenkov commented Jun 29, 2021

	def test_sg_hs_training_fromfile(self, shrink_windows=True):
	@pytest.mark.parametrize('shrink_windows', [True, False])
	def test_sg_hs_training_fromfile(self, shrink_windows):

Implement shrink_windows argument for Word2Vec. #3169

Implement shrink_windows argument for Word2Vec. #3169

Conversation

M-Demay commented Jun 9, 2021

pandrey-fr commented Jun 14, 2021

gojomo commented Jun 15, 2021 • edited Loading

pandrey-fr commented Jun 15, 2021

gojomo commented Jun 16, 2021 • edited Loading

pandrey-fr commented Jun 17, 2021

piskvorky commented Jun 17, 2021

mpenkov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mpenkov commented Jun 29, 2021

Implement `shrink_windows` argument for Word2Vec. #3169

Implement `shrink_windows` argument for Word2Vec. #3169

gojomo commented Jun 15, 2021 •

edited

Loading

gojomo commented Jun 16, 2021 •

edited

Loading