FastText with hs=1 and negative>0 #2550

Closed

mino98 opened this issue Jul 3, 2019 · 6 comments

mino98 commented Jul 3, 2019

The docs say:

hs ({1,0}, optional) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.
negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.

So I would expect that if hs=1, the model will use hierarchical softmax and the value of negative is irrelevant, right?

This doesn't seem to be the case: if I run two perfectly deterministic runs (i.e., workers=1, a fixed seed, and PYTHONHASHSEED set) on the same input with:

  1. hs=1 negative=0, and
  2. hs=1 negative=5

the resulting word vectors have different values.

How can hs and negative coexist? I've looked at the code but I couldn't find any place implementing the "exclusive" logic implied by the documentation above.
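
A minimal sketch of such a comparison (hedged: gensim 4.x parameter names, where 3.x uses size/iter instead of vector_size/epochs; the corpus below is only a placeholder):

    # Hedged sketch: gensim 4.x parameter names; run under a fixed PYTHONHASHSEED,
    # e.g. `PYTHONHASHSEED=0 python compare.py`, so the runs are deterministic.
    from gensim.models import FastText

    sentences = [["hello", "world"], ["another", "example", "sentence"]]  # placeholder corpus
    common = dict(vector_size=50, window=3, min_count=1, workers=1, seed=42, epochs=5)

    m1 = FastText(sentences=sentences, hs=1, negative=0, **common)
    m2 = FastText(sentences=sentences, hs=1, negative=5, **common)

    # If hs=1 made `negative` irrelevant, these two vectors would be identical.
    print((m1.wv["hello"] == m2.wv["hello"]).all())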

gojomo commented Jul 12, 2019

I don't read that documentation text as necessarily implying that hs=1 forces the value of negative to be ignored, or that the two modes are necessarily exclusive. If hs=1, hierarchical-softmax is used (as it says). If negative>0, then negative-sampling is used (as it also says). Hence, if both are non-zero, it can be read as meaning both are used. (I'd agree it could be clearer.)

Historically, it just so happens that the original word2vec.c allowed both hs=1 and negative>0 to cause both output layers to be instantiated, and trained, in an interleaved manner. There's probably no good reason to do both, but it worked – and so gensim's word2vec support, closely modeled after that code, allowed the same thing. I'm not sure what FB's initial fasttext implementation did, but it may have behaved the same way, and in any case, it appears from your report that dual-mode training is what's happening. (A further confirmation would be if hs=1, negative=5 training takes about as long as hs=1, negative=0 training plus hs=0, negative=5 training.)
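
A rough way to run that timing check (hedged sketch: gensim 4.x parameter names, and the repeated common_texts corpus is only a stand-in for something large enough that the differences are visible):

    # Hedged sketch of the timing comparison suggested above.
    import time
    from gensim.models import FastText
    from gensim.test.utils import common_texts

    corpus = common_texts * 2000  # stand-in; use a real, larger corpus in practice

    def train_seconds(**kwargs):
        start = time.time()
        FastText(sentences=corpus, vector_size=50, min_count=1, workers=1, epochs=5, **kwargs)
        return time.time() - start

    t_hs = train_seconds(hs=1, negative=0)
    t_neg = train_seconds(hs=0, negative=5)
    t_both = train_seconds(hs=1, negative=5)

    # If both output layers are trained in an interleaved fashion,
    # t_both should come out roughly equal to t_hs + t_neg.
    print(f"hs: {t_hs:.1f}s  neg: {t_neg:.1f}s  both: {t_both:.1f}s")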

mino98 commented Jul 15, 2019

Thanks @gojomo for clarifying the historical reasons behind it.
I also didn't see a reason for using both, and I'm still unsure how such "dual-mode training" actually works: are we doing negative sampling before hs? What do you mean by "in an interleaved manner"?

I'll again try to follow the flow in the source, but it's quite confusing to do so for this specific case.
Thanks again.

gojomo commented Jul 15, 2019

Essentially, both hierarchical-softmax and negative-sampling are different ways to interpret the output-layer of the neural-network, then assess the errors for back-propagation (through the "hidden layer" to the input word-vectors). If both are enabled, two sets of internal NN hidden-to-output weights are allocated. (Historically in the word2vec.c source & earlier gensim versions, these were called syn1 for HS and syn1neg for negative-sampling.)
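
One way to see the dual allocation (hedged sketch: attribute names as exposed in gensim 4.x, where both weight sets sit directly on the model; 3.x kept them under model.trainables):

    # Hedged sketch: with both modes enabled, both hidden-to-output
    # weight sets should be allocated (gensim 4.x attribute names).
    from gensim.models import FastText
    from gensim.test.utils import common_texts

    m = FastText(sentences=common_texts, vector_size=50, min_count=1,
                 workers=1, epochs=1, hs=1, negative=5)

    print(hasattr(m, "syn1"), hasattr(m, "syn1neg"))  # expect: True True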

When doing the main, example-by-example training loop, each type of training was considered in turn, so it was like having two models with a shared set of input word-vectors. One sentence would go through all the normal steps of HS training, using syn1 to calculate output values, with back-propagated corrections to the word-vectors – then all the steps of negative-sampling training, using syn1neg to calculate output values, with back-propagated corrections to the same word-vectors. Essentially, shared simultaneous training. So with both enabled, even a setting like epochs=5 was essentially doing 5 HS epochs and 5 negative-sampling epochs – interleaved.

That could look like a benefit, if you weren't counting run-time. "Wow, 5 epochs with both enabled is better than either alone!" Perhaps, but probably not as good as giving either mode more total run-time, for example extra epochs.

mino98 commented Jul 16, 2019

Thanks a lot @gojomo, very clear now. I'll close this issue.

mino98 closed this as completed Jul 16, 2019
datistiquo commented

@mino98 @gojomo

I get a bug if I use hs=1 with negative=0 when updating the vocab:

model.build_vocab(sentences=s, update=True)

gives:

  File "C:\Users\\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 1734, in update_weights
    self.syn1 = vstack([self.syn1, zeros((gained_vocab, self.layer1_size), dtype=REAL)])
AttributeError: 'FastTextTrainables' object has no attribute 'syn1'

Maybe I should open a new issue.

Do you get the same error if you try to continue training with the above hyperparameters?
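
A minimal reproduction along the lines of this report might look like the following (hedged sketch: gensim 3.x API, matching the FastTextTrainables object in the traceback; both corpora are placeholders):

    # Hedged sketch of a possible minimal reproduction (gensim 3.x API).
    from gensim.models import FastText

    sentences = [["hello", "world"], ["another", "example", "sentence"]]
    model = FastText(sentences=sentences, size=50, min_count=1, hs=1, negative=0)

    new_sentences = [["some", "new", "words"]]
    # Per the report above, this is the call that raises:
    #   AttributeError: 'FastTextTrainables' object has no attribute 'syn1'
    model.build_vocab(sentences=new_sentences, update=True)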

gojomo commented May 18, 2020

If you have a minimal example to reproduce this error, you should file a new issue with the complete set of steps.
