FastText with hs=1 and negative>0 #2550

Closed

mino98 opened this issue Jul 3, 2019 · 6 comments

mino98 commented Jul 3, 2019

The docs say:

hs ({1,0}, optional) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.
negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.

So I would expect that if hs=1, the model will use hierarchical softmax and the value of negative is irrelevant, right?

This doesn't seem to be the case: if I run two perfectly deterministic runs (i.e., workers=1, a fixed seed, and PYTHONHASHSEED set) on the same input with:

  1. hs=1 negative=0, and
  2. hs=1 negative=5

the resulting word vectors have different values.

How can hs and negative coexist? I've looked at the code but I couldn't find any place implementing the "exclusive" logic implied by the documentation above.
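
A minimal sketch of such a comparison (hedged: gensim 4.x parameter names, where 3.x uses size/iter instead of vector_size/epochs; the corpus below is only a placeholder):

    # Hedged sketch: gensim 4.x parameter names; run under a fixed PYTHONHASHSEED,
    # e.g. `PYTHONHASHSEED=0 python compare.py`, so the runs are deterministic.
    from gensim.models import FastText

    sentences = [["hello", "world"], ["another", "example", "sentence"]]  # placeholder corpus
    common = dict(vector_size=50, window=3, min_count=1, workers=1, seed=42, epochs=5)

    m1 = FastText(sentences=sentences, hs=1, negative=0, **common)
    m2 = FastText(sentences=sentences, hs=1, negative=5, **common)

    # If hs=1 made `negative` irrelevant, these two vectors would be identical.
    print((m1.wv["hello"] == m2.wv["hello"]).all())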

gojomo commented Jul 12, 2019

I don't read that documentation text as necessarily implying that hs=1 forces the value of negative to be ignored, or that the two modes are necessarily exclusive. If hs=1, hierarchical-softmax is used (as it says). If negative>0, then negative-sampling is used (as it also says). Hence, if both are non-zero, it can be read as meaning both are used. (I'd agree it could be clearer.)

Historically, it just so happens that the original word2vec.c allowed both hs=1 and negative>0 to cause both output layers to be instantiated, and trained, in an interleaved manner. There's probably no good reason to do both, but it worked – and so gensim's word2vec support, closely modeled after that code, allowed the same thing. I'm not sure what FB's initial fasttext implementation did, but it may have behaved the same way, and in any case, it appears from your report that dual-mode training is what's happening. (A further confirmation would be if hs=1, negative=5 training takes about as long as hs=1, negative=0 training plus hs=0, negative=5 training.)
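
A rough way to run that timing check (hedged sketch: gensim 4.x parameter names, and the repeated common_texts corpus is only a stand-in for something large enough that the differences are visible):

    # Hedged sketch of the timing comparison suggested above.
    import time
    from gensim.models import FastText
    from gensim.test.utils import common_texts

    corpus = common_texts * 2000  # stand-in; use a real, larger corpus in practice

    def train_seconds(**kwargs):
        start = time.time()
        FastText(sentences=corpus, vector_size=50, min_count=1, workers=1, epochs=5, **kwargs)
        return time.time() - start

    t_hs = train_seconds(hs=1, negative=0)
    t_neg = train_seconds(hs=0, negative=5)
    t_both = train_seconds(hs=1, negative=5)

    # If both output layers are trained in an interleaved fashion,
    # t_both should come out roughly equal to t_hs + t_neg.
    print(f"hs: {t_hs:.1f}s  neg: {t_neg:.1f}s  both: {t_both:.1f}s")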

mino98 commented Jul 15, 2019

Thanks @gojomo for clarifying the historical reasons behind it.
I also didn't see a reason for using both, and I'm still unsure how such "dual-mode training" actually works: are we doing negative sampling before hs? What do you mean by "in an interleaved manner"?

I'll again try to follow the flow in the source, but it's quite confusing to do so for this specific case.
Thanks again.

gojomo commented Jul 15, 2019

Essentially, both hierarchical-softmax and negative-sampling are different ways to interpret the output-layer of the neural-network, then assess the errors for back-propagation (through the "hidden layer" to the input word-vectors). If both are enabled, two sets of internal NN hidden-to-output weights are allocated. (Historically in the word2vec.c source & earlier gensim versions, these were called syn1 for HS and syn1neg for negative-sampling.)
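
One way to see the dual allocation (hedged sketch: attribute names as exposed in gensim 4.x, where both weight sets sit directly on the model; 3.x kept them under model.trainables):

    # Hedged sketch: with both modes enabled, both hidden-to-output
    # weight sets should be allocated (gensim 4.x attribute names).
    from gensim.models import FastText
    from gensim.test.utils import common_texts

    m = FastText(sentences=common_texts, vector_size=50, min_count=1,
                 workers=1, epochs=1, hs=1, negative=5)

    print(hasattr(m, "syn1"), hasattr(m, "syn1neg"))  # expect: True True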

When doing the main, example-by-example training loop, each type of training was considered in turn, so it was like having two models with a shared set of input word-vectors. One sentence would go through all the normal steps of HS training, using syn1 to calculate output values, with back-propagated corrections to the word-vectors – then all the steps of negative-sampling training, using syn1neg to calculate output values, with back-propagated corrections to the same word-vectors. Essentially, shared simultaneous training. So with both enabled, even a setting like epochs=5 was essentially doing 5 HS epochs and 5 negative-sampling epochs – interleaved.

That could look like a benefit, if you weren't counting run-time. "Wow, 5 epochs with both enabled is better than either alone!" Perhaps, but probably not as good as giving either mode more total run-time, for example extra epochs.

mino98 commented Jul 16, 2019

Thanks a lot @gojomo, very clear now. I'll close this issue.

mino98 closed this as completed Jul 16, 2019
datistiquo commented

@mino98 @gojomo

I get a bug if I use hs=1 with negative=0 when updating the vocab:

model.build_vocab(sentences=s, update=True)

gives:

  File "C:\Users\\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 1734, in update_weights
    self.syn1 = vstack([self.syn1, zeros((gained_vocab, self.layer1_size), dtype=REAL)])
AttributeError: 'FastTextTrainables' object has no attribute 'syn1'

Maybe I should open a new issue.

Do you get the same error if you try to continue training with the above hyperparameters?
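
A minimal reproduction along the lines of this report might look like the following (hedged sketch: gensim 3.x API, matching the FastTextTrainables object in the traceback; both corpora are placeholders):

    # Hedged sketch of a possible minimal reproduction (gensim 3.x API).
    from gensim.models import FastText

    sentences = [["hello", "world"], ["another", "example", "sentence"]]
    model = FastText(sentences=sentences, size=50, min_count=1, hs=1, negative=0)

    new_sentences = [["some", "new", "words"]]
    # Per the report above, this is the call that raises:
    #   AttributeError: 'FastTextTrainables' object has no attribute 'syn1'
    model.build_vocab(sentences=new_sentences, update=True)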

gojomo commented May 18, 2020

If you have a minimal example to reproduce this error, you should file a new issue with the complete set of steps.
