Implement position-dependent weighting to fastText #2840
Comments
Sure, why not. The motivation is clear, and it shouldn't disrupt users. I haven't looked into these papers (or perhaps forgot already), but what is the expected impact on performance (speed) and code complexity? Why wouldn't you want to use this weighting, as a user?
For every position in the context window ([-c, ..., −1, 1, ..., c]), you need to maintain a D-dimensional position vector in a 2c × D weight matrix. The weight matrix needs to be stored with the model. For each forward pass of the CBOW, we need to reweight the context word vectors with the position vectors. For each backward pass, we also need to compute and apply the gradient to the position vectors. Since 2c is small (default is 10), the impact on speed and storage size should be negligible.
According to the papers, the weighting should be a net benefit.
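In NumPy terms, the idea would look something like this (an illustrative sketch, not the actual Cython code; `position_vectors` is just a name for the 2c × D matrix described above):

```python
import numpy as np

rng = np.random.default_rng(0)
D, c = 300, 5                      # vector dimensionality, half-window size (2c = 10 positions)

# One D-dimensional weight vector per context position: a 2c x D matrix,
# learned jointly with the word vectors and saved with the model.
position_vectors = rng.uniform(-1.0, 1.0, (2 * c, D)).astype(np.float32)

def cbow_forward(context_vectors):
    """Forward pass: re-weight each context word vector elementwise by its
    position vector, then average (Mikolov et al. 2017, Section 2.2)."""
    return (context_vectors * position_vectors).mean(axis=0)

def position_gradient(context_vectors, grad_context):
    """Backward pass: the averaged context is linear in each position
    vector, so each position vector's gradient is the incoming gradient
    re-weighted elementwise by the matching context word vector, over 2c."""
    return context_vectors * grad_context / (2 * c)
```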
I am. Hopefully, I can get to it by the end of May.
If it's truly a benefit for many/all uses, and optional/negligible-cost for others, it seems a valuable option. I'm always a bit wary of papers' claimed marginal advantages: they often cherry-pick some set of configs/evals, & ignore other tradeoffs or possible avenues to the same improvement, to make their novelty seem stronger. (So for example: better on analogies doesn't always mean better for other IR/classification tasks. Or sometimes a "5 point improvement" glosses over any extra time/memory required by the technique, or the fact that some other existing meta-parameter optimization, if you're willing to spend more time/memory, can get the same improvement.) So I would hope any contribution would include some demos of its advantages, and proof of its negligible effect when unused.

If I understand correctly, the positional weights are learned along with the vectors. Do they vary per target word? And, to the extent this replaces the crude `reduced_windows`-based down-weighting of more-distant context words, does it supersede that mechanism or coexist with it?

Note that the FT classes have gotten a major reworking (and a lot of head-scratching nonsense has been removed) in my work-in-progress big KeyedVectors refactor (PR #2698), which is almost stable enough for wider review. (But, at the moment, it will still get temp commits, rebases & force-pushes.) So you'd likely want to base any changes on that, rather than `develop`.

And: doing a bunch of cleanup/maintenance has left a bad taste in my mouth for one-off contributions of narrow utility, with varied code styles, and crude shortcuts.
My plan is to measure speed and word analogy task accuracy for all 10 languages four times:

1. before #2698, without my changes,
2. on #2698, without my changes,
3. on #2698, with my changes applied but position-dependent weighting disabled, and
4. on #2698, with position-dependent weighting enabled.

The expected scenario is that the first three tests yield the same results, with insignificant differences in both accuracy and speed (otherwise my implementation or #2698 is incorrect), and that the fourth test reproduces Grave's accuracies (above) with insignificant speed loss.
They do not; they vary only with position (see Mnih et al. [2013, Section 3] and Mikolov et al. [2017, Section 2.2]).
My plan is to have a new method.
That's a pretty good & reassuring plan. Unless you've already got an implementation, be aware there's enough refactoring in #2698 that doing it on both the pre- and post-refactor code would mean significant duplicated effort.

Adding a separate array for the positional weights may be the simplest option. Appending the positional weights to the same array as the full-word vectors might work, but might add complications to the extent those are shared with the lookup-centric (and independently usable/savable/mutable) `KeyedVectors`. (Also: do you mean …?)

Thinking more generally about the range of possibilities suggested by this optimization – & so, more 'blue sky' thoughts than any sort of direct constraints on your contribution – it strikes me that there are several further directions worth exploring.

Again: more researchy-thoughts than anything else, but wanted to dump them here where they've been triggered.
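For concreteness, the two storage layouts mentioned above might look like this (illustrative shapes and names only; `position_vectors` is hypothetical):

```python
import numpy as np

D, c, vocab_size = 300, 5, 100_000
rng = np.random.default_rng(0)

# Option 1 (hypothetical): a separate 2c x D array, saved alongside the model.
position_vectors = rng.uniform(-1.0, 1.0, (2 * c, D)).astype(np.float32)

# Option 2 (hypothetical): 2c extra rows appended to the full-word vector
# array and accessed through a slice -- fewer arrays to manage, but it
# entangles the weights with the lookup-centric KeyedVectors storage.
vectors = rng.uniform(-1.0, 1.0, (vocab_size + 2 * c, D)).astype(np.float32)
position_view = vectors[vocab_size:]
```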
My understanding is that the NLP world has moved on to unified "deep NN" architectures (e.g. the transformer), and word embeddings are now used primarily in that context. So questions around internal *2vec trade-offs and APIs must be considered from that perspective: what makes the production and use of trained vectors easier? The advantage of gensim's *2vec is its unique scalability, performance, cost (free) and Python. We don't want to compromise on that, but beyond that I have no preference for what-wraps-what.

I'd like to see the Gensim pipeline evolve with more complex document & similarity modeling (gensim's mission), of course. But seeing how the external contributions went over the past 4 years (one disaster after another, in my estimation), I know gensim cannot possibly survive more of the same. That leaves major changes to me and perhaps 1-2 other contributors whom I trust to hold the course. But who are all quite busy, so… not holding my breath. Good SW engineers are hard to come by, while "AI researchers" and "data scientists" are a dime a dozen.
@Witko - definitely!
@gojomo Sanity checks for the new architecture seem to be passing.
You may also notice that the memory use & saved size of the model are much, much smaller.
Indeed. Seems like we're no longer storing the `vectors_ngrams_lockf` & `vectors_vocab_lockf` arrays.

These are the model files (15G) stored via the pre-refactor code (`ref=2360459`):

```
$ du -hc data/models/*_ref=2360459_*
154M  data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32
4.0K  data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.accuracy
4.0K  data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.duration
2.6G  data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.trainables.syn1neg.npy
2.3G  data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.trainables.vectors_ngrams_lockf.npy
2.6G  data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.trainables.vectors_vocab_lockf.npy
2.3G  data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.wv.vectors_ngrams.npy
2.6G  data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.wv.vectors.npy
2.6G  data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.wv.vectors_vocab.npy
15G   total
```

These are the model files (11G) stored via #2698 (`ref=c0e0169`):

```
$ du -hc data/models/*_ref=c0e0169_*
343M  data/models/fasttext-model_alpha=0.05_bucket=2000000_epochs=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=c0e0169_sample=0.0001_sg=0_vector-size=300_window=5_workers=32
4.0K  data/models/fasttext-model_alpha=0.05_bucket=2000000_epochs=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=c0e0169_sample=0.0001_sg=0_vector-size=300_window=5_workers=32.accuracy
4.0K  data/models/fasttext-model_alpha=0.05_bucket=2000000_epochs=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=c0e0169_sample=0.0001_sg=0_vector-size=300_window=5_workers=32.duration
2.6G  data/models/fasttext-model_alpha=0.05_bucket=2000000_epochs=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=c0e0169_sample=0.0001_sg=0_vector-size=300_window=5_workers=32.syn1neg.npy
2.3G  data/models/fasttext-model_alpha=0.05_bucket=2000000_epochs=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=c0e0169_sample=0.0001_sg=0_vector-size=300_window=5_workers=32.wv.vectors_ngrams.npy
2.6G  data/models/fasttext-model_alpha=0.05_bucket=2000000_epochs=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=c0e0169_sample=0.0001_sg=0_vector-size=300_window=5_workers=32.wv.vectors.npy
2.6G  data/models/fasttext-model_alpha=0.05_bucket=2000000_epochs=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=c0e0169_sample=0.0001_sg=0_vector-size=300_window=5_workers=32.wv.vectors_vocab.npy
11G   total
```
Another forthcoming PR, #2892, will avoid saving the derivable `wv.vectors` array for fastText models.
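If #2892 indeed drops that derivable array, rebuilding the full-word vectors on load might look roughly like this (a sketch; `ngram_hashes` is a hypothetical stand-in for gensim's n-gram hashing):

```python
import numpy as np

def recompute_word_vectors(vectors_vocab, vectors_ngrams, words, ngram_hashes):
    """Rebuild the full-word vectors from the trainable arrays: each word's
    vector is the average of its vocab vector and the vectors of its hashed
    character n-grams. `ngram_hashes(word)` is a hypothetical helper that
    returns the word's n-gram bucket indices."""
    vectors = np.empty_like(vectors_vocab)
    for i, word in enumerate(words):
        buckets = ngram_hashes(word)
        vectors[i] = (vectors_vocab[i] + vectors_ngrams[buckets].sum(axis=0)) \
            / (len(buckets) + 1)
    return vectors
```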
@gojomo Those will be welcome improvements. Meanwhile, I have produced a first implementation of the position-dependent weighting, and I am currently training a model. There is currently no BLAS: we need a Hadamard product followed by a dot product ([position vector ∘ context word vector]ᵀ ⋅ center word vector) without intermediary arrays (`neu1` and `work` are already occupied), which will need some more thinking over, so I just did the simplest thing that works for now. If the implementation works, I will create a PR in the next week.
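For illustration, the fused operation might look like this in NumPy terms (a sketch of the idea, not the actual Cython inner loop):

```python
import numpy as np

def weighted_score(position_vec, context_vec, center_vec):
    """Compute (position_vec ∘ context_vec) · center_vec in a single fused
    pass, so no intermediate D-dimensional array is allocated; standard
    BLAS offers no single routine for this three-way product."""
    total = 0.0
    for d in range(len(center_vec)):
        total += position_vec[d] * context_vec[d] * center_vec[d]
    return total

# With a temporary allowed, the same value is
#   np.dot(position_vec * context_vec, center_vec)
# or, without allocating one,
#   np.einsum('d,d,d->', position_vec, context_vec, center_vec)
```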
Our results show that just a few (10 out of 300) vector dimensions need to be positionally weighted to receive the benefit of full position-dependent weighting (rightmost, 72%* accuracy). This makes our implementation of position-dependent weighting practically fast. More details in #2905 (comment).

* All accuracies are currently reported on a small dev set (2.4B words) over one epoch and do not necessarily reflect the performance on a larger dataset.
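A sketch of what such sub-vector weighting might look like (illustrative names; p = 10 of D = 300 dimensions, as above):

```python
import numpy as np

def partial_cbow_forward(context_vectors, position_vectors, p=10):
    """Positionally re-weight only the first p dimensions of each context
    word vector; the remaining D - p dimensions are plainly averaged, so
    the extra work scales with p rather than with the full dimensionality."""
    out = context_vectors.mean(axis=0)
    out[:p] = (context_vectors[:, :p] * position_vectors[:, :p]).mean(axis=0)
    return out
```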
We held a seminar session yesterday regarding the use of word embeddings in information retrieval applications. |
The work from draft #2905 has been accepted as a journal paper, co-authored by @piskvorky, and is to appear in J.UCS 28:2. To conduct experiments for the paper, I have produced the high-level PInE library, which uses a fork of Gensim 3.8.3 for model training. If there is interest, we can rebase the fork on top of the current `develop` branch.
In Section 2.2 of the 2017 “Advances” paper by Mikolov et al., a position-dependent weighting is introduced into the context vector computation of the fastText CBOW model. On Common Crawl, this extension led to a 5-point improvement in total accuracy on the English word analogy task.
Despite the usefulness of the position-dependent weighting extension, it has not yet been implemented in Facebook's codebase. I propose to add this extension to Gensim's fastText, making it the first publicly available implementation of the extension.
Table 4, row CBOW, of the 2018 “Learning” paper by Grave et al. shows the word analogy task results of CBOW fastText models trained on Wikipedia for 10 languages.

These CBOW fastText models use the position-dependent weighting extension and the default parameters described in Section 4.3 of the 2017 “Enriching” paper by Bojanowski et al.: hash table bucket size 2×10⁶, 5 epochs, 300 vector dimensions, negative sampling loss with 5 negative samples, initial learning rate 0.05, sampling threshold 10⁻⁴, and window size 5. The correctness of our implementation can be checked by comparing against Grave's results (above).
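For illustration, training with those defaults in Gensim might look as follows, where `position_dependent_weights` is a hypothetical name for the proposed new parameter (all other parameters are gensim's standard fastText ones):

```python
from gensim.models import FastText

# `position_dependent_weights` is a hypothetical flag for the proposed
# extension; the remaining values are the Bojanowski et al. (2017),
# Section 4.3 defaults listed above.
model = FastText(
    corpus_file='wiki.en.txt',        # illustrative corpus path
    sg=0,                             # CBOW
    bucket=2_000_000,                 # hash table bucket size 2*10^6
    epochs=5,
    vector_size=300,
    negative=5,                       # negative sampling with 5 samples
    alpha=0.05,                       # initial learning rate
    sample=1e-4,                      # sampling threshold
    window=5,
    min_n=3, max_n=6,
    position_dependent_weights=1,     # hypothetical new parameter
)
```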