[DNM] Fix FastText hash function incompatibility #2059 #2233
Conversation
This reverts commit 6c06fbc.
Hi, just thought I'd check on how this is going. Are you still actively working on this?

@mpenkov
I've discussed this with @menshikh-iv, and there is a backwards compatibility issue that we need to handle. If we just change the hash function, then we will be unable to load older models into gensim. What we need to do is:
The above mechanism should ensure that old models keep working with newer versions of gensim. Does that make sense?

@mpenkov cc: @menshikh-iv
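For illustration, the incompatibility and the dispatch idea can be sketched in pure Python. This is a sketch, not gensim's actual code; the names `ft_hash_broken` and `compatible_hash` come from later commit messages in this PR, while `ngram_bucket` is a hypothetical helper:

```python
def ft_hash_broken(string):
    """FNV-1a over Unicode code points: the old, FB-incompatible variant."""
    h = 2166136261
    for c in string:
        h = (h ^ ord(c)) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def ft_hash(string):
    """FNV-1a over UTF-8 bytes, sign-extending each byte to int8 first,
    matching Facebook fastText's uint32_t(int8_t(ch)) cast."""
    h = 2166136261
    for b in string.encode("utf-8"):
        if b >= 128:
            b -= 256  # emulate the int8_t cast
        h = (h ^ (b & 0xFFFFFFFF)) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def ngram_bucket(ngram, num_buckets, compatible_hash):
    """Pick the hash via a per-model flag, so models trained with the
    old hash keep resolving n-grams to the same buckets."""
    fn = ft_hash if compatible_hash else ft_hash_broken
    return fn(ngram) % num_buckets
```

The two functions agree on pure-ASCII n-grams (each code point equals its single UTF-8 byte, below 128) and diverge as soon as a non-ASCII character appears, which is why the bug chiefly affected non-English models.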
@aneesh-joshi don't worry, @mpenkov will do that. From you, @aneesh-joshi, I'm waiting for tests (sanity checks: comparing with FB FT output) and a detailed description of how you prepared this test (which FB FT version you used, how you ran it, etc.). Thanks for your help 👍
Tests are pushed. Summary of the test: I have added three txt files in …

The script …

Once the models are built, you should run … This runs a grep on the … If you use the current …

My changes: since gensim defaults to the Cython version and I didn't want to touch the Cython code just yet, I have intentionally raised an ImportError so that it falls back to the Python version. I know, bad practice! But this is just for testing. Most of the code here won't be merged. Further, I have changed the hash function as suggested in the corresponding issue. I have locally run …

However, I don't know whether my raising an ImportError affects this.
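The sanity check being discussed boils down to an element-wise tolerance comparison between the vectors gensim produces and the vectors Facebook's fastText prints for the same word. A minimal stdlib-only sketch (the function name and tolerance are my own, not from the PR):

```python
import math


def vectors_close(v1, v2, tol=1e-4):
    """True if two vectors match element-wise within an absolute
    tolerance, similar in spirit to numpy.allclose."""
    if len(v1) != len(v2):
        return False
    return all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=tol)
        for a, b in zip(v1, v2)
    )
```

Exact equality is too strict here because the two code paths accumulate floating-point error differently; the PR itself later switched its tests to `np.allclose` (see the commit "use np.allclose instead of array_equals").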
Thank you @aneesh-joshi 👍

@mpenkov @menshikh-iv What other work is pending for this PR?
@aneesh-joshi Nothing else is needed from your side (at least for now), thanks for your help 🔥

@menshikh-iv
Thanks @aneesh-joshi, you helped us a lot 👍. Superseded by #2313.
* WIP
* Handle incompatible float size condition
* update docstring
* move regression test to unit tests
* WIP
* introduced Tracker class
* added log examples
* initialize trainables weights when loading native model
* adding script to trigger bug
* minor documentation changes
* improve unit test
* retrained toy model

  ```
  $ ~/src/fastText-0.1.0/fasttext cbow -input toy-data.txt -output toy-model -bucket 100
  Read 0M words
  Number of words: 22
  Number of labels: 0
  Progress: 100.0% words/sec/thread: 209 lr: 0.000000 loss: 4.100698 eta: 0h0m -14m
  ```

* update bucket parameter in unit test
* update unit test
* WIP
* retrain model with a smaller dimensionality (this will make it easier to debug manually)

  ```
  $ ~/src/fastText-0.1.0/fasttext cbow -input toy-data.txt -output toy-model -bucket 100 -dim 5
  Read 0M words
  Number of words: 22
  Number of labels: 0
  Progress: 100.0% words/sec/thread: 199 lr: 0.000000 loss: 0.000000 eta: 0h0m
  ```

* git add docs/fasttext-notes.md
* adding some comments and fixmes
* minor refactoring, update tests
* update notes
* update notes
* initialize wv.vectors_vocab
* init vectors_vocab properly
* add test_sanity_vectors
* no longer segfaulting
* adding tests for in-vocab out-of-vocab words
* removing old test: it cannot pass by design. Training is non-deterministic, so conditions must be tightly controlled to guarantee reproducibility, and that is too much effort for a unit test
* fix typo in test, reduce tolerance
* update test_continuation, it now fails
* test continued training with gensim model
* compare vectors_ngrams before and after
* disable test reruns for now
* set min_count=0
* initialize wv.buckets_word prior to continuing training. This avoids a null dereference that could previously be reproduced with: `python -c "from gensim.test.test_fasttext;import NativeTrainingContinuationTest as A;A().test_continuation_gensim()"`
* making all tests pass
* add bucket param to FastTextKeyedVectors constructor
* minor refactoring: split out _load_vocab function
* minor refactoring: split out _load_trainables method
* removing Tracker class: it was for debugging only
* remove debugging print statement
* docstring fixes
* remove FIXME, leave this function alone
* add newlines at the end of docstrings
* remove comment
* re-enable test reruns in tox.ini
* remove print statements from tests
* git rm trigger.py
* refactor FB model loading code: move the lower-level FB model loading code to a new module, implement an alternative, simpler _load_fast_text_format function, and add unit tests to compare the alternative and existing implementations
* fix bug with missing ngrams (still need cleanup of hash2index & testing)
* fix cython implementation of _ft_hash (based on #2233)
* decrease tolerances in unit tests
* add test case for native models and hashes
* add working/broken hash implementations for py/cy and tests
* minor fixup around hashes
* add oov test
* adding hash compatibility tests for FastText model
* git rm gensim.xml native.xml
* minor fix in comment
* refactoring: extract _pad_random and _pad_ones functions
* deprecate struct_unpack public method
* refactoring: get rid of init_ngrams_weights method
* refactoring: move matrix init to FastTextKeyedVectors
* refactoring: move init_ngrams_post_load method to FastTextKeyedVectors
* refactoring: move trainables.get_vocab_word_vecs to wv.calculate_vectors
* refactoring: simplify reset_ngrams_weights method
* refactoring: improve separation of concerns between model and vectors
* refactoring: improve separation of concerns between model and vectors
* refactoring: remove unused vectors_vocab_norm attribute
* review response: update ft_hash_broken comment
* review response: revert changes to broken hash function
* review response: handle .bucket backwards compatibility
* review response: adjust warning text
* tox -e flake8
* tox -e flake8-docs
* review response: store .compatible_hash in vectors only
* Revert "refactoring: remove unused vectors_vocab_norm attribute". This reverts commit 07c84f5. We have to worry about backwards compatibility if we remove this attribute, and it's not worth doing that as part of this PR
* review response: remove critical log comments
* review response: fix docstring in fasttext_bin.py. Also ran `python -m doctest gensim/models/fasttext_bin.py` to check the docstring is correctly executable
* review response: make fasttext_bin an internal module
* review response: skip cython tests if cython is disabled
* review response: use np.allclose instead of array_equals
* refactoring: simplify ngrams_weights matrix init
* fixup: remove unused vectors_lockf attribute
* fixup in _load_fasttext_model function
* minor refactoring in unit tests
* adjust unit test: vectors_lockf is only for word2vec; the FastText implementation uses vectors_ngrams_lockf and vectors_vocab_lockf only
* temporarily disabling some assertions in tests
* document vectors_vocab_lockf and vectors_ngrams_lockf
* refactoring: further simplify growth of _lockf matrices
* remove outdated comments
* fix deprecation warnings
* improve documentation for FastTextKeyedVectors
* refactoring: extract L2 norm functions
* add LoadFastTextFormatTest
* refactoring: remove old FB I/O code
* refactoring: FastTextKeyedVectors.init_post_load method
* refactoring: simplify init_post_load method
* refactoring: simplify init_post_load method
* refactoring: simplify calculate_vectors, rename to adjust_vectors
* refactoring: simplify _lockf init
* remove old tests
* tox -e flake8
* fixup: introduce OrderedDict to _fasttext_bin.py. The order of the words matters; in the previous implementation this was maintained explicitly via the index2word list, but using an OrderedDict achieves the same thing. The main idea is that we iterate over the vocab terms in the right order in the prepare_vocab function
* add unicode prefixes to literals for Py2.7 compatibility
* more Py2.7 compatibility stuff
* refactoring: extract _process_fasttext_vocab function
* still more Py2.7 compatibility stuff
* adding additional assertion
* re-enable disabled assertions
* delete out of date comment
* Revert "re-enable disabled assertions". This reverts commit 01d84d1
* more work on init_post_load function, update unit tests
* update unit tests
* review response: remove FastTextVocab class, keep alias
* review response: simplify _l2_norm_inplace function
* review response: add docstring
* review response: update docstring
* review response: move import
* review response: adjust skip message
* review response: add test_hash_native
* review response: explain how model was generated
* review response: explain how expected values were generated
* review response: add test for long OOV word
* review response: remove unused comments
* review response: remove comment
* add test_continuation_load_gensim
* update model using gensim 3.6.0
* review response: get rid of struct_unpack. This is an internal method masquerading as a public one; there is no reason for anyone to call it. Removing it will have no effect on pickling/unpickling, as methods do not get serialized, so removing it is safe
* review response: implement handling for zero bucket edge case
* review response: add test_save_load
* review response: add test_save_load_native
* workaround appveyor tempfile issue
* fix tests
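Several of the commits above (the `bucket` parameter, `wv.buckets_word`, the OOV tests) deal with how out-of-vocabulary words map onto the n-gram bucket table. The mechanics can be sketched as follows: fastText wraps each word in `<` and `>`, takes all character n-grams of length `minn`..`maxn`, and hashes each n-gram into one of `bucket` rows of the n-gram matrix (via `ft_hash(ngram) % bucket`). A simplified illustration of the n-gram extraction step, not gensim's actual code:

```python
def compute_ngrams(word, minn=3, maxn=6):
    """Return all character n-grams of the '<word>'-wrapped token,
    from length minn up to length maxn inclusive."""
    extended = "<" + word + ">"
    return [
        extended[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(extended) - n + 1)
    ]
```

For example, `compute_ngrams("cat", 3, 3)` yields `["<ca", "cat", "at>"]`. Because the bucket index depends on the hash value, any change to the hash function silently re-shuffles which matrix row each n-gram lands in, which is exactly why old models need the old hash preserved.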
Note: this is WIP. More thought needs to go into everything.
From a preliminary glance, this looks like it might work, but it's failing some unit tests.
Changes will be made as more feedback comes in.
Addresses: #2059