[DNM] Fix FastText hash function incompatibility #2059 #2233
Conversation
This reverts commit 6c06fbc.
Hi, just thought I'd check on how this is going. Are you still actively working on this?

@mpenkov
I've discussed this with @menshikh-iv, and there is a backwards compatibility issue that we need to handle. If we just change the hash function, then we will be unable to load older models into gensim. What we need to do is:
The above mechanism should ensure that old models keep working with newer versions of gensim. Does that make sense?

@mpenkov cc: @menshikh-iv
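For illustration, the incompatibility and the dispatch idea can be sketched in pure Python. This is a sketch, not gensim's actual code; the names `ft_hash_broken` and `compatible_hash` come from later commit messages in this PR, while `ngram_bucket` is a hypothetical helper:

```python
def ft_hash_broken(string):
    """FNV-1a over Unicode code points: the old, FB-incompatible variant."""
    h = 2166136261
    for c in string:
        h = (h ^ ord(c)) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def ft_hash(string):
    """FNV-1a over UTF-8 bytes, sign-extending each byte to int8 first,
    matching Facebook fastText's uint32_t(int8_t(ch)) cast."""
    h = 2166136261
    for b in string.encode("utf-8"):
        if b >= 128:
            b -= 256  # emulate the int8_t cast
        h = (h ^ (b & 0xFFFFFFFF)) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def ngram_bucket(ngram, num_buckets, compatible_hash):
    """Pick the hash via a per-model flag, so models trained with the
    old hash keep resolving n-grams to the same buckets."""
    fn = ft_hash if compatible_hash else ft_hash_broken
    return fn(ngram) % num_buckets
```

The two functions agree on pure-ASCII n-grams (each code point equals its single UTF-8 byte, below 128) and diverge as soon as a non-ASCII character appears, which is why the bug chiefly affected non-English models.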
@aneesh-joshi don't worry, @mpenkov will do that. From you, @aneesh-joshi, I'm waiting for tests (sanity checks: comparing with FB FT output) and a detailed description of how you prepared this test (which FB FT version you used, how you ran it, etc.). Thanks for your help 👍
Tests are pushed. Summary of the test: I have added three txt files in …

The script …

Once the models are built, you should run … This runs a grep on the … If you use the current …

My changes: since gensim defaults to the Cython version and I didn't want to touch the Cython code just yet, I have intentionally raised an ImportError so that it falls back to the Python version. I know, bad practice! But this is just for testing. Most of the code here won't be merged. Further, I have changed the hash function as suggested in the corresponding issue. I have locally run …

However, I don't know whether my raising an ImportError affects this.
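The sanity check being discussed boils down to an element-wise tolerance comparison between the vectors gensim produces and the vectors Facebook's fastText prints for the same word. A minimal stdlib-only sketch (the function name and tolerance are my own, not from the PR):

```python
import math


def vectors_close(v1, v2, tol=1e-4):
    """True if two vectors match element-wise within an absolute
    tolerance, similar in spirit to numpy.allclose."""
    if len(v1) != len(v2):
        return False
    return all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=tol)
        for a, b in zip(v1, v2)
    )
```

Exact equality is too strict here because the two code paths accumulate floating-point error differently; the PR itself later switched its tests to `np.allclose` (see the commit "use np.allclose instead of array_equals").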
Thank you @aneesh-joshi 👍

@mpenkov @menshikh-iv What other work is pending for this PR?
@aneesh-joshi Nothing else is needed from your side (at least for now), thanks for your help 🔥

@menshikh-iv
Thanks @aneesh-joshi, you helped us a lot 👍. Superseded by #2313.
* WIP
* Handle incompatible float size condition
* update docstring
* move regression test to unit tests
* WIP
* introduced Tracker class
* added log examples
* initialize trainables weights when loading native model
* adding script to trigger bug
* minor documentation changes
* improve unit test
* retrained toy model

  ```
  $ ~/src/fastText-0.1.0/fasttext cbow -input toy-data.txt -output toy-model -bucket 100
  Read 0M words
  Number of words: 22
  Number of labels: 0
  Progress: 100.0% words/sec/thread: 209 lr: 0.000000 loss: 4.100698 eta: 0h0m -14m
  ```

* update bucket parameter in unit test
* update unit test
* WIP
* retrain model with a smaller dimensionality (this will make it easier to debug manually)

  ```
  $ ~/src/fastText-0.1.0/fasttext cbow -input toy-data.txt -output toy-model -bucket 100 -dim 5
  Read 0M words
  Number of words: 22
  Number of labels: 0
  Progress: 100.0% words/sec/thread: 199 lr: 0.000000 loss: 0.000000 eta: 0h0m
  ```

* git add docs/fasttext-notes.md
* adding some comments and fixmes
* minor refactoring, update tests
* update notes
* update notes
* initialize wv.vectors_vocab
* init vectors_vocab properly
* add test_sanity_vectors
* no longer segfaulting
* adding tests for in-vocab out-of-vocab words
* removing old test: it cannot pass by design. Training is non-deterministic, so conditions must be tightly controlled to guarantee reproducibility, and that is too much effort for a unit test
* fix typo in test, reduce tolerance
* update test_continuation, it now fails
* test continued training with gensim model
* compare vectors_ngrams before and after
* disable test reruns for now
* set min_count=0
* initialize wv.buckets_word prior to continuing training. This avoids a null dereference that could previously be reproduced with: `python -c "from gensim.test.test_fasttext;import NativeTrainingContinuationTest as A;A().test_continuation_gensim()"`
* making all tests pass
* add bucket param to FastTextKeyedVectors constructor
* minor refactoring: split out _load_vocab function
* minor refactoring: split out _load_trainables method
* removing Tracker class: it was for debugging only
* remove debugging print statement
* docstring fixes
* remove FIXME, leave this function alone
* add newlines at the end of docstrings
* remove comment
* re-enable test reruns in tox.ini
* remove print statements from tests
* git rm trigger.py
* refactor FB model loading code: move the lower-level FB model loading code to a new module, implement an alternative, simpler _load_fast_text_format function, and add unit tests to compare the alternative and existing implementations
* fix bug with missing ngrams (still need cleanup of hash2index & testing)
* fix cython implementation of _ft_hash (based on #2233)
* decrease tolerances in unit tests
* add test case for native models and hashes
* add working/broken hash implementations for py/cy and tests
* minor fixup around hashes
* add oov test
* adding hash compatibility tests for FastText model
* git rm gensim.xml native.xml
* minor fix in comment
* refactoring: extract _pad_random and _pad_ones functions
* deprecate struct_unpack public method
* refactoring: get rid of init_ngrams_weights method
* refactoring: move matrix init to FastTextKeyedVectors
* refactoring: move init_ngrams_post_load method to FastTextKeyedVectors
* refactoring: move trainables.get_vocab_word_vecs to wv.calculate_vectors
* refactoring: simplify reset_ngrams_weights method
* refactoring: improve separation of concerns between model and vectors
* refactoring: improve separation of concerns between model and vectors
* refactoring: remove unused vectors_vocab_norm attribute
* review response: update ft_hash_broken comment
* review response: revert changes to broken hash function
* review response: handle .bucket backwards compatibility
* review response: adjust warning text
* tox -e flake8
* tox -e flake8-docs
* review response: store .compatible_hash in vectors only
* Revert "refactoring: remove unused vectors_vocab_norm attribute". This reverts commit 07c84f5. We have to worry about backwards compatibility if we remove this attribute, and it's not worth doing that as part of this PR
* review response: remove critical log comments
* review response: fix docstring in fasttext_bin.py. Also ran `python -m doctest gensim/models/fasttext_bin.py` to check the docstring is correctly executable
* review response: make fasttext_bin an internal module
* review response: skip cython tests if cython is disabled
* review response: use np.allclose instead of array_equals
* refactoring: simplify ngrams_weights matrix init
* fixup: remove unused vectors_lockf attribute
* fixup in _load_fasttext_model function
* minor refactoring in unit tests
* adjust unit test: vectors_lockf is only for word2vec; the FastText implementation uses vectors_ngrams_lockf and vectors_vocab_lockf only
* temporarily disabling some assertions in tests
* document vectors_vocab_lockf and vectors_ngrams_lockf
* refactoring: further simplify growth of _lockf matrices
* remove outdated comments
* fix deprecation warnings
* improve documentation for FastTextKeyedVectors
* refactoring: extract L2 norm functions
* add LoadFastTextFormatTest
* refactoring: remove old FB I/O code
* refactoring: FastTextKeyedVectors.init_post_load method
* refactoring: simplify init_post_load method
* refactoring: simplify init_post_load method
* refactoring: simplify calculate_vectors, rename to adjust_vectors
* refactoring: simplify _lockf init
* remove old tests
* tox -e flake8
* fixup: introduce OrderedDict to _fasttext_bin.py. The order of the words matters; in the previous implementation this was maintained explicitly via the index2word list, but using an OrderedDict achieves the same thing. The main idea is that we iterate over the vocab terms in the right order in the prepare_vocab function
* add unicode prefixes to literals for Py2.7 compatibility
* more Py2.7 compatibility stuff
* refactoring: extract _process_fasttext_vocab function
* still more Py2.7 compatibility stuff
* adding additional assertion
* re-enable disabled assertions
* delete out of date comment
* Revert "re-enable disabled assertions". This reverts commit 01d84d1
* more work on init_post_load function, update unit tests
* update unit tests
* review response: remove FastTextVocab class, keep alias
* review response: simplify _l2_norm_inplace function
* review response: add docstring
* review response: update docstring
* review response: move import
* review response: adjust skip message
* review response: add test_hash_native
* review response: explain how model was generated
* review response: explain how expected values were generated
* review response: add test for long OOV word
* review response: remove unused comments
* review response: remove comment
* add test_continuation_load_gensim
* update model using gensim 3.6.0
* review response: get rid of struct_unpack. This is an internal method masquerading as a public one; there is no reason for anyone to call it. Removing it will have no effect on pickling/unpickling, as methods do not get serialized, so removing it is safe
* review response: implement handling for zero bucket edge case
* review response: add test_save_load
* review response: add test_save_load_native
* workaround appveyor tempfile issue
* fix tests
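Several of the commits above (the `bucket` parameter, `wv.buckets_word`, the OOV tests) deal with how out-of-vocabulary words map onto the n-gram bucket table. The mechanics can be sketched as follows: fastText wraps each word in `<` and `>`, takes all character n-grams of length `minn`..`maxn`, and hashes each n-gram into one of `bucket` rows of the n-gram matrix (via `ft_hash(ngram) % bucket`). A simplified illustration of the n-gram extraction step, not gensim's actual code:

```python
def compute_ngrams(word, minn=3, maxn=6):
    """Return all character n-grams of the '<word>'-wrapped token,
    from length minn up to length maxn inclusive."""
    extended = "<" + word + ">"
    return [
        extended[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(extended) - n + 1)
    ]
```

For example, `compute_ngrams("cat", 3, 3)` yields `["<ca", "cat", "at>"]`. Because the bucket index depends on the hash value, any change to the hash function silently re-shuffles which matrix row each n-gram lands in, which is exactly why old models need the old hash preserved.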
Note: this is WIP. More thought needs to go into everything.
From a preliminary glance, this looks like it might work, but it's failing some unit tests.
Changes will be made as more feedback comes in.
Addresses: #2059