Fix non-deterministic test failures #2196

Merged: 1 commit merged into develop from pin-seed on Sep 24, 2018
Conversation

menshikh-iv (Contributor) commented Sep 23, 2018

Some of the non-deterministic tests fail frequently, and I spend a lot of time re-running them again and again (especially at release time, when we need to build gensim many times). The most frequent failures are:

  • TestFastTextModel.test_cbow_hs_online
  • TestWord2VecModel.test_cbow_hs_fromfile
  • TestDoc2VecModel.testLoadOldModel

These failures almost always happen on Python 3.

Here I pin PYTHONHASHSEED for testing purposes (I will do the same thing for the gensim-wheels repo too).
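
To illustrate the idea (a minimal sketch only; the actual CI configuration and test command may differ, and the pytest invocation below is just an assumption): PYTHONHASHSEED must be set before the Python interpreter starts, so one way to pin it is a small wrapper that re-runs the test command in a child process with the variable fixed.

```python
import os
import subprocess
import sys

def run_tests_with_pinned_hashseed(seed="0"):
    # PYTHONHASHSEED only takes effect at interpreter startup, so set it in the
    # environment of a child process rather than in the current one.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    # Hypothetical test invocation; the real CI command may differ.
    return subprocess.call([sys.executable, "-m", "pytest", "gensim/test"], env=env)

if __name__ == "__main__":
    raise SystemExit(run_tests_with_pinned_hashseed())
```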

CC: @gojomo @piskvorky, I hope you have no objections.

See also MacPython/gensim-wheels#12

@menshikh-iv menshikh-iv merged commit da5c603 into develop Sep 24, 2018
@menshikh-iv menshikh-iv deleted the pin-seed branch September 24, 2018 09:33
piskvorky (Owner) commented Sep 24, 2018

  • Pros: More reliable builds, faster releases, less confusion for maintainers and contributors investigating unrelated CI failures.
  • Cons: Moving away from functional tests (does the algorithm still work acceptably?) toward more menial bit-checking (the algorithm may be broken for a different seed, and the ground truth must be redefined with each algorithm change).

Overall, worth it IMO (pros >> cons).

Of course, ideally we'd want both: reliable AND functional. I still don't know why these tests are so flaky (not enough training data / epochs? too non-deterministic to be reliable?).

gojomo (Collaborator) commented Sep 25, 2018

I'd have to see the individual failures to be sure, but if it really is a small set of commonly-failing tests, they could probably be improved to be 'tighter', rather than forcing the whole environment into string-hash determinism. In particular, that it's two variations of the "cbow_hs" test commonly failing makes me very suspicious of that test's design.

And for TestDoc2VecModel.testLoadOldModel, I'm a bit more baffled - as there's no new training there, normal sources of randomness shouldn't risk changes in substantive results. It's loading all static models! There must be a true problem in the test, I'd think.

menshikh-iv (Contributor, Author) commented

@gojomo
About "individual failures": see an example - MacPython/gensim-wheels#12 (comment)

About determination - I agree that issue can be in test design itself (I'm almost sure) and current pinning is the workaround (not fix). If you have time to look into tests - this will be pretty nice (if we can fix problematic tests itself instead of "global pinning" - this will be great).

gojomo (Collaborator) commented Sep 25, 2018

I don't have time to dig deeper, but:

  • The TestFastTextModel.test_cbow_hs_online test looks really fishy - I wouldn't expect a single iteration over some tiny unit-test dataset to give very consistent results anywhere, and the test failure doesn't look like a "near miss" - -0.96 is nowhere near the target of 0.0.

  • The TestWord2VecModel.test_cbow_hs_fromfile looks like a near-miss that might happen rarely or be drivable to a negligible rate with slight tightening.

  • The TestDoc2VecModel.testLoadOldModel remains surprising, because a frozen model loaded from disk should give the same ranked results to any future most_similar() tests. It might be due to perfect ties in similarity value resulting in different ranked orders in differently-string-hashed dictionaries. If so, a perfect tie would be very odd in any sufficiently-trained/varied model, so perhaps any old model that includes such a tie should be replaced. But if not, another fix might be to sort most_similar() results by both similarity score and key lexicography (see the sketch after this list).
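
A minimal sketch of that tie-breaking idea (the function name and data below are hypothetical, not gensim's actual most_similar() internals): sort candidates by similarity first and by the key itself second, so perfect ties come back in a stable order regardless of hash/dict ordering.

```python
def ranked_most_similar(candidates, topn=10):
    # candidates: iterable of (word, similarity) pairs.
    # Sort by descending similarity, breaking exact ties lexicographically by word,
    # so the returned order does not depend on string-hash or dict iteration order.
    ordered = sorted(candidates, key=lambda pair: (-pair[1], pair[0]))
    return ordered[:topn]

# Ties on 0.9 resolve lexicographically: [('cat', 0.9), ('dog', 0.9)]
print(ranked_most_similar([("dog", 0.9), ("cat", 0.9), ("fish", 0.5)], topn=2))
```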

I'm not philosophically opposed to forcing determinism in some cases, having argued for it before when frustrated by jittery test results – but I've also seen other bugs that such determinism would've hidden. (In particular, some of the older flakiness in these tests was likely due to the longstanding bug in inference learning-rate-decay.) Here, if these tests are often failing under some build/test systems (like in this case the MacPython setup), but never failing in our own setup, then there's some other source of variance we may want to understand.
