Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue290 upgrade fasttext #292

Closed
wants to merge 7 commits into from

Conversation

mvsjober
Copy link
Contributor

@mvsjober mvsjober commented Jul 5, 2019

Upgrade fastText back to official PyPI version, and enable the pretrained vectors feature.

Closes #290

@codecov
Copy link

codecov bot commented Jul 5, 2019

Codecov Report

Merging #292 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #292      +/-   ##
==========================================
+ Coverage   99.36%   99.39%   +0.02%     
==========================================
  Files          55       55              
  Lines        2845     3130     +285     
==========================================
+ Hits         2827     3111     +284     
- Misses         18       19       +1
Impacted Files Coverage Δ
annif/backend/fasttext.py 98.63% <100%> (ø) ⬆️
tests/test_backend_fasttext.py 100% <100%> (ø) ⬆️
tests/conftest.py 100% <100%> (ø) ⬆️
annif/cli.py 99.23% <0%> (-0.23%) ⬇️
tests/test_cli.py 100% <0%> (ø) ⬆️
annif/backend/vw_multi.py 100% <0%> (ø) ⬆️
tests/test_project.py 100% <0%> (ø) ⬆️
annif/project.py 100% <0%> (ø) ⬆️
annif/backend/tfidf.py 100% <0%> (ø) ⬆️
annif/backend/ensemble.py 100% <0%> (ø) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2ea1130...ce5062d. Read the comment docs.

@osma
Copy link
Member

osma commented Jul 5, 2019

Thanks for the PR!

It seems that the fastText tests are not running under Travis CI, which leads to low test coverage. I'm guessing that the reason is that in .travis.yml, fasttextmirror is still being installed. Can you try changing that too?

@osma osma added this to the Short term milestone Jul 5, 2019
@mvsjober
Copy link
Contributor Author

mvsjober commented Jul 5, 2019

I fixed the installation in .travis.yml. BTW, also this page needs to be updated in the wiki: https://github.com/NatLibFi/Annif/wiki/Optional-features-and-dependencies#fasttext-backend - what is the process for that?

@osma
Copy link
Member

osma commented Jul 5, 2019

Thanks, much better now! The tests seem to be running fine after the change.

The wiki page can simply be edited by anyone, including you. There is a bit of a synchronization challenge here - ideally the wiki would be changed at the same time the PR is merged. Perhaps we could simply drop the alternative recipe (pip install fasttextmirror or pip install fasttext) and recommend pip install annif[fasttext] instead - I think it should do the right thing (install either fasttextmirror or fasttext, depending on which version of Annif has been installed) but I'm a little bit unsure whether it uses the dependency information from the locally installed Annif or the one available on PyPI, which could be an older or newer version.

Do you think it would be easy to add a unit test for the pretrained vectors feature? It would require adding a file with some pretrained vectors and then adding a test (in tests/test_backend_fasttext.py) that uses it. There is some toy data under tests/corpora/archaeology that could perhaps be used to create the vectors. Preferably the vectors file should be small (at most a few MB) so it doesn't bloat the repo. The idea would be to verify that it loads without error - quality of classification results is not important.

@mvsjober
Copy link
Contributor Author

mvsjober commented Jul 5, 2019

I thought about making a test but dismissed the idea as I would have to add a huge file somehow, but a toy data test I can definitely do.

@mvsjober
Copy link
Contributor Author

mvsjober commented Jul 5, 2019

There is still one problem that needs to be addressed before we can merge. With some test sets I get a ValueError:

  File "/homeappl/home/mvsjober/appl_taito/Annif/annif/backend/fasttext.py", line 119, in _suggest_chunks
    chunktexts, project, limit)
  File "/homeappl/home/mvsjober/appl_taito/Annif/annif/backend/fasttext.py", line 114, in _predict_chunks
    for chunktext in chunktexts])), limit)
  File "/homeappl/home/mvsjober/.local/share/virtualenvs/Annif-YopBXEME/lib/python3.5/site-packages/fasttext/FastText.py", line 151, in predict
    result_as_pair = np.array(predictions, dtype=dt)
ValueError: setting an array element with a sequence

This appears to be a bug in the upstream fastText module. The line result_as_pair = np.array(predictions, dtype=dt) expects predictions to be a list of of predictions for each text, each item should be another list of equal length. However as we pass both k=1000 and threshold=0.0 to multilinePredict() it can happen that it sometimes returns less that k=1000 items if there are less than k over the threshold.

@mvsjober
Copy link
Contributor Author

mvsjober commented Jul 8, 2019

Bug reported to fastText: facebookresearch/fastText#846

@mvsjober
Copy link
Contributor Author

mvsjober commented Oct 2, 2019

Seems the bug has now been fixed upstream, but we should probably wait until there's an official release of fastText with the bugfix included.

@osma
Copy link
Member

osma commented Oct 2, 2019

Agree that we should wait for a PyPI release.
There are some merge conflicts that need to be fixed. I bet these are related to pinning specific versions of libraries. If that's the case they should be trivial to fix.

@osma osma modified the milestones: Short term, 0.48 May 11, 2020
@osma osma self-assigned this May 13, 2020
@osma
Copy link
Member

osma commented May 13, 2020

fastText 0.9.2 has been released on PyPI: https://pypi.org/project/fasttext/

Meanwhile some merge conflicts have emerged, I will try to sort them out so that we can merge this for the next release 0.48.

@osma
Copy link
Member

osma commented May 13, 2020

I couldn't figure out how to update this PR. I tried to push to the original PR branch on CSCfi/Annif but got a "permission denied" error, even though the PR seems to allow pushing by upstream users. Anyway, I created a new PR #409 instead.

@osma osma closed this May 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Upgrade fastText back to official PyPI version
2 participants