Add XTransformer backend #540
Conversation
Codecov Report (patch coverage):

    ##           master    #540      +/-  ##
    ==========================================
    - Coverage   99.61%   96.99%   -2.63%
    ==========================================
      Files          87       89       +2
      Lines        6034     6297     +263
    ==========================================
    + Hits         6011     6108      +97
    - Misses         23      189     +166

☔ View full report at Codecov.
This pull request introduces 1 alert when merging beb9ea9 into 82e55c6 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging d40c6bc into 82e55c6 - view on LGTM.com new alerts:
Force-pushed from 3cc99e4 to 637e0cf.
This pull request introduces 1 alert when merging 637e0cf into 82e55c6 - view on LGTM.com new alerts:
Thanks @mo-fu, great work! I like how you also enhanced
Sounds reasonable.
The punkt model for NLTK is currently handled in a similar way - it's normally placed under How about
So far we've tried to support all features in the container. So following that policy, we would need to add this to the container as well. But I guess it also depends on how much the container would grow? If it grows a lot (50% or more), then we could consider making a separate flavor of the container that includes XTransformer, or requiring the user to build their own container with XTransformer support if they need it. Does @juhoinkinen have a comment? Also, can you give an example of how to train and test this backend, for example with the tutorial data sets?
Review comment on annif/util.py (outdated):

    os.close(tempfd)
    final name. To save a directory explicitly set filename=None."""

    if filename:
There isn't that much shared code between the file vs. directory cases. I wonder if it would be better to leave the `atomic_save` function as it is (only supporting files, not directories) and add a new function `atomic_save_dir` which takes care of directories only. The implementations could be simpler, without so many if clauses.
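A minimal sketch of what a separate `atomic_save_dir` could look like, writing into a temporary sibling directory and renaming it into place only after the save succeeds. The function name comes from the suggestion above, but the signature and save protocol here are assumptions, not Annif's actual implementation:

```python
import os
import shutil
import tempfile

def atomic_save_dir(obj, dirname, method=None):
    """Hypothetical sketch: save obj into dirname atomically by saving
    into a temporary sibling directory first, then renaming it into
    place, so a crash mid-save never leaves a half-written directory."""
    parent = os.path.dirname(dirname)
    prefix = os.path.basename(dirname) + "-tmp-"
    tempdir = tempfile.mkdtemp(prefix=prefix, dir=parent)
    try:
        if method is not None:
            method(obj, tempdir)
        else:
            obj.save(tempdir)
        # Atomic on POSIX when tempdir and dirname are on the same
        # filesystem and dirname does not already exist (or is empty).
        os.replace(tempdir, dirname)
    except Exception:
        shutil.rmtree(tempdir, ignore_errors=True)
        raise
```

Keeping this separate from `atomic_save` indeed avoids the `if filename:` branching, at the cost of a little duplicated temp-file plumbing.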
Installing Pecos increases the Docker image size quite a lot: 1.5 GB -> 3.5 GB (the download size of PyTorch alone is nearly 900 MB). So it looks like it would be best to have a separate image that also includes Pecos. First I thought of making a ... But again, now I noticed that instead of selecting dependencies to install by commenting/uncommenting lines, build args could be used. I'll try out how that would work. However, should we offer an image flavor on quay.io that includes Pecos (with a ...)?
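Image flavours aside, at the Python level a heavy optional dependency is typically guarded at import time so that Annif still starts when it is absent. A hedged sketch (the module path is taken from the libpecos package and may differ between versions; this is not necessarily how the backend does it):

```python
# Guarded import of an optional heavy dependency: if pecos/PyTorch are
# not installed, the flag is False and the backend can raise a helpful
# error only when someone actually tries to use it.
try:
    from pecos.xmc.xtransformer.model import XTransformer  # noqa: F401
    XTRANSFORMER_AVAILABLE = True
except ImportError:
    XTRANSFORMER_AVAILABLE = False
```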
I'm sorry, I just caused a conflict in this PR by merging #544, which optimizes the startup time of Annif and contains a complete rewrite of
I see that you fixed the conflicts and also addressed some other suggestions @mo-fu - great!
Kudos, SonarCloud Quality Gate passed!
This was already implemented in PR #548 (and extended for spaCy models in PR #527), both released as part of 0.56 and documented in the wiki. So we already have the mechanism in place for customizing Docker images with build args.
Maybe - let's postpone that discussion a little bit.
A small status update: I'm trying to figure out good parameter values for the YSO data set from the tutorial repository. Unfortunately the suggest command is really slow, presumably because I run the parameter optimization on a GPU and copying the representation of every input individually takes a lot of time. I will try whether this changes when using the CPU. But maybe in the future it is possible to explore how parallel evaluation can be improved for backends that support it directly.
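The per-document slowdown described above is the classic motivation for batching: one model call (and one host-to-GPU transfer) per batch instead of per document. A hypothetical sketch, not the Annif API, with made-up function names:

```python
def suggest_batch(predict_fn, documents, batch_size=32):
    """Hypothetical batched suggest: slice the input into batches and
    invoke the model once per batch rather than once per document."""
    results = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        # predict_fn is assumed to accept a list and return one
        # result per input, preserving order.
        results.extend(predict_fn(batch))
    return results
```

With a GPU-backed model, larger batches amortize the per-call transfer overhead, which is exactly what the per-input copying above fails to do.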
Thanks for the update @mo-fu!
Sorry, I did not close this intentionally. I planned to test this (and maybe add batch processing, for which support has recently been implemented), so I merged current master to the PR branch, fixed conflicts, and pushed. GitHub showed very many commits that did not belong to this PR, and they could have been removed by switching the base branch to some other branch and then back (like in NatLibFi/FintoAI#12 (comment)). So I switched the base branch to api-i18n, but then the PR got closed automatically, and I cannot reopen it, apparently because the PR and base branches share no history ("mo-fu:master branch has no history in common with NatLibFi:api-i18n"). And the base branch of a closed PR cannot be changed... I tried to merge current master to api-i18n, but that did not help. Maybe api-i18n could be completely rewritten to allow the PR to be reopened.
As advised on SO, to reopen the PR I first tried to merge the current base branch into the PR branch with the special option: However, after that I cannot push to the branch anymore:

Previously I could push, e.g. the merge of current master with conflict fixes. Maybe it is not possible to push to a branch of a PR that is closed, or it is because of the

I created the branch mo-fu-master, which has the commits of mo-fu:master and my merge of master with conflict fixes. But a new PR cannot be opened from that branch, because
And actually the three-dot diff does not show anything, but the two-dot diff does.
I asked GitHub support whether they can switch the base branch back to master to allow reopening this PR.
GitHub support responded:
They answered a somewhat different question (about a merged PR) than the one I asked about, which is the current situation (automatically closed, apparently due to "unrelated histories" of the branches). At least I cannot push to this PR branch anymore, so unless mo-fu can somehow fix this, the other option is to open a new PR (there is the branch mo-fu-master in this Annif repository, from which a PR can now be opened, whereas yesterday it seemed not possible).
@juhoinkinen I added you as a collaborator to my fork. Maybe this helps. I can also try the |
Thanks @mo-fu, now I was able to do the ... I also tried to use the GitHub API to change the base branch back to master and re-open the PR, but with no luck; the response was: ... It seems there is no way this PR can be re-opened :( Or maybe rebasing the whole mo-fu:master branch on NatLibFi:api-i18n, so the branches would definitely have a common history...? But then it could be that GitHub just would not detect the change for this PR, because I think pulling api-i18n should already have made a common history. The easiest way to proceed would be to make a new PR from the commit "Merge branch 'master' of github.com:mo-fu/Annif into mo-fu-master", which has the (nearly) current master from NatLibFi merged (mo-fu:master could just be reset to that commit to remove the merge of api-i18n).
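For reference, the GitHub API attempt mentioned above corresponds to a PATCH on the pull-request resource. A hedged sketch that only builds the request (endpoint and field names follow GitHub's REST API for pulls; authentication and the HTTP call itself are omitted):

```python
import json

def build_update_pr_request(owner, repo, number, base=None, state=None):
    """Build the URL and JSON body for
    PATCH /repos/{owner}/{repo}/pulls/{number}."""
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}"
    payload = {}
    if base is not None:
        payload["base"] = base   # new base branch name
    if state is not None:
        payload["state"] = state  # "open" or "closed"
    return url, json.dumps(payload)
```

As seen above, the API refuses the state change for a PR whose branches share no history, so building a valid request is not sufficient on its own.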
Hi @mo-fu! Would you like to open a new PR for XTransformer, so that the GitHub history will credit it to you? No need to worry about the PR description; we can edit it. I could also open the PR from your fork if you don't want to bother with this.
This PR adds XTransformer as an optional backend to Annif. For now it does not yet use distilbert in the default configuration, as this is not yet available on PyPI.

The tests for the backend resort to mocking, as training would download a pretrained model of at least 500 MB.

Also, we should discuss cache directories. At the moment XTransformer will download models from the Hugging Face Hub to ~/.cache/huggingface. Is this behavior desired for Annif, or should the cache be placed in the data folder?

I also haven't modified the Docker container yet. When I installed pecos in a venv it required BLAS libraries, so these would probably have to be added to the container. Additionally, pecos will install the GPU-enabled PyTorch, which means the container size will grow. Therefore I wanted to check with you first before adding it.
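On the cache-directory question: the Hugging Face libraries read their cache location from environment variables, so one option is to redirect the cache into the Annif data directory before the model libraries are imported. A hedged sketch (`HF_HOME` is honoured by huggingface_hub; the helper name and data-directory layout are placeholders, not Annif code):

```python
import os

def redirect_hf_cache(datadir):
    """Hypothetical helper: point the Hugging Face download cache into
    the Annif data directory instead of ~/.cache/huggingface.
    Must run before transformers/pecos are imported."""
    cache_dir = os.path.join(datadir, "huggingface")
    os.makedirs(cache_dir, exist_ok=True)
    # HF_HOME is read by huggingface_hub; older transformers releases
    # also honour TRANSFORMERS_CACHE.
    os.environ["HF_HOME"] = cache_dir
    return cache_dir
```

Keeping the cache under the data folder would make downloaded models survive container rebuilds when that folder is a mounted volume, which may be the stronger argument either way.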