
Add XTransformer backend #540

Closed
mo-fu wants to merge 2,623 commits into NatLibFi:api-i18n from mo-fu:master

Conversation

mo-fu
Contributor

mo-fu commented Dec 6, 2021

This PR adds XTransformer as an optional backend to Annif. For now it does not yet use DistilBERT in the default configuration, as this is not yet available on PyPI.
The tests for the backend resort to mocking, as training would download a pretrained model of at least 500 MB.
We should also discuss cache directories. At the moment xtransformer will download models from the Hugging Face Hub to ~/.cache/huggingface. Is this behavior desired for Annif, or should the cache be placed in the data folder?
I also haven't modified the Docker container yet. When I installed pecos in a venv it required BLAS libraries, so these would probably have to be added to the container. Additionally, pecos installs the GPU-enabled PyTorch, meaning the container size will grow. Therefore I wanted to check with you first before adding it.
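
For illustration, a project using the new backend could be configured along these lines in projects.cfg (the project name and parameter values are hypothetical examples, not defaults shipped with this PR):

[yso-xtransformer-en]
name=YSO XTransformer English
language=en
backend=xtransformer
vocab=yso
analyzer=snowball(english)

Training and evaluation then use the ordinary CLI, e.g. annif train yso-xtransformer-en <training-corpus> followed by annif eval yso-xtransformer-en <test-documents>.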

codecov bot commented Dec 6, 2021

Codecov Report

Patch coverage: 38.51% and project coverage change: -2.63 ⚠️

Comparison is base (48b23f7) 99.61% compared to head (367e493) 96.99%.

❗ Current head 367e493 differs from the pull request's most recent head aa96ebc. Consider uploading reports for the commit aa96ebc to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #540      +/-   ##
==========================================
- Coverage   99.61%   96.99%   -2.63%     
==========================================
  Files          87       89       +2     
  Lines        6034     6297     +263     
==========================================
+ Hits         6011     6108      +97     
- Misses         23      189     +166     
Impacted Files                        Coverage Δ
tests/test_backend_xtransformer.py    9.27% <9.27%> (ø)
annif/backend/xtransformer.py         12.50% <12.50%> (ø)
annif/backend/__init__.py             98.30% <83.33%> (-1.70%) ⬇️
annif/backend/fasttext.py             100.00% <100.00%> (ø)
annif/backend/omikuji.py              97.36% <100.00%> (-0.10%) ⬇️
annif/backend/stwfsa.py               100.00% <100.00%> (ø)
annif/util.py                         98.71% <100.00%> (+0.41%) ⬆️
tests/test_backend.py                 100.00% <100.00%> (ø)
tests/test_util.py                    100.00% <100.00%> (ø)


lgtm-com bot commented Dec 6, 2021

This pull request introduces 1 alert when merging beb9ea9 into 82e55c6 - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

lgtm-com bot commented Dec 6, 2021

This pull request introduces 1 alert when merging d40c6bc into 82e55c6 - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

mo-fu force-pushed the master branch 2 times, most recently from 3cc99e4 to 637e0cf on December 6, 2021 17:07
lgtm-com bot commented Dec 6, 2021

This pull request introduces 1 alert when merging 637e0cf into 82e55c6 - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

osma
Member

osma commented Dec 17, 2021

Thanks @mo-fu, great work! I like how you also enhanced atomic_save and added apply_param_parse_config, which you then used in the existing backends as well.

The tests for the backend resort to mocking as training would download a pretrained model of size at least 500 mb.

Sounds reasonable.

At the moment xtransformer will download models from the Hugging Face Hub to ~/.cache/huggingface. Is this behavior desired for Annif, or should the cache be placed in the data folder?

The punkt model for NLTK is currently handled in a similar way - it's normally placed under ~/nltk_data/. That's maybe OK for NLTK (it's only 49M) but I agree that it could make sense to place downloaded models under the data directory, as you suggest. It's also a bit similar for the new spaCy analyzer (#374) which also needs to store downloaded models and those can be quite large.

How about data/model_cache or something along those lines? Then that could be further split up with subdirectories named by type/package, for example huggingface, spacy, nltk etc.
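
The redirection itself should be straightforward; a minimal sketch in Python, assuming the TRANSFORMERS_CACHE environment variable honored by the transformers library and the model_cache layout proposed above (the function name is hypothetical):

import os

def set_huggingface_cache(datadir):
    """Point the Hugging Face model cache into the Annif data directory.
    Must be called before transformers downloads any models."""
    cache_dir = os.path.join(datadir, "model_cache", "huggingface")
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["TRANSFORMERS_CACHE"] = cache_dir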

I also haven't modified the Docker container yet. When I installed pecos in a venv it required BLAS libraries, so these would probably have to be added to the container. Additionally, pecos installs the GPU-enabled PyTorch, meaning the container size will grow. Therefore I wanted to check with you first before adding it.

So far we've tried to support all features in the container. So following that policy, we would need to add this to the container as well. But I guess it also depends on how much the container would grow? If it's very much (50% or more) then we could consider making a separate flavor of the container that includes XTransformer, or requiring the user to build their own container with XTransformer support if they need it. Does @juhoinkinen have a comment?

Also, can you give an example of how to train and test this backend, for example with the tutorial data sets?

annif/util.py Outdated
os.close(tempfd)
final name. To save a directory explicitly set filename=None."""

if filename:
Member

There isn't that much shared code between the file vs. directory cases. I wonder if it would be better to leave the atomic_save function as it is (only supporting files, not directories) and add a new function atomic_save_dir which takes care of directories only. The implementations could be simpler, without so many if clauses.
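
Something along these lines, as a rough sketch only (assuming the same save-to-temp-then-rename approach as the current atomic_save; the signature is illustrative, not the one in this PR):

import os
import shutil
import tempfile

def atomic_save_dir(obj, dirname, method=None):
    """Save obj into a temporary directory next to dirname, then move the
    temporary directory into place, so readers never see a partial save."""
    parent = os.path.dirname(os.path.normpath(dirname)) or "."
    tempdir = tempfile.mkdtemp(dir=parent)
    try:
        if method is not None:
            method(obj, tempdir)
        else:
            obj.save(tempdir)
        if os.path.exists(dirname):
            shutil.rmtree(dirname)
        os.rename(tempdir, dirname)
    except Exception:
        shutil.rmtree(tempdir, ignore_errors=True)
        raise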

juhoinkinen
Member

Installing pecos increases the Docker image size quite a lot: 1.5 GB -> 3.5 GB (the download size of PyTorch alone is nearly 900 MB).

So it looks like it would be best to have a separate image that also includes pecos.

At first I thought of making a Dockerfile.custom that a user could customize and build to get an image with only the optional backends they need, possibly including XTransformer. But actually the default Dockerfile is already quite easy to customize (the optional backends are installed on just one line, except for fastText, which is built in the builder stage). It would be simpler to just have a commented-out line with pip install .[pecos] in the default Dockerfile. (The wiki could have instructions for removing the other optional backends if they are not needed.)

But now I noticed that instead of selecting the dependencies to install by commenting/uncommenting lines, build args could be used. I'll try out how that would work.
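
For concreteness, a minimal sketch of the build-arg idea (the base image, stage layout and variable name here are illustrative assumptions, not the actual Annif Dockerfile):

FROM python:3.8-slim
# which optional backends to install; overridable at build time
ARG optional_dependencies="omikuji,yake"
COPY . /Annif
WORKDIR /Annif
RUN pip install ".[${optional_dependencies}]"

A pecos-enabled image could then be built with docker build --build-arg optional_dependencies="omikuji,yake,pecos" . without editing the Dockerfile.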

However, should we offer an image flavor in quay.io that includes Pecos (with a pecos tag or similar)?

osma
Member

osma commented Dec 20, 2021

I'm sorry, I just caused a conflict in this PR by merging #544, which optimizes the startup time of Annif and contains a complete rewrite of annif/backend/__init__.py. You will need to adjust your code accordingly. The pattern should be quite obvious - just remember to:

  • keep alphabetical order (xtransformer should be the penultimate backend, just before yake)
  • look at the code for the other optional backends (fasttext, omikuji, nn_ensemble, yake) and use the same approach (see the sketch after this list)
  • add a unit test for the missing package into tests/test_backend.py
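
For reference, the pattern for an optional backend looks roughly like this (a sketch based on the description above; the exact function and registry names in annif/backend/__init__.py may differ):

def _xtransformer():
    try:
        from . import xtransformer
        return xtransformer.XTransformerBackend
    except ImportError:
        # raised when the optional pecos dependency is not installed
        raise ValueError("XTransformer not available, not enabling xtransformer backend")

plus a registry entry that keeps xtransformer in alphabetical order just before yake, and a test in tests/test_backend.py that checks for the ValueError when the package is missing.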

osma
Member

osma commented Jan 18, 2022

I see that you fixed the conflicts and also addressed some other suggestions @mo-fu - great!
Could you also give an example:

Also, can you give an example of how to train and test this backend, for example with the tutorial data sets?

sonarcloud bot commented Feb 4, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

osma
Member

osma commented Feb 4, 2022

But now I noticed that instead of selecting the dependencies to install by commenting/uncommenting lines, build args could be used. I'll try out how that would work.

This was already implemented in PR #548 (and extended for spaCy models in PR #527), both released as part of 0.56 and documented in the wiki. So we already have the mechanism in place for customizing Docker images with build args.

However, should we offer an image flavor in quay.io that includes Pecos (with a pecos tag or similar)?

Maybe - let's postpone that discussion a little bit.

mo-fu
Contributor (Author)

mo-fu commented Mar 14, 2022

A small status update: I am trying to figure out good parameter values for the YSO data set from the tutorial repository. Unfortunately the suggest command is really slow, presumably because I run the parameter optimization on a GPU and copying the representation of every input individually takes a lot of time. I will check whether this changes when using the CPU. But maybe in the future it would be possible to explore how parallel evaluation can be improved for backends that support it directly.

sonarcloud bot commented Mar 14, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

osma mentioned this pull request Mar 14, 2022
osma
Member

osma commented Mar 14, 2022

Thanks for the update @mo-fu !
It makes a lot of sense to do parallel evaluation (and other operations as well) directly with backends that support many documents at a time. I've had similar thoughts about other backends that could do this, but the limiting factor is the current suggest method, which only supports one document at a time. I created a new issue #579 for implementing a batched version of suggest.
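
As a rough sketch of the idea (the class and method names here are hypothetical, not the eventual design from #579):

from typing import List

class BatchCapableBackend:
    def _suggest_batch(self, texts: List[str]) -> List["SuggestionResult"]:
        # a batching backend processes all texts at once, e.g. in a single
        # forward pass on the GPU instead of one pass per document
        raise NotImplementedError

    def suggest(self, text: str) -> "SuggestionResult":
        # the current one-document API, expressed on top of the batched method
        return self._suggest_batch([text])[0]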

juhoinkinen changed the base branch from master to api-i18n on March 8, 2023 15:30
juhoinkinen closed this Mar 8, 2023
sonarcloud bot commented Mar 8, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

juhoinkinen
Member

Sorry, I did not close this intentionally. I planned to test this (and maybe add batch processing, for which support has recently been implemented), so I merged the current master into the PR branch, fixed the conflicts, and pushed. GitHub then showed a great many commits that did not belong to this PR; they could have been removed by switching the base branch to some other branch and then back (like in NatLibFi/FintoAI#12 (comment)).

So I switched the base branch to api-i18n, but then the PR got closed automatically, and I cannot reopen it, apparently because the PR and base branches share no history ("mo-fu:master branch has no history in common with NatLibFi:api-i18n"). And the base branch of a closed PR cannot be changed...

I tried to merge the current master into api-i18n, but that did not help. Maybe api-i18n could be completely rewritten to allow the PR to be reopened.

juhoinkinen referenced this pull request Mar 9, 2023
juhoinkinen
Member

As advised on SO, to reopen the PR I first tried to merge the current base branch into the PR branch with the special option:
git pull origin api-i18n --allow-unrelated-histories

However, after that I could no longer push to the branch:

To github.com:mo-fu/Annif.git
 ! [remote rejected] master -> master (permission denied)
error: failed to push some refs to 'git@github.com:mo-fu/Annif.git'

Previously I could push, e.g. the merge of the current master with conflict fixes. Maybe it is not possible to push to the branch of a closed PR, or it is because of the --allow-unrelated-histories option. Or @mo-fu could have the needed permissions; the PR is coming from his fork after all.

I created the branch mo-fu-master, which has the commits of mo-fu:master plus my merge of master with conflict fixes. But a new PR cannot be opened from that branch, because

This branch is not ahead of the upstream master.

And indeed the three-dot diff shows nothing, but the two-dot diff does.

juhoinkinen
Member

I asked GitHub support whether they can switch the base branch back to master to allow reopening this PR.

juhoinkinen
Member

GitHub support responded:

It is not possible to re-open a PR once it's merged and closed, unfortunately. This is the expected behaviour.

This is because merged and closed pull requests cannot be synchronized with the underlying Git repository. You can only re-open a closed pull request if it contains un-merged commits.

You could open a new PR based on the same merged branch, if you push at least one new commit to the pull request branch.

They are answering a slightly different question (a merged PR) than the one I actually asked about (a PR closed automatically, apparently due to the "unrelated histories" of the branches).

In any case, I can no longer push to this PR branch, so unless mo-fu can somehow fix this, the remaining option is to open a new PR (there is now the branch mo-fu-master in this Annif repository, from which a PR can be opened, whereas yesterday that seemed impossible).

mo-fu
Contributor (Author)

mo-fu commented Mar 13, 2023

@juhoinkinen I added you as a collaborator to my fork. Maybe this helps. I can also try the git pull origin api-i18n --allow-unrelated-histories later.

juhoinkinen
Member

Thanks @mo-fu, now I was able to do the git pull origin api-i18n --allow-unrelated-histories and push, but the commits did not appear in this PR. This seems different from the behavior that was reported on SO.

I also tried to use the GitHub API to change the base branch back to master and re-open the PR, but with no luck; the response was: "Cannot change the base branch of a closed pull request."

It seems there is no way this PR can be re-opened :( Or maybe by rebasing the whole mo-fu:master branch on NatLibFi:api-i18n, so the branches would definitely have a common history...? But then GitHub might still not detect the change for this PR, because I think pulling api-i18n should already have created a common history.

The easiest way to proceed would be to make a new PR from the commit Merge branch 'master' of github.com:mo-fu/Annif into mo-fu-master, which has the (nearly) current master from NatLibFi merged (mo-fu:master could just be reset to that commit to remove the merge of api-i18n).
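
Illustrative commands for that reset, where <merge-commit> stands for the SHA of the merge commit named above:

git checkout master
git reset --hard <merge-commit>
git push --force origin master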

juhoinkinen
Member

The easiest way to proceed would be to make a new PR from the commit Merge branch 'master' of github.com:mo-fu/Annif into mo-fu-master, which has the (nearly) current master from NatLibFi merged (mo-fu:master could just be reset to that commit to remove the merge of api-i18n).

Hi @mo-fu! Would you like to open a new PR for XTransformer, so that the GitHub history credits it to you? No need to worry about the PR description; we can edit it.

I could also open the PR from your fork, if you don't want to bother with this.
