
Add XTransformer backend #540

Closed
mo-fu wants to merge 2,623 commits into NatLibFi:api-i18n from mo-fu:master

Conversation

mo-fu
Contributor

mo-fu commented Dec 6, 2021

This PR adds XTransformer as an optional backend to Annif. For now it does not yet use DistilBERT in the default configuration, as this is not yet available on PyPI.
The tests for the backend resort to mocking, as training would download a pretrained model of at least 500 MB.
We should also discuss cache directories. At the moment xtransformer will download models from the Hugging Face Hub to ~/.cache/huggingface. Is this behavior desired for Annif, or should the cache be placed in the data folder?
I also haven't modified the Docker container yet. When I installed pecos in a venv it required BLAS libraries, so these would probably have to be added to the container. Additionally, pecos installs the GPU-enabled PyTorch, meaning the container size will grow. Therefore I wanted to check with you first before adding it.
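
For illustration, a project using the new backend could be configured along these lines in projects.cfg (the project name and parameter values are hypothetical examples, not defaults shipped with this PR):

[yso-xtransformer-en]
name=YSO XTransformer English
language=en
backend=xtransformer
vocab=yso
analyzer=snowball(english)

Training and evaluation then use the ordinary CLI, e.g. annif train yso-xtransformer-en <training-corpus> followed by annif eval yso-xtransformer-en <test-documents>.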

codecov bot commented Dec 6, 2021

Codecov Report

Patch coverage: 38.51% and project coverage change: -2.63 ⚠️

Comparison is base (48b23f7) 99.61% compared to head (367e493) 96.99%.

❗ Current head 367e493 differs from the pull request's most recent head aa96ebc. Consider uploading reports for the commit aa96ebc to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #540      +/-   ##
==========================================
- Coverage   99.61%   96.99%   -2.63%     
==========================================
  Files          87       89       +2     
  Lines        6034     6297     +263     
==========================================
+ Hits         6011     6108      +97     
- Misses         23      189     +166     
Impacted Files                        Coverage Δ
tests/test_backend_xtransformer.py    9.27% <9.27%> (ø)
annif/backend/xtransformer.py         12.50% <12.50%> (ø)
annif/backend/__init__.py             98.30% <83.33%> (-1.70%) ⬇️
annif/backend/fasttext.py             100.00% <100.00%> (ø)
annif/backend/omikuji.py              97.36% <100.00%> (-0.10%) ⬇️
annif/backend/stwfsa.py               100.00% <100.00%> (ø)
annif/util.py                         98.71% <100.00%> (+0.41%) ⬆️
tests/test_backend.py                 100.00% <100.00%> (ø)
tests/test_util.py                    100.00% <100.00%> (ø)


lgtm-com bot commented Dec 6, 2021

This pull request introduces 1 alert when merging beb9ea9 into 82e55c6 - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

lgtm-com bot commented Dec 6, 2021

This pull request introduces 1 alert when merging d40c6bc into 82e55c6 - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

mo-fu force-pushed the master branch 2 times, most recently from 3cc99e4 to 637e0cf on December 6, 2021 17:07
lgtm-com bot commented Dec 6, 2021

This pull request introduces 1 alert when merging 637e0cf into 82e55c6 - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

osma
Member

osma commented Dec 17, 2021

Thanks @mo-fu, great work! I like how you also enhanced atomic_save and added apply_param_parse_config, which you then used in the existing backends as well.

The tests for the backend resort to mocking as training would download a pretrained model of size at least 500 mb.

Sounds reasonable.

At the moment xtransformer will download models from the Hugging Face Hub to ~/.cache/huggingface. Is this behavior desired for Annif, or should the cache be placed in the data folder?

The punkt model for NLTK is currently handled in a similar way - it's normally placed under ~/nltk_data/. That's maybe OK for NLTK (it's only 49M) but I agree that it could make sense to place downloaded models under the data directory, as you suggest. It's also a bit similar for the new spaCy analyzer (#374) which also needs to store downloaded models and those can be quite large.

How about data/model_cache or something along those lines? Then that could be further split up with subdirectories named by type/package, for example huggingface, spacy, nltk etc.
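
The redirection itself should be straightforward; a minimal sketch in Python, assuming the TRANSFORMERS_CACHE environment variable honored by the transformers library and the model_cache layout proposed above (the function name is hypothetical):

import os

def set_huggingface_cache(datadir):
    """Point the Hugging Face model cache into the Annif data directory.
    Must be called before transformers downloads any models."""
    cache_dir = os.path.join(datadir, "model_cache", "huggingface")
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["TRANSFORMERS_CACHE"] = cache_dir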

I also haven't modified the Docker container yet. When I installed pecos in a venv it required BLAS libraries, so these would probably have to be added to the container. Additionally, pecos installs the GPU-enabled PyTorch, meaning the container size will grow. Therefore I wanted to check with you first before adding it.

So far we've tried to support all features in the container. So following that policy, we would need to add this to the container as well. But I guess it also depends on how much the container would grow? If it's very much (50% or more) then we could consider making a separate flavor of the container that includes XTransformer, or requiring the user to build their own container with XTransformer support if they need it. Does @juhoinkinen have a comment?

Also, can you give an example of how to train and test this backend, for example with the tutorial data sets?

annif/util.py Outdated
os.close(tempfd)
final name. To save a directory explicitly set filename=None."""

if filename:
Member

There isn't that much shared code between the file vs. directory cases. I wonder if it would be better to leave the atomic_save function as it is (only supporting files, not directories) and add a new function atomic_save_dir which takes care of directories only. The implementations could be simpler, without so many if clauses.
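
Something along these lines, as a rough sketch only (assuming the same save-to-temp-then-rename approach as the current atomic_save; the signature is illustrative, not the one in this PR):

import os
import shutil
import tempfile

def atomic_save_dir(obj, dirname, method=None):
    """Save obj into a temporary directory next to dirname, then move the
    temporary directory into place, so readers never see a partial save."""
    parent = os.path.dirname(os.path.normpath(dirname)) or "."
    tempdir = tempfile.mkdtemp(dir=parent)
    try:
        if method is not None:
            method(obj, tempdir)
        else:
            obj.save(tempdir)
        if os.path.exists(dirname):
            shutil.rmtree(dirname)
        os.rename(tempdir, dirname)
    except Exception:
        shutil.rmtree(tempdir, ignore_errors=True)
        raise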

juhoinkinen
Member

Installing pecos increases the Docker image size quite a lot: 1.5 GB -> 3.5 GB (the download size of PyTorch alone is nearly 900 MB).

So it looks like it would be best to have a separate image that also includes pecos.

At first I thought of making a Dockerfile.custom that a user could customize and build to get an image with only the optional backends they need, possibly including XTransformer. But actually the default Dockerfile is already quite easy to customize (the optional backends are installed on just one line, except for fastText, which is built in the builder stage). It would be simpler to just have a commented-out line with pip install .[pecos] in the default Dockerfile. (The wiki could have instructions for removing the other optional backends if they are not needed.)

But now I noticed that instead of selecting the dependencies to install by commenting/uncommenting lines, build args could be used. I'll try out how that would work.
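
For concreteness, a minimal sketch of the build-arg idea (the base image, stage layout and variable name here are illustrative assumptions, not the actual Annif Dockerfile):

FROM python:3.8-slim
# which optional backends to install; overridable at build time
ARG optional_dependencies="omikuji,yake"
COPY . /Annif
WORKDIR /Annif
RUN pip install ".[${optional_dependencies}]"

A pecos-enabled image could then be built with docker build --build-arg optional_dependencies="omikuji,yake,pecos" . without editing the Dockerfile.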

However, should we offer an image flavor in quay.io that includes Pecos (with a pecos tag or similar)?

osma
Member

osma commented Dec 20, 2021

I'm sorry, I just caused a conflict in this PR by merging #544, which optimizes the startup time of Annif and contains a complete rewrite of annif/backend/__init__.py. You will need to adjust your code accordingly. The pattern should be quite obvious - just remember to:

  • keep alphabetical order (xtransformer should be the penultimate backend, just before yake)
  • look at the code for the other optional backends (fasttext, omikuji, nn_ensemble, yake) and use the same approach (see the sketch after this list)
  • add a unit test for the missing package into tests/test_backend.py
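
For reference, the pattern for an optional backend looks roughly like this (a sketch based on the description above; the exact function and registry names in annif/backend/__init__.py may differ):

def _xtransformer():
    try:
        from . import xtransformer
        return xtransformer.XTransformerBackend
    except ImportError:
        # raised when the optional pecos dependency is not installed
        raise ValueError("XTransformer not available, not enabling xtransformer backend")

plus a registry entry that keeps xtransformer in alphabetical order just before yake, and a test in tests/test_backend.py that checks for the ValueError when the package is missing.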

osma
Member

osma commented Jan 18, 2022

I see that you fixed the conflicts and also addressed some other suggestions @mo-fu - great!
Could you also give an example:

Also, can you give an example of how to train and test this backend, for example with the tutorial data sets?

sonarcloud bot commented Feb 4, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

osma
Member

osma commented Feb 4, 2022

But now I noticed that instead of selecting the dependencies to install by commenting/uncommenting lines, build args could be used. I'll try out how that would work.

This was already implemented in PR #548 (and extended for spaCy models in PR #527), both released as part of 0.56 and documented in the wiki. So we already have the mechanism in place for customizing Docker images with build args.

However, should we offer an image flavor in quay.io that includes Pecos (with a pecos tag or similar)?

Maybe - let's postpone that discussion a little bit.

mo-fu
Contributor (Author)

mo-fu commented Mar 14, 2022

A small status update: I am trying to figure out good parameter values for the YSO data set from the tutorial repository. Unfortunately the suggest command is really slow, presumably because I run the parameter optimization on a GPU and copying the representation of every input individually takes a lot of time. I will check whether this changes when using the CPU. But maybe in the future it would be possible to explore how parallel evaluation can be improved for backends that support it directly.

sonarcloud bot commented Mar 14, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

osma mentioned this pull request Mar 14, 2022
osma
Member

osma commented Mar 14, 2022

Thanks for the update @mo-fu !
It makes a lot of sense to do parallel evaluation (and other operations as well) directly with backends that support many documents at a time. I've had similar thoughts about other backends that could do this, but the limiting factor is the current suggest method, which only supports one document at a time. I created a new issue #579 for implementing a batched version of suggest.
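
As a rough sketch of the idea (the class and method names here are hypothetical, not the eventual design from #579):

from typing import List

class BatchCapableBackend:
    def _suggest_batch(self, texts: List[str]) -> List["SuggestionResult"]:
        # a batching backend processes all texts at once, e.g. in a single
        # forward pass on the GPU instead of one pass per document
        raise NotImplementedError

    def suggest(self, text: str) -> "SuggestionResult":
        # the current one-document API, expressed on top of the batched method
        return self._suggest_batch([text])[0]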

juhoinkinen changed the base branch from master to api-i18n on March 8, 2023 15:30
juhoinkinen closed this Mar 8, 2023
sonarcloud bot commented Mar 8, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

juhoinkinen
Member

Sorry, I did not close this intentionally. I planned to test this (and maybe add batch processing, for which support has recently been implemented), so I merged the current master into the PR branch, fixed the conflicts, and pushed. GitHub then showed a great many commits that did not belong to this PR; they could have been removed by switching the base branch to some other branch and then back (like in NatLibFi/FintoAI#12 (comment)).

So I switched the base branch to api-i18n, but then the PR got closed automatically, and I cannot reopen it, apparently because the PR and base branches share no history ("mo-fu:master branch has no history in common with NatLibFi:api-i18n"). And the base branch of a closed PR cannot be changed...

I tried to merge the current master into api-i18n, but that did not help. Maybe api-i18n could be completely rewritten to allow the PR to be reopened.

juhoinkinen referenced this pull request Mar 9, 2023
juhoinkinen
Member

As advised on SO, to reopen the PR I first tried to merge the current base branch into the PR branch with the special option:
git pull origin api-i18n --allow-unrelated-histories

However, after that I could no longer push to the branch:

To github.com:mo-fu/Annif.git
 ! [remote rejected] master -> master (permission denied)
error: failed to push some refs to 'git@github.com:mo-fu/Annif.git'

Previously I could push, e.g. the merge of the current master with conflict fixes. Maybe it is not possible to push to the branch of a closed PR, or it is because of the --allow-unrelated-histories option. Or @mo-fu could have the needed permissions; the PR is coming from his fork after all.

I created the branch mo-fu-master, which has the commits of mo-fu:master plus my merge of master with conflict fixes. But a new PR cannot be opened from that branch, because

This branch is not ahead of the upstream master.

And indeed the three-dot diff shows nothing, but the two-dot diff does.

juhoinkinen
Member

I asked GitHub support whether they can switch the base branch back to master to allow reopening this PR.

juhoinkinen
Member

GitHub support responded:

It is not possible to re-open a PR once it's merged and closed, unfortunately. This is the expected behaviour.

This is because merged and closed pull requests cannot be synchronized with the underlying Git repository. You can only re-open a closed pull request if it contains un-merged commits.

You could open a new PR based on the same merged branch, if you push at least one new commit to the pull request branch.

They are answering a slightly different question (a merged PR) than the one I actually asked about (a PR closed automatically, apparently due to the "unrelated histories" of the branches).

In any case, I can no longer push to this PR branch, so unless mo-fu can somehow fix this, the remaining option is to open a new PR (there is now the branch mo-fu-master in this Annif repository, from which a PR can be opened, whereas yesterday that seemed impossible).

mo-fu
Contributor (Author)

mo-fu commented Mar 13, 2023

@juhoinkinen I added you as a collaborator to my fork. Maybe this helps. I can also try the git pull origin api-i18n --allow-unrelated-histories later.

juhoinkinen
Member

Thanks @mo-fu, now I was able to do the git pull origin api-i18n --allow-unrelated-histories and push, but the commits did not appear in this PR. This seems different from the behavior that was reported on SO.

I also tried to use the GitHub API to change the base branch back to master and re-open the PR, but with no luck; the response was: "Cannot change the base branch of a closed pull request."

It seems there is no way this PR can be re-opened :( Or maybe by rebasing the whole mo-fu:master branch on NatLibFi:api-i18n, so the branches would definitely have a common history...? But then GitHub might still not detect the change for this PR, because I think pulling api-i18n should already have created a common history.

The easiest way to proceed would be to make a new PR from the commit Merge branch 'master' of github.com:mo-fu/Annif into mo-fu-master, which has the (nearly) current master from NatLibFi merged (mo-fu:master could just be reset to that commit to remove the merge of api-i18n).
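
Illustrative commands for that reset, where <merge-commit> stands for the SHA of the merge commit named above:

git checkout master
git reset --hard <merge-commit>
git push --force origin master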

juhoinkinen
Member

The easiest way to proceed would be to make a new PR from the commit Merge branch 'master' of github.com:mo-fu/Annif into mo-fu-master, which has the (nearly) current master from NatLibFi merged (mo-fu:master could just be reset to that commit to remove the merge of api-i18n).

Hi @mo-fu! Would you like to open a new PR for XTransformer, so that the GitHub history credits it to you? No need to worry about the PR description; we can edit it.

I could also open the PR from your fork, if you don't want to bother with this.
