Add Xtransformer to backend #798

Draft · wants to merge 21 commits into base: main

Conversation

Lakshmi-bashyam

This PR adds xtransformer as an optional dependency. It builds on the previous xtransformer PR #540, incorporating minor changes and updating the backend implementation to align with the latest Annif version.

Code scanning alerts in annif/backend/xtransformer.py and annif/util.py were marked as fixed.

codecov bot commented Sep 17, 2024

Codecov Report

Attention: Patch coverage is 36.10108% with 177 lines in your changes missing coverage. Please review.

Project coverage is 97.21%. Comparing base (125565e) to head (4c33a31).
Report is 49 commits behind head on main.

Files with missing lines              Patch %   Lines
annif/backend/xtransformer.py           7.36%   88 Missing ⚠️
tests/test_backend_xtransformer.py      9.27%   88 Missing ⚠️
annif/backend/__init__.py              83.33%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #798      +/-   ##
==========================================
- Coverage   99.65%   97.21%   -2.44%     
==========================================
  Files          91       95       +4     
  Lines        6886     7210     +324     
==========================================
+ Hits         6862     7009     +147     
- Misses         24      201     +177     


@@ -95,6 +95,16 @@ def test_get_backend_yake_not_installed():
    assert "YAKE not available" in str(excinfo.value)


@pytest.mark.skipif(
    importlib.util.find_spec("pecos") is not None,
    reason="test requires that YAKE is NOT installed",

Member:

PECOS, not YAKE, right?

@Lakshmi-bashyam (Author):

Oops, yes. Thanks for catching it.

@osma (Member) commented Sep 25, 2024

Thanks a lot for this new PR @Lakshmi-bashyam ! It really helps to have a clean starting point based on the current code.

We've now tested this briefly. We used the PLC (YKL) classification task, because it seemed simpler than predicting YSO subjects, and the current classification quality (mainly using Omikuji Parabel and Bonsai) is not that good, so it seemed likely that a new algorithm could achieve better results. (And it did!)

I set this up in the University of Helsinki HPC environment. We got access to an A100 GPU (which is way overkill for this...) so it was possible to train and evaluate models in a reasonable time.

Here are some notes, comments and observations:

Default BERT model missing

Training a model without setting model_shortcut didn't work for me. Apparently the model distilbert-base-multilingual-uncased cannot be found on HuggingFace Hub (maybe it has been deleted?). I set model_shortcut="distilbert-base-multilingual-cased" and it started working. (Later I changed to another BERT model, see below)

Documentation and advice

There was some advice and a suggested config in this comment from Moritz. I think we would need something like this to guide users (including us at NLF!) on how to use the backend and what configuration settings to use. Eventually this could be a wiki page for the backend like the others we have already, but for now just a comment in this PR would be helpful for testing.

Here is the config I currently use for the YKL classification task in Finnish:

[ykl-xtransformer-fi]
name="YKL XTransformer Finnish"
language="fi"
backend="xtransformer"
analyzer="simplemma(fi)"
vocab="ykl"
batch_size=16
truncate_length=256
learning_rate=0.0001
num_train_epochs=3
max_leaf_size=18000
model_shortcut="TurkuNLP/bert-base-finnish-cased-v1"

Using the Finnish BERT model improved results a bit compared to the multilingual BERT model. It's a little slower and takes slightly more VRAM (7GB instead of 6GB in this task), probably because it's not a DistilBERT model.

This configuration achieves a Precision@1 score of 0.59 on the Finnish YKL classification task, which is slightly higher than what we get with Parabel and Bonsai (0.56-0.57).

If you have any insight into how to choose appropriate configuration settings based on e.g. the training data size, vocabulary size, task type, available hardware etc., that would be very valuable to include in the documentation. Pecos has tons of hyperparameters!

Example questions that I wonder about:

  1. Does the analyzer setting affect what the BERT model sees? I don't think so?
  2. How to select the number of epochs? (so far I've tried 1, 2 and 3 and got the best results with 3 epochs)
  3. How to set truncate_length and what is the maximum value? Can I increase it from 256 if my documents are longer than this?
  4. How to set max_leaf_size?
  5. How to set batch_size?
  6. Are there other important settings/hyperparameters that could be tuned for better results?

Pecos FutureWarning

I saw this warning a lot:

/home/xxx/.cache/pypoetry/virtualenvs/annif-fDHejL2r-py3.10/lib/python3.10/site-packages/pecos/xmc/xtransformer/matcher.py:411: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.

However, I think this is a problem in Pecos and probably not something we can easily fix ourselves. Maybe it will be fixed in a later release of Pecos. (I used libpecos 1.25 which is currently the most recent release on PyPI)
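
In the meantime, if the warning gets too noisy in the logs, it could in principle be filtered on the Annif side; a minimal sketch (assuming the warning text stays as above):

import warnings

# Suppress the torch.load FutureWarning raised from inside Pecos; the message
# argument is a regex matched against the start of the warning text.
warnings.filterwarnings(
    "ignore",
    category=FutureWarning,
    message=r"You are using `torch.load` with `weights_only=False`",
)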

Not working under Python 3.11

I first tried Python 3.11, but it seemed that there was no libpecos wheel for this Python version available on PyPI (and it couldn't be built automatically for some reason). So I switched to Python 3.10 for my tests. Again, this is really a problem with libpecos and not with the backend itself.

Unit tests not run under CI

The current tests seem to do a lot of mocking to avoid actually training models. This is probably sensible since actually training a model could require lots of resources. However, the end result is that test coverage is quite low, with less than 10% of lines covered.

Looking more closely, it seems like most of the tests aren't currently executed at all under GitHub Actions CI. I suspect this is because xtransformer is an optional dependency and it's not installed at all in the CI environment, so the tests are skipped. Fixing this in the CI config (.github/workflows/cicd.yml) should substantially improve the test coverage.
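
For reference, the pattern that typically makes such tests disappear silently when an optional dependency is missing is a module-level skip; a sketch of what the test module presumably uses (assumed, not copied from the PR):

import pytest

# When libpecos is not installed (as in the current CI environment), this
# skips every test in the module instead of failing it, which is why the
# reported coverage for the backend stays so low.
pecos = pytest.importorskip("pecos")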

Code style and QA issues

There are some complaints from QA tools about the current code. These should be easy to fix. Not super urgent, but they should be fixed before we can consider merging this. (If some things are hard to fix we can reconsider them case by case)

  • Lint with Black fails in the CI run. The code doesn't follow Black style. Easy to fix by running black
  • SonarCloud complains about a few variable names and return types
  • github-advanced-security complains about imports (see previous comment above)

Dependency on PyTorch

Installing this optional dependency brings in a lot of dependencies, including PyTorch and CUDA. The virtualenv in my case (using poetry install --all-extras) is 5.7GB, while another one for the main branch (without pecos) is 2.6GB, an increase of over 3GB. I wonder if there is any way to reduce this? Especially if we want to include this in the Docker images, the huge size could become a problem.

Also, the NN ensemble backend is implemented using TensorFlow. It seems a bit wasteful to depend on both TensorFlow and PyTorch. Do you think it would make sense to try to reimplement the NN ensemble in PyTorch? This way we could at least drop the dependency on TensorFlow.
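
Just to illustrate what that could look like, here is a minimal PyTorch sketch of a feed-forward score combiner; this is not the actual Annif NN ensemble architecture, only an indication that the dependency swap is feasible:

import torch
from torch import nn

class EnsembleNet(nn.Module):
    """Illustrative only: combines per-source score vectors into one."""

    def __init__(self, n_sources: int, n_subjects: int, hidden: int = 100):
        super().__init__()
        self.combine = nn.Sequential(
            nn.Flatten(),                               # (batch, sources * subjects)
            nn.Linear(n_sources * n_subjects, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, n_subjects),
            nn.Sigmoid(),                               # scores in [0, 1]
        )

    def forward(self, source_scores: torch.Tensor) -> torch.Tensor:
        # source_scores has shape (batch, n_sources, n_subjects)
        return self.combine(source_scores)

# scores = torch.rand(8, 3, 1000)        # 8 docs, 3 sources, 1000 subjects
# EnsembleNet(3, 1000)(scores).shape     # torch.Size([8, 1000])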


Again, thanks a lot for this and apologies for the long silence and the long comments! We can of course do some of the remaining work to get this integrated and merged on our side, because this seems like a very useful addition to the Annif backends. Even if you don't have any time to work on the code, just providing some advice on the configuration side would help a lot! For instance, example configurations you've used at ZBW would be nice to see.

Quality Gate failed

Failed conditions
11.5% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

@juhoinkinen (Member):

Especially if we want to include this in the Docker images, the huge size could become a problem.

I built a Docker image from this branch, and its size is 7.21 GB, which is quite a bit bigger than the Annif 1.1 image at 2.07 GB.

Not all users and use cases will need XTransformer or the other optional dependencies, so we could build different variants of the image and push them to quay.io (just by setting different build args in the GitHub Actions build step and tagging the images appropriately). But that can be done in a separate PR; I'll create an issue for this now.

@katjakon commented Oct 1, 2024

Hello,
Thank you for your work on this PR!
At the German National Library, we are also experimenting with XR-Transformer. We would be glad to contribute, especially with regard to documentation and training advice.

A good starting point might be the hyperparameters used in the original paper. They can be found here. Different settings were used for different datasets.

We also observed that the choice of Transformer model can have an impact on the results. In the original paper and in our experiments, RoBERTa models performed well. We used xlm-roberta-base, a multilingual model trained on 100 languages.

Are there other important settings/hyperparameters that could be tuned for better results?

We found that tuning the hyperparameters associated with the Partitioned Label Tree (known as Indexer in XR-Transformer) and the hyperparameters of the OVA classifiers (known as Ranker in XR-Transformer) led to notable improvements in our results. In particular:

  • nr_splits (& min_codes): Number of child nodes. This hyperparameter can be compared to cluster_k in Omikuji. For us, bigger values like 256 led to better results.
  • max_leaf_size: We observed that bigger values perform better. We currently use 400.
  • Cp & Cn are the costs for wrongly classified labels used in the OVA classifiers. Cp is the cost for wrongly classified positive labels, Cn is the cost for negative labels. Using different penalties for positive and negative labels is especially helpful when labels are imbalanced, which is usually the case for the OVA classifiers. These hyperparameters had a huge influence on our results. Further reading
  • threshold: A regularisation method. Model weights in the OVA classifiers that fall below the threshold are set to zero. Choosing a high value here will reduce model size, but might lead to a model that is underfitting. Choosing a very low value might lead to overfitting. We achieve good performance with 0.015.

As far as I can tell, some of these are not currently integrated in the PR here.

How to set truncate_length and what is the maximum value? Can I increase it from 256 if my documents are longer than this?

The maximum length of the transformer model limits this. For instance, for BERT this is 512. The authors noted that there was no significant performance increase when using 512, and we observed the same thing.
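
If in doubt, the tokenizer of the chosen model reports its maximum sequence length; a quick check with the Hugging Face transformers API (using the Finnish BERT model from the config above purely as an example):

from transformers import AutoTokenizer

# model_max_length is the longest input the model was trained with;
# truncate_length cannot usefully exceed it.
tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
print(tokenizer.model_max_length)  # 512 for BERT-style models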

How to set batch_size?

This also depends on how big a batch fits into the memory of the GPU/CPU that is used. Generally, starting out with a value like 32 or 64 works well, then increasing it (if possible) to see if this leads to improvements. I also found this forum exchange where it's stated that:

Batch size is a slider on the learning process.
Small values give a learning process that converges quickly at the cost of noise in the training process.
Large values give a learning process that converges slowly with accurate estimates of the error gradient.

I have attached the hyperparameter configuration file that we currently use. Even though we don't use Annif in our experiments, I hope this can still provide some helpful insights. params.txt

I am happy to answer any questions and contribute to the Wiki if needed!

"max_active_matching_labels": int,
"max_num_labels_in_gpu": int,
"use_gpu": boolean,
"bootstrap_model": str,
@katjakon commented Nov 20, 2024

Regarding my previous comments about hyperparameters: it should be fairly easy to incorporate additional ones.
Adding the following lines to PARAM_CONFIG would allow us to make use of the hyperparameters Cp and Cn in the project configurations:

"Cn": float,
"Cp": float,

And similarly for the dict DEFAULT_PARAMETERS:

"Cn": 1.0,
"Cp": 1.0,

Let me know if there are any questions!
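
Put together, the relevant part of the backend module would then look roughly like this (a sketch, not the exact code in this PR; the existing entries are elided):

PARAM_CONFIG = {
    # ... existing entries such as "max_leaf_size": int ...
    "Cn": float,
    "Cp": float,
}

DEFAULT_PARAMETERS = {
    # ... existing defaults ...
    "Cn": 1.0,
    "Cp": 1.0,
}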

@katjakon

Validation Data during Training

I've been testing this Annif version with XTransformer and so far it's working pretty well. Thanks again!
However, I noticed that no validation data is used during training. I think validation is crucial for XTransformer to avoid overfitting and to save only the best performing model checkpoints.
Is there any way to include a validation file, especially when using the annif train command?
I would appreciate any comments or hints!

@osma (Member) commented Nov 21, 2024

Thank you very much @katjakon for your very insightful comments!

@mfakaehler

I have just discussed the options for integrating validation data into the backend with @katjakon. I agree with Katja that avoiding overfitting in the training process is crucial.
We see two options:

  • a) add another argument to annif train, so the user can pass a separate validation dataset
  • b) implement a splitting procedure as part of the backend

My colleagues who operate Annif in practice at DNB usually have a validation split in their data management, so I think option a) would be feasible at DNB. Any opinions on this?

@osma (Member) commented Nov 27, 2024

Thanks for your insight @mfakaehler and @katjakon ! I agree that making it possible to provide a separate validation data set during XTransformer training makes sense. But the CLI would have to accommodate this.

The annif train command can already take any number of paths (so you can pass multiple train files/directories), which means adding another positional argument isn't easy. However, there could be a new option such as --validate that takes a path to a validation data set (it could even be repeatable, I think, if there's a need for passing multiple paths). So the train command could look like this:

annif train my-xtransformer --validate validate.tsv.gz train1.tsv.gz train2.tsv.gz train3.tsv.gz

Then the question becomes: should --validate be a required parameter when training XTransformer? Or would the backend in that case perform the split on its own? (perhaps defaulting to e.g. 10% for validation, with another option to override the fraction).
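
For illustration, a repeatable --validate option in a Click-based command could be declared roughly like this (hypothetical sketch, not the actual annif CLI code):

import click

@click.command("train")
@click.argument("project_id")
@click.argument("paths", type=click.Path(exists=True), nargs=-1)
@click.option(
    "--validate",
    "validate_paths",
    type=click.Path(exists=True),
    multiple=True,
    help="Path to a validation document set; may be given multiple times.",
)
def run_train(project_id, paths, validate_paths):
    """Sketch only: the validation corpus would be passed on to the backend
    alongside the training corpus."""
    click.echo(f"training {project_id} on {paths}, validating on {validate_paths}")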

@mfakaehler

A default logic like

  • user provides the --validate argument
  • if not: user provides splitting fraction
  • if not: splitting fraction is set to 10%

as you suggested, seems plausible to me!
Let's wait until @Lakshmi-bashyam returns and see if there is an argument for the use case of no validation data at all. Maybe there is a need for that, too.
