This repository provides a framework for evaluating the expressivity of TCR embeddings, with the aim of understanding how best to map the functional landscape of T cell receptors.
Currently, support has been implemented for the following:
- Physico-chemical Embeddings: Atchley Factors [1], Kidera Factors [2], Amino Acid Properties [3], Random Embeddings (for control)
- LLM Embeddings: SCEPTR [4], TCR-BERT [5]
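Per-residue schemes such as Atchley factors map each amino acid to a fixed numeric vector, so a CDR3 sequence becomes a matrix of shape (length, dimension). A minimal sketch of that lookup mechanism, using random placeholder values rather than the published factor tables:

```python
import numpy as np

# Placeholder per-residue table -- random values, NOT the published
# Atchley factors; this only illustrates the lookup mechanism.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(0)
TABLE = {aa: rng.standard_normal(5) for aa in AMINO_ACIDS}

def embed_sequence(seq: str) -> np.ndarray:
    """Stack the per-residue vectors of a CDR3 sequence into an (L, 5) matrix."""
    return np.stack([TABLE[aa] for aa in seq])

print(embed_sequence("CASSLGQAYEQYF").shape)  # (13, 5)
```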
The code provides a flexible implementation with datasets, embedding models and hyperparameters for training models, together with scripts to assess the effectiveness of the embeddings.
Should any user wish to add permanent support for more embedding models, branching and submitting git merge requests are more than welcome, subject to satisfying the Continuous-Integration requirements detailed in subsequent sections.
Note
If I have not responded to merge requests after some time, please feel free to contact me.
- conda create --name tcr_embeddings python=3.12
- conda activate tcr_embeddings
- python -m pip install poetry
- poetry install
- python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
Default training hyperparameters have been placed under tcr_embeddings/constants.json. You may change the hyperparameters of the training loop at your discretion.
Tip
You may want to disable CUDA when training fairly small models, as GPU acceleration may not be beneficial for models with a small number of parameters.
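The heuristic in the tip can be sketched as a simple device-selection rule; the parameter-count threshold below is an illustrative assumption, not a value used by this repository:

```python
def choose_device(n_params: int, cuda_available: bool,
                  threshold: int = 1_000_000) -> str:
    """Prefer CPU for small models: kernel-launch and host-to-device
    transfer overheads can outweigh any GPU speedup."""
    if cuda_available and n_params >= threshold:
        return "cuda"
    return "cpu"

print(choose_device(50_000, cuda_available=True))     # small model -> cpu
print(choose_device(5_000_000, cuda_available=True))  # large model -> cuda
```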
Tip
Do not change runtime_constants.py
On your first-ever execution, run the following to generate the configuration file:
python -m tcr_embeddings.trainer --make
This generates a config.json in your local directory. In that file, you can change the more frequently adjusted parameters, such as which fold within the k-Fold to run, which embedding method to use, the reduction method, etc.
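The generated config.json can be edited by hand or programmatically. A sketch of the programmatic route; the key names shown are hypothetical assumptions, so inspect your own generated file for the real ones:

```python
import json
from pathlib import Path

def update_config(path: str, **overrides) -> dict:
    """Load a JSON config, apply key overrides, and write it back."""
    p = Path(path)
    config = json.loads(p.read_text())
    config.update(overrides)
    p.write_text(json.dumps(config, indent=4))
    return config

# Hypothetical keys -- check your generated config.json for the real names:
# update_config("config.json", kfold=0, encoding_method="sceptr")
```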
For any dataset, please clean the data such that the following is satisfied:
- The data is in a tsv file format; the only difference from a csv is that a tsv uses tabs to separate columns where csvs use commas.
- The data has columns "TRAV", "TRBV", "CDR3A", "CDR3B". You may put more columns in there at your discretion, but this will slow down execution during data reading. Two sample files have been placed here and here.
- Place your data in the directory data/location. You may have multiple directories where the data shares the same label.
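The requirements above can be checked before training with a small validation helper. This is a sketch, not a function the repository itself exposes:

```python
import pandas as pd

REQUIRED_COLUMNS = {"TRAV", "TRBV", "CDR3A", "CDR3B"}

def validate_repertoire(path: str) -> pd.DataFrame:
    """Load a repertoire TSV and check the mandatory columns are present."""
    df = pd.read_csv(path, sep="\t")
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"{path} is missing required columns: {sorted(missing)}")
    return df
```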
Tip
To export your DataFrame to a tsv, you can use df.to_csv("<filename>.tsv", sep="\t").
To generate consistent K-Fold cross validation sets (such that all training instances uses the same K-Fold), run the following after placing your data in the right location:
python -m data.create-kfold
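The key to consistency is a fixed random seed: every training instance then sees identical folds. A minimal sketch of the idea, not the script's actual implementation:

```python
import random

def make_kfold(n_samples: int, k: int, seed: int = 42) -> list[list[int]]:
    """Shuffle indices with a fixed seed so every run yields the same folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    return [indices[i::k] for i in range(k)]

folds = make_kfold(10, k=5)
assert folds == make_kfold(10, k=5)  # deterministic across runs
```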
- For TCR-BERT:
python -m tcr_embeddings.embed.download-tcrbert
- For (re)creating Random Embeddings:
python -m tcr_embeddings.embed.create_random
Note
You cannot use an embedding method before you have downloaded it.
To train a model, run the following:
python -m tcr_embeddings.trainer --config <path/to/your/config>/config.json
The training logs will be found within the output path specified in config.json, where the config.json file can be generated with python -m tcr_embeddings.trainer --make.
If your training has been paused / terminated in the middle and you wish to resume, run the following:
python -m tcr_embeddings.resume --dir path/to/your/resuming_instance
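Resuming typically amounts to locating the most recent checkpoint in the run directory. A sketch of that logic, using hypothetical "epoch-&lt;n&gt;.pt" filenames (check your own output directory for the actual naming scheme):

```python
import re
from pathlib import Path

def latest_checkpoint(run_dir: str):
    """Return the checkpoint file with the highest epoch number, or None.

    Assumes hypothetical 'epoch-<n>.pt' filenames.
    """
    ckpts = []
    for p in Path(run_dir).glob("epoch-*.pt"):
        m = re.fullmatch(r"epoch-(\d+)\.pt", p.name)
        if m:
            ckpts.append((int(m.group(1)), p))
    return max(ckpts, key=lambda t: t[0])[1] if ckpts else None
```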
Inside the output path, you will find the following:
- Checkpoints for each epoch, along with the training loss csv, validation loss csv and testing loss csv
- K-Folds used
- Training Logs
- Parquet Files of TCRs with non-zero weights, where the repertoire has been correctly classified within the test set.
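The per-epoch loss CSVs can be inspected directly, for example to find the best validation epoch. A sketch; the filenames and column names in your output directory may differ:

```python
import pandas as pd

def best_epoch(loss_csv: str, column: str = "loss") -> int:
    """Return the epoch (row index) with the lowest recorded loss.

    Assumes a hypothetical single 'loss' column -- adjust to match
    the actual CSVs in your output path.
    """
    return int(pd.read_csv(loss_csv)[column].idxmin())
```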
We provide a few scripts to analyse your results, placed under analysis. The Jupyter notebooks already provide a comprehensive description of what they do; however, here is an overview:
To maintain good standards of code, the following CI/CD procedures must be followed. Unittests must be properly designed and cover edge cases. CI procedures within the workflow must fully pass prior to any merge request into origin/master.
- black
- isort
- flake8
- mypy
- unittest / pytest
Details for CI/CD can be found here
- Atchley, W.R., Zhao, J., Fernandes, A.D., Druke, T.: Solving the protein sequence metric problem. Proceedings of the National Academy of Sciences 102(18), 6395–6400 (2005) https://doi.org/10.1073/pnas.0408677102
- Kidera, A., Konishi, Y., Oka, M., Ooi, T., Scheraga, H.A.: Statistical analysis of the physical properties of the 20 naturally occurring amino acids. Journal of Protein Chemistry 4(1), 23–55 (1985) https://doi.org/10.1007/bf01025492
- Elhanati, Y., Sethna, Z., Marcou, Q., Callan, C.G., Mora, T., Walczak, A.M.: Inferring processes underlying b-cell repertoire diversity. Philosophical Transactions of the Royal Society B: Biological Sciences 370(1676), 20140243 (2015) https://doi.org/10.1098/rstb.2014.0243
- Nagano, Y., Pyo, A., Milighetti, M., Henderson, J., Shawe-Taylor, J., Chain, B., Tiffeau-Mayer, A.: Contrastive learning of T cell receptor representations (2024). https://arxiv.org/abs/2406.06397
- Wu, K., Yost, K.E., Daniel, B., Belk, J.A., Xia, Y., Egawa, T., Satpathy, A., Chang, H.Y., Zou, J.: TCR-BERT: learning the grammar of T-cell receptors for flexible antigen-binding analyses (2021) https://doi.org/10.1101/2021.11.18.469186