Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IEDB database cdr3_aa stored as junction_aa #469

Open
racng opened this issue Dec 14, 2023 · 10 comments
Open

IEDB database cdr3_aa stored as junction_aa #469

racng opened this issue Dec 14, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@racng
Copy link

racng commented Dec 14, 2023

Describe the bug
IEDB database provides cdr3_aa sequences instead of junction_aa sequences. scirpy database import code puts this in the slot for junction_aa. The sequence would then be missing the flanking amino acids. This makes it inaccurate to do identity string matching with ir_dist/ir_query.

To Reproduce
https://github.com/scverse/scirpy/blob/d862cf35740a79e91f95c49ce6da1fbe280f8c1c/src/scirpy/datasets/__init__.py#L338C1-L371C10

Expected behaviour
Is there a way to format cdr3 into junction sequence? like add the missing "C" at the beginning based on V call? Do you have advice on how to do this?

@racng racng added the bug Something isn't working label Dec 14, 2023
@grst
Copy link
Collaborator

grst commented Dec 15, 2023

Hi Rachel,

thanks for bringing this up. This was meant as a workaround to make it work with ir_dist (which is currently hardcoded to work on the junction sequence).

A proper solution is probably to

  • convert junction to cdr3 (which means just removing the constant amino acids)
  • provide an option in ir_dist to use the cdr3 column instead of junction_aa

For now, as a workaround, I would suggest to store CDR3 sequences in junction_aa in both query and database object.
You can trim the string with awkward array functions:

import awkward as ak
mdata["airr"].obsm["airr"]["junction_aa"] = ak.str.slice(mdata["airr"].obsm["airr"]["junction_aa"], 1, -1)

@grst
Copy link
Collaborator

grst commented Dec 15, 2023

@zktuong, just wanted to ask you to be sure

  • junction_aa -> cdr3 should be trivial, right? Just clip off the first and last amino acid. Any caveats?
  • cdr3 -> junction_aa: Is it correct that the first AA is always C and the last is always W/F? Can this be robustly inferred from the V call?

@zktuong
Copy link
Contributor

zktuong commented Dec 21, 2023

junction_aa -> cdr3 should be trivial, right? Just clip off the first and last amino acid. Any caveats?

yes should be correct.

cdr3 -> junction_aa: Is it correct that the first AA is always C and the last is always W/F? Can this be robustly inferred from the V call?

yes first is always C and the last should always be either F/W. Should be able to infer the F/W from the second codon from start of the J call (TTT/TTC for F or TGG for W).

@grst grst mentioned this issue Jan 9, 2024
3 tasks
@grst
Copy link
Collaborator

grst commented Jan 9, 2024

Hi @racng,

I believe I fixed this in #476, by extracting the junction_aa sequence from the full protein sequence based on the start/end coordinates provieded in IEDB. Would be great if you could give this a try!

You can install that version using

pip install git+https://github.com/scverse/scirpy@issue-469

Please make sure to remove the cached version of iedb before rerunning ir.datasets.iedb by deleting the data directory that was created.

@racng
Copy link
Author

racng commented Jan 10, 2024

@grst Thank you for working on a fix for this issue!
I have copied the new function into a script and tested it.
For some reason I am getting a lot of NaN values for both junction_aa and cdr3_aa.

adata = iedb(cached=True, cache_path='test/iedb.h5ad')

ir.get.airr(adata, 'junction_aa')['VDJ_1_junction_aa'].isna().sum()
# 30080

ir.get.airr(adata, 'cdr3_aa')['VDJ_1_cdr3_aa'].isna().sum()
# 30110

# Old copy of iedb reference
ir.get.airr(adata_old, 'junction_aa')['VDJ_1_junction_aa'].isna().sum()
# 118

@grst
Copy link
Collaborator

grst commented Jan 11, 2024

Bad news! I took another look and it seems that indeed the Start/End position and and Protein sequence are only available for ~5000 receptors:

>>> iedb_df["Chain 1 Protein Sequence"].dropna().size
5083

For the rest, there is "CDR3 Curated", but not "CDR3 Calculated" available. Taking a closer look at "CDR3 Curated", it seems that some (but not all) sequences there are actually junction sequences, including the C and W/F.

image

So I'm afraid this would take quite some cleanup to get it right!


On the scirpy side, I could consider adding the option to use cdr3_aa sequences, instead of junction_aa, for sequence comparisons. Then the CDR3 sequences in the database could remain as they are.

@racng
Copy link
Author

racng commented Jan 11, 2024

I think I found a solution. We can use the J-motif sequences in J-region-motifs.tsv generated by IMGTgeneDL run in stitchr mode to determine the terminal residue of the J gene. It "contains automatically inferred CDR3 junction ending motifs and residues (using the process established in the autoDCR TCR assignation tool)".

I am attaching it here:
J-region-motifs.txt

@grst
Copy link
Collaborator

grst commented Jan 12, 2024

@zktuong, what do you say about this one? Is that wrong or the always existing exception to the rule in Biology?
image

(L and V are "not confident" but C is)
(this is from above J-junction-motifs.txt)

@zktuong
Copy link
Contributor

zktuong commented Jan 18, 2024

Looks like they are correct and are treated as exceptions, but probably non-functional

For TRAJ35*01:

(6) noncanonical J-MOTIF: Cys-Gly-X-Gly instead of Phe-Gly-X-Gly. The functionality of TRAJ35*01, initially considered as ORF owing to the presence of Cys (C118) instead of Phe (F118) in the J-MOTIF, has been changed to F (functional) as TRAJ35*01 has been found rearranged and expressed in several specific cDNA clones (comments added by G. Folch and M.-P. Lefranc on the 23/01/2019, following the identification of an additional clone and a question by C. Joos (Germany)).

https://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRAJ
Genecard says it's non-functional
https://www.genecards.org/cgi-bin/carddisp.pl?gene=TRAJ35

TRBJ2-2P*01 also looks like an exception and non-functional open reading frame

(1) This sequence has diverged considerably from that of other TRBJ2-2: it has no canonical J-NONAMER [2], J-PHE 118 is replaced by Gly (G) and G 119 by R in J-MOTIF.

https://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRBJ
https://www.genecards.org/cgi-bin/carddisp.pl?gene=TRBJ2-2P

TRBJ2-7*02

(2) J-PHE replaced 118 replaced by Val in J-MOTIF.

@zktuong
Copy link
Contributor

zktuong commented Jan 18, 2024

i just checked the IgBLAST auxiliary files that indicate where the CDR3 end is and crossed it to the fasta files and it looks like the codons are correct for those 3 genes as well. so should be alright to use the J-junction-motifs.txt

@grst grst added this to scirpy-dev May 28, 2024
@grst grst moved this to On Hold in scirpy-dev May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: On Hold
Development

Successfully merging a pull request may close this issue.

3 participants