IEDB database cdr3_aa stored as junction_aa #469
Comments
Hi Rachel, thanks for bringing this up. This was meant as a workaround to make it work with A proper solution is probably to
For now, as a workaround, I would suggest to store CDR3 sequences in
import awkward as ak
mdata["airr"].obsm["airr"]["junction_aa"] = ak.str.slice(mdata["airr"].obsm["airr"]["junction_aa"], 1, -1)
@zktuong, just wanted to ask you to be sure
Yes, should be correct.
Yes, the first residue is always C and the last should always be either F or W. It should be possible to infer F/W from the second codon from the start of the J call (TTT/TTC for F or TGG for W).
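A minimal sketch of the rule described above: the junction is the CDR3 with the conserved C prepended and the J anchor residue (F or W) appended. The helper names and the example sequence are hypothetical and not part of scirpy, and the codon table only covers the three codons mentioned in this thread.

```python
# Hypothetical helpers for illustration; not part of scirpy.

# Only the anchor codons named in this thread are covered.
J_ANCHOR_CODONS = {"TTT": "F", "TTC": "F", "TGG": "W"}


def j_anchor_from_codon(codon: str) -> str:
    """Translate a J anchor codon to F or W; reject anything else."""
    codon = codon.upper()
    if codon not in J_ANCHOR_CODONS:
        # Non-canonical anchors are rejected rather than guessed.
        raise ValueError(f"not a canonical F/W anchor codon: {codon!r}")
    return J_ANCHOR_CODONS[codon]


def cdr3_to_junction(cdr3_aa: str, j_anchor: str) -> str:
    """Add the flanking residues: leading C and trailing F or W."""
    if j_anchor not in ("F", "W"):
        raise ValueError(f"expected 'F' or 'W', got {j_anchor!r}")
    return "C" + cdr3_aa + j_anchor


# Made-up example sequence.
junction = cdr3_to_junction("ASSLGQ", j_anchor_from_codon("TTC"))
print(junction)
```

Raising on unknown codons keeps J genes with unusual anchors visible instead of silently producing a wrong junction.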
Hi @racng, I believe I fixed this in #476, by extracting the
You can install that version using
pip install git+https://github.com/scverse/scirpy@issue-469
Please make sure to remove the cached version of iedb before rerunning
@grst Thank you for working on a fix for this issue!
adata = iedb(cached=True, cache_path='test/iedb.h5ad')
ir.get.airr(adata, 'junction_aa')['VDJ_1_junction_aa'].isna().sum()
# 30080
ir.get.airr(adata, 'cdr3_aa')['VDJ_1_cdr3_aa'].isna().sum()
# 30110
# Old copy of iedb reference
ir.get.airr(adata_old, 'junction_aa')['VDJ_1_junction_aa'].isna().sum()
# 118
Bad news! I took another look and it seems that indeed the Start/End position and Protein sequence are only available for ~5000 receptors:
>>> iedb_df["Chain 1 Protein Sequence"].dropna().size
5083
For the rest, there is "CDR3 Curated", but not "CDR3 Calculated" available. Taking a closer look at "CDR3 Curated", it seems that some (but not all) sequences there are actually junction sequences, including the
So I'm afraid this would take quite some cleanup to get it right! On the scirpy side, I could consider adding the option to use
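Since some but not all "CDR3 Curated" entries appear to be junctions, a first cleanup pass could flag entries that already carry both anchor residues. The following is only a rough heuristic sketch with a hypothetical helper name and made-up sequences: a genuine CDR3 can itself start with C or end with F/W, so some entries would be misclassified.

```python
# Hypothetical heuristic for illustration; not part of scirpy.

def looks_like_junction(seq: str) -> bool:
    """Return True if seq starts with C and ends with F or W."""
    seq = seq.strip().upper()
    return len(seq) >= 3 and seq[0] == "C" and seq[-1] in ("F", "W")


# Made-up examples: one with both anchors, one without.
curated = ["CASSLGQF", "ASSLGQ"]
flags = [looks_like_junction(s) for s in curated]
print(flags)
```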
I think I found a solution. We can use the J-motif sequences in
I am attaching it here:
@zktuong, what do you say about this one? Is that wrong, or is it the ever-present exception to the rule in biology? (L and V are "not confident" but C is)
Looks like they are correct and are treated as exceptions, but probably non-functional.
For TRAJ35*01:
https://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRAJ
TRBJ2-2P*01 also looks like an exception and a non-functional open reading frame:
https://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRBJ
TRBJ2-7*02
I just checked the IgBLAST auxiliary files that indicate where the CDR3 ends and cross-checked them against the FASTA files, and it looks like the codons are correct for those 3 genes as well. So it should be alright to use the
Describe the bug
The IEDB database provides cdr3_aa sequences instead of junction_aa sequences. The scirpy database import code puts them in the slot for junction_aa, so the stored sequences are missing the flanking amino acids. This makes identity string matching with ir_dist/ir_query inaccurate.
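To illustrate the mismatch in plain Python (made-up sequence, hypothetical helper name, no scirpy calls): a query junction never equals a reference entry that is actually a CDR3, unless the flanking residues are stripped from the query first.

```python
# Illustration only; the sequence is made up and no scirpy API is used.

def trim_junction(junction_aa: str) -> str:
    """Drop the first and last residue, i.e. the flanking anchors."""
    return junction_aa[1:-1]


query_junction = "CASSLGQF"   # junction from the user's own data
reference_entry = "ASSLGQ"    # CDR3 stored in the junction_aa slot

naive_match = query_junction == reference_entry
trimmed_match = trim_junction(query_junction) == reference_entry
print(naive_match, trimmed_match)
```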
To Reproduce
https://github.com/scverse/scirpy/blob/d862cf35740a79e91f95c49ce6da1fbe280f8c1c/src/scirpy/datasets/__init__.py#L338C1-L371C10
Expected behaviour
Is there a way to convert a cdr3 sequence into a junction sequence, for example by adding the missing "C" at the beginning based on the V call? Do you have advice on how to do this?