-
Notifications
You must be signed in to change notification settings - Fork 6
Exitron and exitron derived neoantigen identification with ScanExitron and ScanNeo
Thank you for your interest in using ScanExitron to identify exitron splicing events and ScanNeo to identify exitron-derived neoantigens in your RNA-seq dataset. In this tutorial, I am going to show you how to prepare data, select appropriate parameters, and run ScanExitron and ScanNeo using a toy example dataset and a real dataset. While this vignette accurately portrays the steps I would take to run the software and analyze the results, I do not go into detail on all of the parameters and limitations you should consider when running this on your data. Before starting, please read the original paper Wang et al., 2021.
Through this tutorial I will show you the following things:
- Data preparation
- Software installation
- Preparing the reference genome sequences and gene annotation files
- Run ScanExitron for the example dataset
- Run ScanNeo for the example dataset
- Run this protocol for the TCGA-PRAD dataset
Example RNA-seq data is available at GitHub example.bam, example.bai
TCGA PRAD RNA-seq dataset is available at NCI Genomic Data Commons. The metainfo for this dataset and downloading commands are available at GitHub TCGA_data. But before you start the downloading work, you have to have an eRA commons account. Once you got one, you can login NCI Genomic Data Commons to obtain an authentication token gdc-user-token.txt so as to download GDC controlled data such as PRAD RNA-seq data.
You can create a folder named BAMs. Then run obtain_bam.sh to download PRAD RNA-seq BAM files. Here shows the first two lines in "obtain_bam.sh", as followed.
token=$(<gdc-user-token.txt)
curl --header "X-Auth-Token: $token" 'https://api.gdc.cancer.gov/data/21ebc131-0bdc-4dcb-900f-6da23466ad6a' --output BAMs/TCGA-EJ-A65D-01A-11R-A30B-07.bam && samtools index BAMs/TCGA-EJ-A65D-01A-11R-A30B-07.bam BAMs/TCGA-EJ-A65D-01A-11R-A30B-07.bai
HLA class I four-digit types of 495 out of 496 TCGA PRAD samples were obtained from (Thorsson et al., 2018) (https://gdc.cancer.gov/about-data/publications/panimmune).
token=$(<gdc-user-token.txt)
curl --header "X-Auth-Token: $token" 'https://api.gdc.cancer.gov/data/2cf05dd5-9653-497a-8c7e-45ba0d1d237a' --output OptiTypeCallsHLA_20171207.tsv
For the last remaining sample used in this study, ScanNeo was employed for HLA class I typing.
ScanNeo_utils.py TCGA-M7-A725-01A-12R-A32O-07.bam
HLA class I four-digit type for this sample will be "HLA-A*31:01,HLA-C*17:01,HLA-B*40:01,HLA-A*30:04,HLA-C*03:04,HLA-B*41:01".
-
Install ScanExitron dependencies
- Install RegTools v0.5.0
conda install -c bioconda regtools=0.5.0
- Install other dependent packages via conda.
conda install -c bioconda samtools betools pyfaidx bedops
-
Install ScanExitron by running the following code:
git clone https://github.com/ylab-hi/ScanExitron.git
Add the directory of ScanExitron to the $PATH environment variable.
-
Install ScanNeo dependencies
- Install transIndel v2.0
$ git clone https://github.com/cauyrd/transIndel
Add the directory of transIndel_build_RNA.py and transIndel.py to the $PATH environment variable.
-
Install other dependent packages via conda.
conda install -c bioconda optitype ensembl-vep sambamba bedtools picard bwa yara razers3 pyfaidx pyvcf
conda install -c conda-forge coincbc
conda install -c anaconda hdf5
- configure VEP, optitype and yara index according to the ScanNeo manual.
-
Install ScanNeo by running the following code:
$ git clone https://github.com/ylab-hi/ScanNeo.git
Add the directory of ScanNeo to the $PATH environment variable.
ScanExitron utilized annotated coding sequence (CDS) to probe the exitrons, and it also extracted splice sites using the reference genome sequences. The human reference genome sequences and gene annotation will be used for ScanExitron.
ScanNeo utilized reference genome sequences to update the reference and alternate allele sequences to each exitron splicing event.
For hg38 reference genome, you can run the following commands for preparation.
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_21/gencode.v21.annotation.gtf.gz
gunzip gencode.v21.annotation.gtf.gz
grep -P "\tCDS\t" gencode.v21.annotation.gtf | gtf2bed > gencode.hg38.CDS.bed
For hg19 reference genome, you can run the following commands for preparation.
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz
gunzip hg19.fa.gz
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
gunzip gencode.v19.annotation.gtf.gz
grep -P "\tCDS\t" gencode.v19.annotation.gtf | gtf2bed > gencode.hg19.CDS.bed
- Run ScanExitron with the following command
ScanExitron.py -i example.bam -s 0 --ao 3 --pso 0.05 -m 50 -r hg38
Note: In practice, the different parameter settings will result in the different number of exitrons identified. For example, if you set a higher alternate allele observation (AO) and percent spliced out (PSO), you will get a smaller number of exitrons. Additional details for running ScanExitron and updates to the parameters can be found at the project GitHub repository.
Multiple files will be generated in this step, including
- “example.hq.bam”
- “example.hq.bam.bai”
- “example.hq.janno”
- “example.exitron”
"example.hq.bam" is the mapping quality filtered (MAPQ>50) BAM file. "example.hq.janno" is annotated junction file. For a more detailed explanation, please refer to RegTools. Researchers may use them for other purpose of interest, so we kept them.
The identified exitrons are stored in the example.exitron file. In order to feed the ScanExitron results to ScanNeo, output files of ScanExitron are required to be converted to VCF format using the utility script named exitron2vcf.py contained in the ScanExitron utils folder with the following command:
exitron2vcf.py -i example.exitron -o example.vcf
The output VCF file is available at ScanExiton Github repository
- ScanNeo first added corresponding reference and alternate allele sequences to each EIS event. Next, these events were annotated with variant effect predictor (VEP) McLaren et al., 2016. Run this annotation step of ScanNeo using the following command.
ScanNeo.py anno -i example.vcf -o example.vep.vcf
- Neoantigen prediction step of ScanNeo used VEP annotated VCF file as input to predict neoantigens using the following command.
ScanNeo.py hla -i example.vep.vcf --alleles HLA-A*68:02,HLA-A*23:01,HLA-B*07:02,HLA-B*53:01,HLA-C*07:02,HLA-C*04:01 -t 16 --af PSO -e 9 -p /path/to/iedb/ -o example.tsv
The putative exitron-derived neoantigens are stored in the example.tsv file.
-
In the Data Preparation step, you already downloaded data in the BAMs folder. You can run gen_EIS_cmd.py to generate the commands for exitron calling with ScanExitron. You'd better run the commands in a queue system such as Slurm.
-
According to the metainfo in PRAD.info.txt, you can separate the exitrons identified into two groups (tumors and normals), and put them into these two folders(tumors and normals).
-
In the normals directory, we then used normal_sample_count.py to obtain the number of normal samples for every exitron identified in normal samples. The summarized results are available in normal.stats.txt.
-
We can begin to call tumor-specific exitrons(TSEs), by excluding the EIS events in tumor samples that were also found in more than three normal samples. This action has been achieved using gen_TSEs.py. In order to feed the TSEs from ScanExitron to ScanNeo, output files of ScanExitron are required to be converted to VCF format using the utility script named exitron2vcf.py. The directory of exitron2vcf.py should be in the $PATH environment variable.
- Create a new folder named VCFs, and run the exitron2vcf.sh which was generated by gen_exitron2vcf_cmd.py.
- In the VCFs folder, run gen_anno.py to generate the shell script for ScanNeo VCF annotation step. Run this shell script to generate a list of VEP annotated VCF files for TSEs identified (*.vep.vcf).
- Create a new folder named neoantigens. You can run gen_neoantigen.py to generate the shell script for ScanNeo neoantigen prediction step. Run this shell script to generate exitron-derived neoantigens for PRAD cohort.
Note: We didn't provide PRAD.info.with.HLA.txt used in gen_neoantigen.py here. Because it contains the HLA class I types for the PRAD cohort, which is a controlled dataset by NCI. You can follow the instructions in the Data preparation section to obtain it. Or if you do have difficulty in obtaining the four-digit HLA type data, you may contact the authors for another option.
Thank you for using this tutorial!
If you have any questions, please do not hesitate to contact me through GitHub or at my email: