ScanNeo

🔥NEW🔥: ScanNeo supports both human and mouse genome.

Introduction

A pipeline for identifying indel derived neoantigens using RNA-Seq data

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

You need Python 3.7 or above to run ScanNeo.

install necessary python packages via anaconda

Install anaconda (python 3.7) firstly.

Make sure you have added Anaconda bin directory to the $PATH environment variable in the file ~/.bashrc. Otherwise, please add it the the $PATH by editing .bashrc file, manually.

export PATH="/home/usr/Python3/bin:$PATH"

You can print the current value of the PATH variable by echoing the PATH environment variable:

echo $PATH

Then, install dependent packages via conda in bioconda channel.

conda install -c bioconda optitype # this step will automatically install razers3, pyomo and glpk
conda install -c bioconda ensembl-vep
conda install -c bioconda sambamba
conda install -c bioconda bedtools
conda install -c bioconda picard
conda install -c bioconda bwa
conda install -c bioconda yara
conda install -c bioconda razers3
conda install -c bioconda tabix
conda install -c conda-forge coincbc
conda install -c conda-forge pyomo
conda install -c bioconda pyfaidx
conda install -c conda-forge pyvcf
conda install -c anaconda hdf5

The installed executables will be available when you type their names in the Linux Shell, such as vep_install, sambamba, bwa and picard.

install transIndel

git clone https://github.com/cauyrd/transIndel

Place transIndel_build_RNA.py and transIndel_call.py in your folder in PATH.

intall IEDB binding prediction tools

Download the archives for HLA class I and unpack them

wget -c https://downloads.iedb.org/tools/mhci/3.1/IEDB_MHC_I-3.1.tar.gz
tar -zxvf IEDB_MHC_I-3.1.tar.gz
cd mhc_i
./configure

ScanNeo is only compatible with IEDB 3.1 and above now.

tcsh and gawk are needed for IEDB binding prediction tools. You have to install them if they are not available.

Install them on Debian/Ubuntu/Mint Linux. $ sudo apt-get install tcsh gawk

Install them on CentOS/RHEL. # yum install tcsh gawk

Or install them via conda conda install -c conda-forge tcsh gawk

install VEP annotations and reference fasta

For Human genome

vep_install -a cf -s homo_sapiens -y GRCh38 --CONVERT # hg38
vep_install -a cf -s homo_sapiens -y GRCh37 --CONVERT # hg19

For Mouse genome

vep_install -a cf -s mus_musculus -y GRCm39 --CONVERT # mm39
vep_install -a cf -s mus_musculus -y GRCm38 --CONVERT # mm10

VEP annotations will be installed at ~/.vep be default.

install VEP plugins

git clone https://github.com/ylab-hi/ScanNeo.git
cd VEP_plugins
cp Downstream.pm ~/.vep/Plugins
cp Wildtype.pm ~/.vep/Plugins

Configuration

configure Optitype

Make modification on OptiTypePipeline.py Change from

this_dir = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(os.path.realpath(__file__))))

to

this_dir = os.path.dirname(os.path.realpath(__file__))

configure config.ini of Optitype

[mapping]

# Absolute path to RazerS3 binary, and number of threads to use for mapping
razers3=/path/to/anaconda3/bin/razers3
threads=16

[ilp]
# A Pyomo-supported ILP solver. The solver must be globally accessible in the
# environment OptiType is run, so make sure to include it in PATH.
# Note: this is NOT a path to the solver binary, but a keyword argument for
# Pyomo. Examples: glpk, cplex, cbc.
# Optitype uses glpk by default, but we recommand use cbc instead

solver=cbc
threads=1

configure yara

Index the HLA reference genome hla_reference_rna.fasta from Optitype path/to/anaconda3/bin/data/hla_reference_rna.fasta

yara_indexer hla_reference_rna.fasta -o hla.index

By executing this command, it will generate 12 files, namely,

hla.index.lf.drp
hla.index.lf.drv
hla.index.rid.concat
hla.index.sa.ind
hla.index.sa.val
hla.index.txt.limits
hla.index.lf.drs
hla.index.lf.pst
hla.index.rid.limits
hla.index.sa.len
hla.index.txt.concat
hla.index.txt.size

configure config.ini file

[fasta]

# reference genome file in FASTA format

hg38=/path/to/hg38.fa
hg19=/path/to/hg19.fa
mm39=/path/to/mm39.fa
mm10=/path/to/mm10.fa

[annotation]

# gene annotation file in GTF format

hg38=/path/to/gencode.v21.annotation.gtf
hg19=/path/to/gencode.v19.annotation.gtf
mm39=/path/to/gencode.vM27.annotation.gtf
mm10=/path/to/gencode.vM23.annotation.gtf


[yara]

# yara index for hla_reference_rna.fasta

index=/path/to/hla.index

# the number of threads to use
threads=8

Usage

STEP 1: INDEL calling using RNA-seq data

ScanNeo.py indel -i rnaseq_bam -r hg38

Options:

-h, --help            show this help message and exit
-i INPUT, --input INPUT
                        RNA-seq alignment file (BAM)
--mapq                  Remove reads with MAPQ = 0
-r {hg19,hg38,mm39,mm10}, --ref {hg19,hg38,mm39,mm10}
                        reference genome (default: hg38)

Input:

input_bam_file      :input RNA-seq BAM file. (e.g., rna-seq.bam)
reference_genome    :specify reference genome (hg19 or hg38)

Output:


your_output_bam_file		:BAM file for CIGAR string redefinement. (rna-seq.indel.bam)
vcf_file			:Reported Indels with VCF format. (rna-seq.indel.vcf)

STEP 2: indels annotation using VEP

ScanNeo.py anno -i input_vcf_file -o output_annotated_vcf_file [options]

Options:

-h, --help            show this help message and exit
-i INPUT, --input INPUT
                        input VCF file
-c CUTOFF, --cutoff CUTOFF
                        MAF cutoff default: 0.01
-s, --slippage          Keep putative PCR slippage derived indels
-d DIR, --dir DIR     Specify the VEP cache/plugin directory to use.
                        (default: $HOME/.vep/)
-r {hg19,hg38,mm39,mm10}, --ref {hg19,hg38,mm39,mm10}
                        reference genome (default: hg38)
-o OUTPUT, --output OUTPUT
                        output annotated and filtered vcf file (default:
                        output.vcf)

Input:

input_vcf_file   	        :input VCF file is produced by ScanNeo indel
cutoff   			:MAF cutoff according to 1000 genome project and gnomAD project
reference_genome                :specify reference genome (hg19 or hg38)

Output:

output_annotated_vcf_file   			:VEP annotated Indels with VCF format

STEP 3: neoantigen prediction

ScanNeo.py hla -i vep.vcf --alleles HLA-A*02:01,HLA-B*08:01,HLA-C*03:03 -e 8,9 -o output.tsv [options] # for human
ScanNeo.py hla -i vep.vcf --alleles H-2-Dd,H-2-Kk,H-2-Ld -e 8,9 -o output.tsv [options] # for mouse
ScanNeo.py hla -i vep.vcf -b RNA_seq.bam -e 8,9 -o output.tsv [options]

Options:

-h, --help            show this help message and exit
-i VCF, --input VCF   VEP annotated and filtered VCF
--alleles ALLELES     Name of the allele to use for epitope prediction.
                        Multiple alleles can be specified using a comma-
                        separated listinput HLA class I alleles
-b BAM, --bam BAM     Input RNA-Seq BAM file if you don't know sample HLA class I alleles. Works for human only!
-l LENGTH, --length LENGTH
                         Length of the peptide sequence to use when creating
                         the FASTA (default: 21)
--binding BINDING_THRESHOLD
                         binding threshold ic50 (default: 500 nM)
--af AF_FIELD            The field name for allele frequency in VCF (default: AB)
-e EPITOPE_LENGTHS, --epitope-length EPITOPE_LENGTHS
                         Length of subpeptides (neoepitopes) to predict.
                         Multiple epitope lengths can be specified using a
                         comma-separated list. Typical epitope lengths vary
                         between 8-11. (default: 8,9,10,11)
-p PATH_TO_IEDB, --path-to-iedb PATH_TO_IEDB
                         Directory that contains the local installation of IEDB

-m {lowest,median}, --metric {lowest,median}
                        The ic50 scoring metric to use when filtering epitopes
                        by binding-threshold lowest: Best MT Score - lowest MT
                        ic50 binding score of all chosen prediction methods.
                        median: Median MT Score - median MT ic50 binding score
                        of all chosen prediction methods. (default: lowest)
-o OUTPUT, --output OUTPUT
                       output text file name, name.tsv

Output:

name.tsv file contains neoantigen prediction results

Report Columns

Column Name	Description
Chromosome	The chromosome of this variant
Start	The start position of this variant in the zero-based, half-open coordinate system
Stop	The stop position of this variant in the zero-based, half-open coordinate system
Reference	The reference allele
Variant	The alternate allele
Transcript	The Ensembl ID of the affected transcript
Ensembl Gene ID	The Ensembl ID of the affected gene
Variant Type	The type of variant. missense for missense mutations, inframe_ins for inframe insertions, inframe_del for inframe deletions, and FS for frameshift variants
Mutation	The amnio acid change of this mutation
Protein Position	The protein position of the mutation
Gene Name	The Ensembl gene name of the affected gene
HLA Allele	The HLA allele for this prediction
Peptide Length	The peptide length of the epitope
Sub-peptide Position	The one-based position of the epitope in the protein sequence used to make the prediction
Mutation Position	The one-based position of the start of the mutation in the epitope. 0 if the start of the mutation is before the epitope
MT Epitope Seq	Mutant epitope sequence
WT Epitope Seq	Wildtype (reference) epitope sequence at the same position in the full protein sequence. NA if there is no wildtype sequence at this position or if more than half of the amino acids of the mutant epitope are mutated
Best MT Score Method	Prediction algorithm with the lowest mutant ic50 binding affinity for this epitope
Best MT Score	Lowest ic50 binding affinity of all prediction algorithms used
Corresponding WT Score	ic50 binding affinity of the wildtype epitope. NA if there is no WT Epitope Seq.
Corresponding Fold Change	Corresponding WT Score / Best MT Score. NA if there is no WT Epitope Seq.
Median MT Score	Median ic50 binding affinity of the mutant epitope of all prediction algorithms used
Median WT Score	Median ic50 binding affinity of the wildtype epitope of all prediction algorithms used. NA if there is no WT Epitope Seq.
Median Fold Change	Median WT Score / Median MT Score. NA if there is no WT Epitope Seq.
Individual Prediction Algorithm WT and MT Scores (multiple)	ic50 scores for the MT Epitope Seq and WT Eptiope Seq for the individual prediction algorithms used
Ranking Score	A useful metric for neoantigen prioritization. A high score suggests a high priority.
Gene Ranking Score	The median value of ranking scores for neoantigens from the same gene. A high score suggests a high priority

Recipe

HLA class I typing using ScanNeo_utils.py

ScanNeo_utils.py RNA_seq.bam

This command will output the input BAM file name and the inferred HLA class I types in the screen.

RNA_seq.bam  HLA-A*01:01,HLA-A*31:01,HLA-B*37:01,HLA-B*51:01,HLA-C*06:02,HLA-C*15:02

Generate an annotated fasta file from a VCF with protein sequences of mutations and matching wildtypes

ScanNeo.py fasta -i input.vcf -l 21 -o output.fasta

Options:

-h, --help            show this help message and exit
-i VCF, --input VCF   A VEP-annotated single-sample VCF containing transcript, Wildtype protein sequence, and Downstream protein sequence information
-l LENGTH, --length LENGTH
                        Length of the peptide sequence to use when creating the FASTA (default: 21)
-d DOWNSTREAM_LENGTH, --downstream_length DOWNSTREAM_LENGTH
                        Cap to limit the downstream sequence length for frameshifts when creating the fasta file.Use 'full' to include the full downstream
                        sequence. (Default: 1000)
-o OUTPUT, --output OUTPUT
                        output fasta file name (default: output.fasta)

Output:

output.fasta file contains predicted protein fasta sequences (including Wildtype and Mutant sequences)

Name		Name	Last commit message	Last commit date
Latest commit History 107 Commits
VEP_plugins		VEP_plugins
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.travis.yml		.travis.yml
LICENSE		LICENSE
README.md		README.md
ScanNeo.py		ScanNeo.py
ScanNeo_utils.py		ScanNeo_utils.py
config.ini		config.ini
config.ini.example		config.ini.example

License

ylab-hi/ScanNeo

Folders and files

Latest commit

History

Repository files navigation

ScanNeo

Introduction

Getting Started

Prerequisites

install necessary python packages via anaconda

install transIndel

intall IEDB binding prediction tools

install VEP annotations and reference fasta

install VEP plugins

Configuration

configure Optitype

configure config.ini of Optitype

configure yara

configure config.ini file

Usage

STEP 1: INDEL calling using RNA-seq data

Options:

Input:

Output:

STEP 2: indels annotation using VEP

Options:

Input:

Output:

STEP 3: neoantigen prediction

Options:

Output:

Recipe

HLA class I typing using ScanNeo_utils.py

Generate an annotated fasta file from a VCF with protein sequences of mutations and matching wildtypes

Options:

Output:

License

Contact

Citation

About

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 2

Languages

Packages