Tutorial: Design sgRNAs for allele specific excision of the gene MFN2 in the WTC genome
Here are the instructions for applying AlleleAnalyzer to a genome to identify allele-specific CRISPR sites and design allele-specific sgRNAs. This tutorial is a "simplest-case" scenario; for more complex features, please look through the rest of the wiki and the descriptions accompanying each of the tools.
In order to use the tools described in this tutorial, you will need to have cloned this repo. For more information on cloning a repo, see this page.
Clone the repo with the following command in your terminal:
git clone https://github.com/keoughkath/AlleleAnalyzer.git
Next, make sure you have all of the tools required to run AlleleAnalyzer by consulting this wiki page and checking the requirements.txt file.
The first data you will need is the VCF or BCF file for your individual; these are files that contain information about genetic variants in an individual often found via sequencing. You can find more information on this file format here. For this tutorial, we've made the phased VCF for iPSC line WTC, generated by the Conklin Lab at the Gladstone Institutes, available here. More information about this line may be found here.
Start by making a directory at the same level as the AlleleAnalyzer directory. For instance, if the ls command shows the AlleleAnalyzer directory in your current directory, make a new directory for this tutorial (mkdir tutorial_directory) and move into it (cd tutorial_directory); you can copy from the code block below. Next, download the files named wtc_phased_hg19.bcf and wtc_phased_hg19.bcf.csi to the directory from which you're completing this tutorial:
mkdir tutorial_directory
cd tutorial_directory
curl https://lighthouse.ucsf.edu/public_files_no_password/excisionFinderData_public/gRNA_tutorial_sample_data/sample_input/wtc_phased_hg19.bcf -o wtc_phased_hg19.bcf
curl https://lighthouse.ucsf.edu/public_files_no_password/excisionFinderData_public/gRNA_tutorial_sample_data/sample_input/wtc_phased_hg19.bcf.csi -o wtc_phased_hg19.bcf.csi
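A failed download can leave an HTML error page saved under the expected filename. As a quick sanity check, the sketch below tests whether each file begins with the gzip magic bytes, which BGZF-compressed BCF files and their .csi indexes share. The helper name looks_bgzf_compressed is ours for illustration and is not part of AlleleAnalyzer, and the check assumes the BCF was written with BGZF compression.

```python
from pathlib import Path

# First two bytes of any gzip (and therefore BGZF) stream.
GZIP_MAGIC = b"\x1f\x8b"


def looks_bgzf_compressed(path):
    """Return True if the file exists and begins with the gzip magic bytes.

    Hypothetical helper for this tutorial; it does not validate the
    full BCF or CSI structure, only the leading two bytes.
    """
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as handle:
        return handle.read(2) == GZIP_MAGIC


if __name__ == "__main__":
    for name in ("wtc_phased_hg19.bcf", "wtc_phased_hg19.bcf.csi"):
        status = "ok" if looks_bgzf_compressed(name) else "missing or not compressed"
        print(f"{name}: {status}")
```

If either file is reported as missing or not compressed, re-running the corresponding curl command is the first thing to try.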
In this tutorial we will analyze the gene MFN2, a dominant negative disease gene that causes Charcot-Marie-Tooth disease. The locus for this gene in the reference genome GRCh37 is 1:12040238-12073572. This indicates that the gene is located on chromosome 1, starts at genomic coordinate 12040238, and ends at genomic coordinate 12073572.
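To make the locus notation concrete, here is a minimal sketch that splits a chrom:start-end string into its parts. The Locus type and parse_locus function are illustrative names of our own, not functions provided by AlleleAnalyzer.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    """A genomic interval written as chrom:start-end."""
    chrom: str
    start: int
    end: int


def parse_locus(locus: str) -> Locus:
    """Split a string such as '1:12040238-12073572' into its three parts."""
    chrom, _, span = locus.partition(":")
    start_text, _, end_text = span.partition("-")
    if not chrom or not start_text or not end_text:
        raise ValueError(f"expected chrom:start-end, got {locus!r}")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"start is greater than end in {locus!r}")
    return Locus(chrom, start, end)


mfn2 = parse_locus("1:12040238-12073572")
print(mfn2.chrom, mfn2.start, mfn2.end)
```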
In this step we're grabbing some information about variants in MFN2 in WTC including variant genomic location, reference allele and alternate allele.
Copy and paste:
python3 ../AlleleAnalyzer/preprocessing/generate_gens_dfs/get_gens_df.py wtc_phased_hg19.bcf 1:12040238-12073572 mfn2_wtc_hg19
Here is a legend to the above command:
../AlleleAnalyzer/preprocessing/generate_gens_dfs/get_gens_df.py
: script name
wtc_phased_hg19.bcf
: BCF filename
1:12040238-12073572
: locus for MFN2
mfn2_wtc_hg19
: prefix for output file
You should see the following in your terminal if it runs correctly:
{'--bed': False,
'--chrom': False,
'-f': False,
'<locus>': '1:12040238-12073572',
'<out>': 'mfn2_wtc_hg19',
'<vcf_file>': 'wtc_phased_hg19.bcf'}
bcftools version 1.6 running
Running single locus
Lines total/split/realigned/skipped: 7/0/0/0
finished
The outputted file will be:
mfn2_wtc_hg19.h5
To check your output against ours, check out the sample output here.
This section annotates variants that make, break or are near PAM sites. This part requires that you have downloaded the pre-computed locations of PAM sites for SpCas9 analyzed by AlleleAnalyzer (available here). Note that you can generate these files yourself for any genome for which you have a fasta file using the tool preprocessing/find_pams_in_reference/pam_pos_genome.py, but for this tutorial, it's easier to use the pre-generated files. Make a new directory in your current directory titled 'hg19_pams'. Download chr1_SpCas9_pam_sites_for.npy and chr1_SpCas9_pam_sites_rev.npy to the directory 'hg19_pams':
mkdir hg19_pams
curl https://lighthouse.ucsf.edu/public_files_no_password/excisionFinderData_public/gRNA_tutorial_sample_data/sample_input/hg19_pams/chr1_SpCas9_pam_sites_for.npy -o hg19_pams/chr1_SpCas9_pam_sites_for.npy
curl https://lighthouse.ucsf.edu/public_files_no_password/excisionFinderData_public/gRNA_tutorial_sample_data/sample_input/hg19_pams/chr1_SpCas9_pam_sites_rev.npy -o hg19_pams/chr1_SpCas9_pam_sites_rev.npy
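If you want a rough sense of how many pre-computed PAM sites fall inside the MFN2 locus, the sketch below counts array entries within a coordinate range. It assumes each .npy file holds a one-dimensional array of integer PAM positions on chromosome 1; we have not confirmed the exact layout or coordinate convention, so treat the result as approximate. The function names are our own and are not part of AlleleAnalyzer.

```python
import numpy as np


def count_pams_in_locus(positions, start, end):
    """Count positions that lie within [start, end], both ends included."""
    positions = np.asarray(positions)
    in_locus = (positions >= start) & (positions <= end)
    return int(np.count_nonzero(in_locus))


def summarize_pam_file(path, start, end):
    """Load a .npy file (assumed: 1-D array of positions) and count hits."""
    return count_pams_in_locus(np.load(path), start, end)


# Toy positions standing in for a real PAM file.
toy = np.array([12040000, 12040238, 12050000, 12073572, 12080000])
print(count_pams_in_locus(toy, 12040238, 12073572))  # prints 3
```

With the downloaded files in place, summarize_pam_file("hg19_pams/chr1_SpCas9_pam_sites_for.npy", 12040238, 12073572) would give the forward-strand count under the same assumption.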
Additionally, you will need the fasta file for GRCh37 (hg19), which you can download from the UCSC genome browser. Download the file chr1.fa.gz to your current directory and gunzip it:
curl http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz -o chr1.fa.gz
gunzip chr1.fa.gz
Note: this is a bit slow (it's doing a lot of work).
Copy and paste:
python3 ../AlleleAnalyzer/preprocessing/annotate_variants/annot_variants.py mfn2_wtc_hg19.h5 SpCas9 hg19_pams/ chr1.fa mfn2_hg19_annots
Here is a legend to the above command:
../AlleleAnalyzer/preprocessing/annotate_variants/annot_variants.py
: script name
mfn2_wtc_hg19.h5
: File with explicit genotypes generated earlier
SpCas9
: Type of Cas being evaluated
hg19_pams/
: Directory containing PAM site locations for SpCas9 in hg19
chr1.fa
: hg19 chromosome 1 Fasta file
mfn2_hg19_annots
: Prefix for the output file, which holds annotations of allele-specific sgRNA sites for each variant
The outputted file will be:
mfn2_hg19_annots.h5
To check your output against ours, check out the sample output here.
This section designs all possible allele-specific guides in the gene MFN2 for WTC based on heterozygous variants.
Copy and paste:
python3 ../AlleleAnalyzer/scripts/gen_sgRNAs.py wtc_phased_hg19.bcf mfn2_hg19_annots.h5 1:12040238-12073572 hg19_pams/ chr1.fa mfn2_wtc_guides SpCas9 20
Here is a legend to the above command:
../AlleleAnalyzer/scripts/gen_sgRNAs.py
: script name
wtc_phased_hg19.bcf
: BCF genotype file
mfn2_hg19_annots.h5
: Variant annotations in this locus for generating allele-specific sgRNA sites
1:12040238-12073572
: MFN2 locus
hg19_pams/
: Directory containing PAM site locations for SpCas9 in hg19
chr1.fa
: hg19 chromosome 1 Fasta file
mfn2_wtc_guides
: Prefix for outputted guides file
SpCas9
: Type of Cas evaluated
20
: Length of sgRNA
This should output mfn2_wtc_guides.tsv. The latter four sets of sgRNAs will have one sgRNA that is all "C"s or "G"s. This indicates that the heterozygous variant that the sgRNA is designed around creates or destroys a PAM site, thereby rendering one of the alleles untargetable. The option -d will instead output these sgRNAs as "----" if desired by the user.
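To illustrate how a single variant can create or destroy a PAM, the sketch below compares a three-base window on the reference and alternate alleles against the SpCas9 NGG motif. It is a simplified illustration that looks at the forward strand only, and it is not how annot_variants.py is implemented; the function names are our own.

```python
def is_ngg_pam(triplet):
    """Return True if a 3-base string matches the SpCas9 NGG motif."""
    triplet = triplet.upper()
    return len(triplet) == 3 and triplet[0] in "ACGT" and triplet[1:] == "GG"


def pam_effect(ref_triplet, alt_triplet):
    """Describe what the alternate allele does to a PAM at this window."""
    ref_has = is_ngg_pam(ref_triplet)
    alt_has = is_ngg_pam(alt_triplet)
    if ref_has and not alt_has:
        return "breaks"  # PAM exists only on the reference allele
    if alt_has and not ref_has:
        return "makes"   # PAM exists only on the alternate allele
    return "none"        # both alleles agree, so no allele-specific PAM


print(pam_effect("TGG", "TGA"))  # breaks
print(pam_effect("TAG", "TGG"))  # makes
print(pam_effect("AGG", "CGG"))  # none
```

In the "breaks" and "makes" cases only one allele carries a PAM at that site, which is why one sgRNA of the pair has no real target.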
This section identifies pairs of allele-specific sgRNA sites that are likely to disrupt a coding exon, thus meeting our definition of "putatively targetable", and outputs their guides. This requires you to have a GFF file in your current directory that describes where the coding exons are for genes in this reference genome annotation. One place to download these types of files is RefSeq.
Sample Usage:
python3 ../AlleleAnalyzer/scripts/ExcisionFinder.py -vg genes_hg19.gff MFN2 mfn2_hg19_annots.h5 10000 SpCas9 wtc_phased_hg19.bcf wtc_targ --guides=mfn2_wtc_guides.tsv
Here is a legend to the above command:
../AlleleAnalyzer/scripts/ExcisionFinder.py
: script name
-vg
: options specifying that we want "verbose" output (i.e. the script prints out messages as it runs) and we want guides outputted for the targetable variant pairs
genes_hg19.gff
: File detailing locations of coding exons for genes, necessary for determining targetability
MFN2
: The gene we're analyzing
mfn2_hg19_annots.h5
: Variant annotations in this locus for generating allele-specific sgRNA sites
10000
: Maximum distance (in bp) for targetable variant pairs
SpCas9
: The Cas variety being analyzed
wtc_phased_hg19.bcf
: BCF filename
wtc_targ
: Prefix for output files
--guides=mfn2_wtc_guides.tsv
: All allele-specific guides available in this locus, as generated earlier
This should output 3 files: wtc_targ.h5, wtc_targgenes_evaluated.txt, and wtc_targpair_guides.tsv.
wtc_targ.h5 simply tells you whether MFN2 in WTC is targetable for allele-specific excision. wtc_targgenes_evaluated.txt is more handy when evaluating multiple genes/loci, as it is a list of all genes that had enough annotated variants and coding exons to be evaluated. wtc_targpair_guides.tsv contains the sgRNAs for the identified targetable variant pairs.
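The maximum-distance argument (10000 in the command above) limits how far apart the two variants of a pair may be. The sketch below shows only that distance filter on made-up positions; ExcisionFinder additionally requires that the pair is allele-specific and likely to disrupt a coding exon, and this is not its implementation. The function name and toy positions are our own.

```python
from itertools import combinations


def pairs_within_distance(positions, max_distance):
    """Return all position pairs (a, b), a < b, with b - a <= max_distance."""
    ordered = sorted(set(positions))
    return [
        (a, b)
        for a, b in combinations(ordered, 2)
        if b - a <= max_distance
    ]


# Hypothetical variant positions inside the MFN2 locus.
toy_positions = [12041000, 12049500, 12052000, 12070000]
print(pairs_within_distance(toy_positions, 10000))
# [(12041000, 12049500), (12049500, 12052000)]
```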
To check your output against ours, check out the sample output here.
Please send us a note with any questions or if anything in here is confusing!
AlleleAnalyzer. Keough et al. 2019, Genome Biology.