-
Notifications
You must be signed in to change notification settings - Fork 5
1. Reference preparation
In this walk-through, to keep datafiles small, we use analysis of data for one chromosome (mouse chromosome 10).
- Reference genome
For example,
GRCm38_68.fa
.
wget ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa
For a test sample please find data/ref/GRCm38_68.chromosome10.fa.gz
in current repository.
- Either:
- One/Two vcf files for maternal and paternal inbred lines (should be compatible with the reference genome)
- Joint vcf file for multiple species, where two lines are presented (should be compatible with reference genome)
- Individual vcf file or joint individuals vcf file (should be compatible with reference genome) (see correponding section about reference preparation)
For example,
mgp.v5.merged.snps_all.dbSNP142.vcf.gz
for multiple mice lines.
wget ftp://ftp-mouse.sanger.ac.uk/current_snps/mgp.v5.merged.snps_all.dbSNP142.vcf.gz
For a test sample please find data/ref/mgp.v5.merged.snps_all.dbSNP142.chromosome10.strains_subsample.snps_subsample.vcf.gz
in current repository.
Note: you can do vcf calling by using mutect, varscan, etc. if you have no pre-existing vcf
- (optional) Either:
- gtf annotation for the reference genome
- bed file with selected regions These are needed for vcf filtering.
For example,
Mus_musculus.GRCm38.68.gtf.gz
for corresponding reference genome version.
wget ftp://ftp.ensembl.org/pub/release-68/gtf/mus_musculus/Mus_musculus.GRCm38.68.gtf.gz
For a test sample please find data/ref/Mus_musculus.GRCm38.68.chromosome10.gtf.gz
in current repository.
-
For reference genome fasta should be created the index (
samtools faidx
) and the dictionary (picard CreateSequenceDictionary
, see gatkforums for more information). -
vcf-files should be compressed (
bgzip
fromhtslib
) and indexed (tabix
fromhtslib
).
One script to rule them all (don't forget to insert real paths to corresponding files and directories):
Note: this procedure may take hours on the real data. (For our example data it will take few minutes only)
python3 python3/prepare_reference.py --PSEUDOREF True --HETVCF True \
--pseudoref_dir /full/path/to/dir/for/pseudo/ref/out/ \
--vcf_dir /full/path/todir/for/vcf/outputs/ \
--ref /full/path/to/ref/genome/GRCm38_68.chromosome10.fa \
--name_mat 129S1_SvImJ --name_pat CAST_EiJ \
--vcf_joint /full/path/to/vcf/mgp.v5.merged.snps_all.dbSNP142.chromosome10.strains_subsample.snps_subsample.vcf.gz \
--gtf /full/path/to/gtf/Mus_musculus.GRCm38.68.chromosome10.gtf
For help:
python3 python3/prepare_reference.py --help
--PSEUDOREF True
- Input:
Inbred lines | Inbred lines | Individual | Individual | |
---|---|---|---|---|
Joint lines vcf | Separate line(s) vcf | Joint individuals vcf | Separate individual vcf | |
FASTA Reference genome | --ref | --ref | --ref | --ref |
VCF Variant file(s)[1] | --vcf_joint | --vcf_mat, --vcf_pat | --vcf_joind | --vcf_ind |
Name(s) | --name_mat, --name_pat | --name_mat, --name_pat | --name_ind | --name_ind |
FASTA Output directory | --pseudo_dir | --pseudo_dir | --pseudo_dir | --pseudo_dir |
VCF Output directory | --vcf_dir | --vcf_dir | --vcf_dir | --vcf_dir |
[1] If one or the alleles in case of inbred lines is reference, then everything should be provided as mat or pat only, consistently.
- Output:
- Pseudoreference genome fastas with own directories.
- Supporting vcf or bed files (if needed).
--HETVCF True
- Input:
Inbred lines | Inbred lines | Individual | Individual | |
---|---|---|---|---|
Joint lines vcf | Separate line(s) vcf | Joint individuals vcf | Separate individual vcf | |
FASTA Reference genome | --ref | --ref | --ref | --ref |
nothing or GTF or BED Selected regions annotation [2] | --gtf or --bed | --gtf or --bed | --gtf or --bed | --gtf or --bed |
VCF Variant file(s)[1] | --vcf_joint | --vcf_mat, --vcf_pat | --vcf_joind | --vcf_ind |
Name(s) | --name_mat, --name_pat | --name_mat, --name_pat | --name_ind | --name_ind |
VCF Output directory | --vcf_dir | --vcf_dir | --vcf_dir | --vcf_dir |
[1] If one or the alleles in case of inbred lines is reference, then everything should be provided as mat or pat only, consistently. [2] Bed will be considered as main; if gtf provided, automatically considered regions=exons and groups=genes; bed file should have 4 columns and prepared in advance: contig, start position, end position, group ID (no colnames).
- Output:
- VCF with heterozygous positions, one allele as reference and the second as alternative.
- Support vcf files (if needed).
- bed file (with four columns: contig, start and end positions, group ID; no column names) with selected regions, for example, exon regions grouped by genes:
awk '$3=="exon" && ($1 ~ /^[1-9]/ || $1 == "X" || $1 == "Y")' /full/path/to/Mus_musculus.GRCm38.68.chromosome10.gtf | cut -f1,4,5,9 | awk -F $'\t' 'BEGIN {OFS = FS} {split($4, a, "gene_id \""); split(a[2], b, "\""); print $1, $2-1, $3, b[1]}' > /full/path/to/output/Mus_musculus.GRCm38.68.chromosome10.EXONS.bed
Note that you may use any method to create such a table which is convenient to you, here is provided only one of them.
- snp table, which can be obtained, for regions of interest, from vcf with heterozigous positions created above, as 5 first columns (in our case it is relevant to look only at exon snps):
grep "^#CHROM" /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.exons.vcf | cut -f1-5 > /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.snp_table.txt
grep -v "^#" /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.exons.vcf | sort -V | cut -f1-5 >> /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.snp_table.txt