1. Reference preparation

In this walk-through, to keep datafiles small, we use analysis of data for one chromosome (mouse chromosome 10).

Files needed at the very beginning of the analysis:

Reference genome

For example, GRCm38_68.fa.

wget ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa

For a test sample please find data/ref/GRCm38_68.chromosome10.fa.gz in current repository.

Either:

One/Two vcf files for maternal and paternal inbred lines (should be compatible with the reference genome)
Joint vcf file for multiple species, where two lines are presented (should be compatible with reference genome)
Individual vcf file or joint individuals vcf file (should be compatible with reference genome) (see correponding section about reference preparation)

For example, mgp.v5.merged.snps_all.dbSNP142.vcf.gz for multiple mice lines.

wget ftp://ftp-mouse.sanger.ac.uk/current_snps/mgp.v5.merged.snps_all.dbSNP142.vcf.gz

For a test sample please find data/ref/mgp.v5.merged.snps_all.dbSNP142.chromosome10.strains_subsample.snps_subsample.vcf.gz in current repository.

Note: you can do vcf calling by using mutect, varscan, etc. if you have no pre-existing vcf

(optional) Either:

gtf annotation for the reference genome
bed file with selected regions These are needed for vcf filtering.

For example, Mus_musculus.GRCm38.68.gtf.gz for corresponding reference genome version.

wget ftp://ftp.ensembl.org/pub/release-68/gtf/mus_musculus/Mus_musculus.GRCm38.68.gtf.gz

For a test sample please find data/ref/Mus_musculus.GRCm38.68.chromosome10.gtf.gz in current repository.

Preparing files for using as reference:

For reference genome fasta should be created the index (samtools faidx) and the dictionary (picard CreateSequenceDictionary, see gatkforums for more information).
vcf-files should be compressed (bgzip from htslib) and indexed (tabix from htslib).

Reference preparation:

One script to rule them all (don't forget to insert real paths to corresponding files and directories):

Note: this procedure may take hours on the real data. (For our example data it will take few minutes only)

python3 python3/prepare_reference.py --PSEUDOREF True --HETVCF True \
  --pseudoref_dir /full/path/to/dir/for/pseudo/ref/out/ \
  --vcf_dir /full/path/todir/for/vcf/outputs/ \
  --ref /full/path/to/ref/genome/GRCm38_68.chromosome10.fa \
  --name_mat 129S1_SvImJ --name_pat CAST_EiJ \
  --vcf_joint /full/path/to/vcf/mgp.v5.merged.snps_all.dbSNP142.chromosome10.strains_subsample.snps_subsample.vcf.gz \
  --gtf /full/path/to/gtf/Mus_musculus.GRCm38.68.chromosome10.gtf

For help:

python3 python3/prepare_reference.py --help

Pseudoreference fasta creation:

--PSEUDOREF True

Input:

	Inbred lines	Inbred lines	Individual	Individual
	Joint lines vcf	Separate line(s) vcf	Joint individuals vcf	Separate individual vcf
FASTA Reference genome	--ref	--ref	--ref	--ref
VCF Variant file(s)[1]	--vcf_joint	--vcf_mat, --vcf_pat	--vcf_joind	--vcf_ind
Name(s)	--name_mat, --name_pat	--name_mat, --name_pat	--name_ind	--name_ind
FASTA Output directory	--pseudo_dir	--pseudo_dir	--pseudo_dir	--pseudo_dir
VCF Output directory	--vcf_dir	--vcf_dir	--vcf_dir	--vcf_dir

[1] If one or the alleles in case of inbred lines is reference, then everything should be provided as mat or pat only, consistently.

Output:
- Pseudoreference genome fastas with own directories.
- Supporting vcf or bed files (if needed).

Heterozygous(parental) VCF creation:

--HETVCF True

Input:

	Inbred lines	Inbred lines	Individual	Individual
	Joint lines vcf	Separate line(s) vcf	Joint individuals vcf	Separate individual vcf
FASTA Reference genome	--ref	--ref	--ref	--ref
nothing or GTF or BED Selected regions annotation [2]	--gtf or --bed	--gtf or --bed	--gtf or --bed	--gtf or --bed
VCF Variant file(s)[1]	--vcf_joint	--vcf_mat, --vcf_pat	--vcf_joind	--vcf_ind
Name(s)	--name_mat, --name_pat	--name_mat, --name_pat	--name_ind	--name_ind
VCF Output directory	--vcf_dir	--vcf_dir	--vcf_dir	--vcf_dir

[1] If one or the alleles in case of inbred lines is reference, then everything should be provided as mat or pat only, consistently. [2] Bed will be considered as main; if gtf provided, automatically considered regions=exons and groups=genes; bed file should have 4 columns and prepared in advance: contig, start position, end position, group ID (no colnames).

Output:
- VCF with heterozygous positions, one allele as reference and the second as alternative.
- Support vcf files (if needed).

Additional tables with regions and variants of interest, needed in further analysis:

bed file (with four columns: contig, start and end positions, group ID; no column names) with selected regions, for example, exon regions grouped by genes:

awk '$3=="exon" && ($1 ~ /^[1-9]/ || $1 == "X" || $1 == "Y")' /full/path/to/Mus_musculus.GRCm38.68.chromosome10.gtf | cut -f1,4,5,9 | awk -F $'\t' 'BEGIN {OFS = FS} {split($4, a, "gene_id \""); split(a[2], b, "\""); print $1, $2-1, $3, b[1]}' > /full/path/to/output/Mus_musculus.GRCm38.68.chromosome10.EXONS.bed

Note that you may use any method to create such a table which is convenient to you, here is provided only one of them.

snp table, which can be obtained, for regions of interest, from vcf with heterozigous positions created above, as 5 first columns (in our case it is relevant to look only at exon snps):

grep "^#CHROM" /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.exons.vcf | cut -f1-5 > /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.snp_table.txt
grep -v "^#" /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.exons.vcf | sort -V | cut -f1-5 >> /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.snp_table.txt

Next

Provide feedback

Saved searches

Use saved searches to filter your results more quickly