-
Notifications
You must be signed in to change notification settings - Fork 10
Interpreting the output
Once you have run seer you'll have a list of k-mers, with various association information attached, to which you will want to assign a biological interpretation.
Before going any further, run the post analysis program filter_seer or use awk. Example filtering parameters would be
./filter_seer -k significant_kmers.txt --pos_beta --maf 0.05 --sort pval > seer.filtered.txt
Visualisation in phandango
A very nice way of looking at the output is to use phandango -- particularly in combination with other representations of your data such as a phylogeny, metadata and combination blocks.
Use NCBI blastn or similar on some of your top hits to find an annotated reference which contains the significant elements. You'll need to get the reference in gff format, whereas if looking in NCBI you'll probably get a genbank and a fasta file. These can be converted using seqret:
seqret -sequence s_pyogenes_HKU488.fasta -feature -fformat gb -fopenfile s_pyogenes_HKU488.gb -osformat gff -auto
Use the provided script to convert your filtered significant kmers into fastq format
scripts/hits_to_fastq.pl -k result.txt -b 10e-8 > result.fastq
There are plenty of mappers you can use to find the position of the kmers:
- For partial matches blast or blat work well
- For exact matches bwa fastmap is best
- In general I find bowtie2 leads to the most useful interpretation
Example bowtie2 settings which have achieved good sensitivity
bowtie2-build S_pyogenes_NGAS322.fa S_pyogenes_NGAS322
bowtie2 -q -U result.fastq --ignore-quals -D 24 -R 3 -N 0 -L 7 -i S,1,0.50 -x S_pyogenes_NGAS322 -S NGAS322.sam
Which will create NGAS322.sam
If you have a lot of secondary mappings, you may wish to report them by adding the -a or -k option to bowtie2
The script mapping_to_phandango.pl will do this
scripts/mapping_to_phandango.pl -s NGAS322.sam -k seer.filtered.txt -m 20 > phandango.plot
Change 'NGAS322.sam' and 'seer.filtered.txt' to the correct files in your case. You'll then be able to drag and drop phandango.plot, the reference .gff and other data to see a Manhattan plot against the genome.
Run map_back to find the position of significant kmers, in the assemblies they are from
./map_back -k seer.filtered.k31.txt -r references.txt --threads 4
Run blat or blast on the significant kmers
cut -f 1 seer.filtered.k31.txt | awk '{print ">"NR"\n"$1}' > seer.sorted.k31.fa
blat -noHead -stepSize=2 -minScore=15 reference.fa seer.sorted.k31.fa seer.k31.blat.psl
blast -task blastn-short -subject reference.fa -query seer.sorted.k31.fa -outfmt 6 seer.k31.blast.txt
Reduce this to the top hits
./scripts/blast_top_hits.pl --input seer.k31.blat.psl --mode blat --full_query > top_hits.txt
If you have bedops and bedtools you can annotate the position of these hits, for example
psl2bed < seer.k31.blat.psl > seer.k31.blat.bed
gff2bed < reference.gff > reference.bed
bedtools intersect -a seer.k31.blat.bed -b reference.bed -wb
bedtools intersect -a reference.bed -b seer.k31.blat.bed -c | sort -n -r -t $'\t' -k11,11
If you have a lot of samples, you can try and fine-map the causal variant. Call SNPs from your mapping (the JScandy section describes how to obtain this) with samtools
samtools sort -O sam -T tmp NGAS322.sam | samtools view -h -F 0x4 -b - > NGAS322.bam
samtools mpileup -Q 0 -ugf S_pyogenes_NGAS322.fa NGAS322.bam | bcftools call -P 0 -vmO z > NGAS322.vcf.gz
Filter your SNPs. QUAL > 100 is a good start. You can annotate them with SnpEff, and predict protein function with SIFT.
java -Xmx200m -jar snpEff.jar -c snpEff.config -v S_pyogenes_NGAS322 NGAS322.vcf.gz > NGAS322.vcf