Interpreting the output

Once you have run seer you'll have a list of k-mers, with various association information attached, to which you will want to assign a biological interpretation.

Filter k-mers
Manhattan plots
Mapping to assemblies
Fine mapping

Filter significant k-mers

Before going any further, run the post analysis program filter_seer or use awk. Example filtering parameters would be

./filter_seer -k significant_kmers.txt --pos_beta --maf 0.05 --sort pval > seer.filtered.txt

Visualisation in phandango

A very nice way of looking at the output is to use phandango -- particularly in combination with other representations of your data such as a phylogeny, metadata and combination blocks.

Find a close reference

Use NCBI blastn or similar on some of your top hits to find an annotated reference which contains the significant elements. You'll need to get the reference in gff format, whereas if looking in NCBI you'll probably get a genbank and a fasta file. These can be converted using seqret:

seqret -sequence s_pyogenes_HKU488.fasta -feature -fformat gb -fopenfile s_pyogenes_HKU488.gb -osformat gff -auto

Map your kmers to the reference

Use the provided script to convert your filtered significant kmers into fastq format

scripts/hits_to_fastq.pl -k result.txt -b 10e-8 > result.fastq

There are plenty of mappers you can use to find the position of the kmers:

For partial matches blast or blat work well
For exact matches bwa fastmap is best
In general I find bowtie2 leads to the most useful interpretation

Example bowtie2 settings which have achieved good sensitivity

bowtie2-build S_pyogenes_NGAS322.fa S_pyogenes_NGAS322
bowtie2 -q -U result.fastq --ignore-quals -D 24 -R 3 -N 0 -L 7 -i S,1,0.50 -x S_pyogenes_NGAS322 -S NGAS322.sam

Which will create NGAS322.sam

If you have a lot of secondary mappings, you may wish to report them by adding the -a or -k option to bowtie2

Convert to a .plot file

The script mapping_to_phandango.pl will do this

scripts/mapping_to_phandango.pl -s NGAS322.sam -k seer.filtered.txt -m 20 > phandango.plot

Change 'NGAS322.sam' and 'seer.filtered.txt' to the correct files in your case. You'll then be able to drag and drop phandango.plot, the reference .gff and other data to see a Manhattan plot against the genome.

Mapping back to assemblies

Run map_back to find the position of significant kmers, in the assemblies they are from

./map_back -k seer.filtered.k31.txt -r references.txt --threads 4

Run blat or blast on the significant kmers

cut -f 1 seer.filtered.k31.txt | awk '{print ">"NR"\n"$1}' > seer.sorted.k31.fa
blat -noHead -stepSize=2 -minScore=15 reference.fa seer.sorted.k31.fa seer.k31.blat.psl
blast -task blastn-short -subject reference.fa -query seer.sorted.k31.fa -outfmt 6 seer.k31.blast.txt

Reduce this to the top hits

./scripts/blast_top_hits.pl --input seer.k31.blat.psl --mode blat --full_query > top_hits.txt

If you have bedops and bedtools you can annotate the position of these hits, for example

psl2bed < seer.k31.blat.psl > seer.k31.blat.bed
gff2bed < reference.gff > reference.bed
bedtools intersect -a seer.k31.blat.bed -b reference.bed -wb
bedtools intersect -a reference.bed -b seer.k31.blat.bed -c | sort -n -r -t $'\t' -k11,11

Fine mapping

If you have a lot of samples, you can try and fine-map the causal variant. Call SNPs from your mapping (the JScandy section describes how to obtain this) with samtools

samtools sort -O sam -T tmp NGAS322.sam  | samtools view -h -F 0x4 -b - > NGAS322.bam
samtools mpileup -Q 0 -ugf S_pyogenes_NGAS322.fa NGAS322.bam | bcftools call -P 0 -vmO z > NGAS322.vcf.gz

Filter your SNPs. QUAL > 100 is a good start. You can annotate them with SnpEff, and predict protein function with SIFT.

java -Xmx200m -jar snpEff.jar -c snpEff.config -v S_pyogenes_NGAS322 NGAS322.vcf.gz > NGAS322.vcf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly