median wrong in assembly_stats tool aka fasta_summary.pl #96

peterjc · 2018-05-16T14:59:17Z

Noticed on a real dataset where my Python code did not get the same median contig length, reduced to a minimal test case:

$ cat /tmp/median.fasta 
>one
A
>two
AA
>three
AAA
>four
AAAA

There are an even number of sequences, 1, 2, 3, 4, meaning the median should be the mean of the middle two values, which is 2.5 in this trivial case.

As can be seen below, fasta_summary.pl picks the larger of the middle two values instead (which in the context of sequences will ensure the answer is an integer), so in practise the error will often be larger:

$ perl fasta_summary.pl -i /tmp/median.fasta -o /tmp && more /tmp/stats.txt 
  Directory '/tmp' exists, so the existing fasta_summary.pl output files will be overwritten
Statistics for read lengths:
	Min read length:	1
	Max read length:	4
	Mean read length:	2.50
	Standard deviation of read length:	1.12
	Median read length:	3
	N50 read length:	3

Statistics for numbers of reads:
	Number of reads:	4
	Number of reads >=1kb:	0
	Number of reads in N50:	2

Statistics for bases in the reads:
	Number of bases in all reads:	10
	Number of bases in reads >=1kb:	0
	GC Content of reads:	0.00 %

Simple Dinucleotide repeats:
	Number of reads with over 70% dinucleotode repeats:	0.00 % (0 reads)
	AT:	0.00 % (0 reads)
	CG:	0.00 % (0 reads)
	AC:	0.00 % (0 reads)
	TG:	0.00 % (0 reads)
	AG:	0.00 % (0 reads)
	TC:	0.00 % (0 reads)

Simple mononucleotide repeats:
	Number of reads with over 50% mononucleotode repeats:	50.00 % (2 reads)
	AA:	50.00 % (2 reads)
	TT:	0.00 % (0 reads)
	CC:	0.00 % (0 reads)
	GG:	0.00 % (0 reads)

CC original author @sujaikumar - given the Perl script has been used widely it may be simpler just to document this behaviour? Is there an official repository for this script?

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

median wrong in assembly_stats tool aka fasta_summary.pl #96

median wrong in assembly_stats tool aka fasta_summary.pl #96

peterjc commented May 16, 2018

median wrong in assembly_stats tool aka fasta_summary.pl #96

median wrong in assembly_stats tool aka fasta_summary.pl #96

Comments

peterjc commented May 16, 2018