Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

median wrong in assembly_stats tool aka fasta_summary.pl #96

Open
peterjc opened this issue May 16, 2018 · 0 comments
Open

median wrong in assembly_stats tool aka fasta_summary.pl #96

peterjc opened this issue May 16, 2018 · 0 comments

Comments

@peterjc
Copy link

peterjc commented May 16, 2018

Noticed on a real dataset where my Python code did not get the same median contig length, reduced to a minimal test case:

$ cat /tmp/median.fasta 
>one
A
>two
AA
>three
AAA
>four
AAAA

There are an even number of sequences, 1, 2, 3, 4, meaning the median should be the mean of the middle two values, which is 2.5 in this trivial case.

As can be seen below, fasta_summary.pl picks the larger of the middle two values instead (which in the context of sequences will ensure the answer is an integer), so in practise the error will often be larger:

$ perl fasta_summary.pl -i /tmp/median.fasta -o /tmp && more /tmp/stats.txt 
  Directory '/tmp' exists, so the existing fasta_summary.pl output files will be overwritten
Statistics for read lengths:
	Min read length:	1
	Max read length:	4
	Mean read length:	2.50
	Standard deviation of read length:	1.12
	Median read length:	3
	N50 read length:	3

Statistics for numbers of reads:
	Number of reads:	4
	Number of reads >=1kb:	0
	Number of reads in N50:	2

Statistics for bases in the reads:
	Number of bases in all reads:	10
	Number of bases in reads >=1kb:	0
	GC Content of reads:	0.00 %

Simple Dinucleotide repeats:
	Number of reads with over 70% dinucleotode repeats:	0.00 % (0 reads)
	AT:	0.00 % (0 reads)
	CG:	0.00 % (0 reads)
	AC:	0.00 % (0 reads)
	TG:	0.00 % (0 reads)
	AG:	0.00 % (0 reads)
	TC:	0.00 % (0 reads)

Simple mononucleotide repeats:
	Number of reads with over 50% mononucleotode repeats:	50.00 % (2 reads)
	AA:	50.00 % (2 reads)
	TT:	0.00 % (0 reads)
	CC:	0.00 % (0 reads)
	GG:	0.00 % (0 reads)


CC original author @sujaikumar - given the Perl script has been used widely it may be simpler just to document this behaviour? Is there an official repository for this script?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant