N50 and L50 jargon is confusing #15

johnomics · 2017-06-08T09:26:03Z

Prerequisites

[ X ] make sure you're using the latest version by seqkit version
[ X ] read the usage

Describe your issue

Thanks for building seqkit, it is an extremely useful tool that I use every day.

seqkit stats -a produces N50 and L50 statistics. These labels are very confusing; 'N50' is the 'N50 length', the length of read such that 50% of the bases are in reads of this length or longer. 'L50' is the 'N50 number', the number of reads in this set. The term L50 has no connection with its meaning and in fact suggests it is to do with a length, which is not true. It would be much better to use the terms 'N50 length' and 'N50 number' (or similar terms) to make the meaning of these statistics clear. I realise other tools use the same jargon, but it is unclear and would be better replaced.
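
For readers who want the two definitions spelled out, here is a minimal Python sketch. It is not seqkit's implementation: the function name `n50_length_and_number` and the example lengths are made up for illustration, and "50% of the bases" is treated as inclusive (at least half).

```python
def n50_length_and_number(lengths):
    """Return (N50 length, N50 number) for a list of read lengths.

    Hypothetical helper for illustration only; not part of seqkit.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest read to the shortest, accumulating bases.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        # Stop once the reads seen so far hold at least half of all bases.
        if 2 * running >= total:
            return length, count


# Made-up example: 300 bases in total, so the halfway point is 150.
n50_len, n50_num = n50_length_and_number([100, 80, 60, 40, 20])
print(n50_len, n50_num)  # 80 2
```

The first value is a length (what `seqkit stats -a` prints as N50) and the second is a count of reads (the value often labelled L50), which is why the 'L' can mislead.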

shenwei356 · 2017-06-08T11:20:08Z

... min_len  avg_len  max_len  sum_gap  N50     L50
...      39      103    2,354        0  101  10,075

... min_len  avg_len  max_len  sum_gap  N50_len  N50_num
...      39      103    2,354        0      101   10,075

... min_len  avg_len  max_len  sum_gap  N50  L50(N50_num)
...      39      103    2,354        0  101        10,075

Do these read better?

I'm afraid we can only add some explanation before making it more confusing. :)

johnomics · 2017-06-08T11:28:56Z

Thanks for looking at this so quickly. I think the second version (N50_len and N50_num) works well - clear and compact. It would be better not to use L50 to refer to the N50 number at all - I think this usage should be avoided, even if it is found elsewhere.

Just my opinion though - some context and debate here and here.

shenwei356 · 2017-06-08T11:40:12Z

Thanks John, let's just discard L50, which causes confusion.

johnomics · 2017-06-08T11:41:20Z

Great, thank you.

RhettRautsaw · 2024-01-11T14:37:02Z

I feel like the conclusion of this thread was that you should use N50_num and N50_len (rather than L50), but then the implementation was to just remove N50_num altogether. I agree with @johnomics that L50 is confusing and N50_num is more appropriate, but I disagree with its removal entirely. I would recommend putting N50_num back into seqkit stats.

shenwei356 · 2024-01-11T15:23:28Z

Just checked the code. L50 (N50_num) is computed but hidden. 😄

Yutang-ETH · 2024-02-22T14:02:07Z

Hi Shenwei @shenwei356 ,

Sorry to jump in here, but I think this thread might be the best place to discuss my request. I guess N50 or L50 is not confusing to people anymore since high-throughput sequencing technologies are so common today (compared to 2017). I completely agree with @RhettRautsaw, I think it is time now to bring Lx stats back to seqkit. This would be very cool for large pangenome projects using only seqkit to calculate all the stats wanted. What do you think?

By the way, I really like seqkit! Thank you very much for providing this efficient and versatile tool for the world.

Best wishes,
Yutang

shenwei356 · 2024-02-22T16:16:05Z

Just added a new column N50_num (L50).

$ echo -ne "aa\naaa\naaaa\naaaaa\naaaaaa\naaaaaaa\naaaaaaaa\naaaaaaaaa\naaaaaaaaaa\n" | csvtk mutate -Ht | seqkit tab2fx \
    | seqkit fx2tab -l -n | csvtk add-header -t -n seq,len | csvtk pretty -t
seq          len
----------   ---
aa           2  
aaa          3  
aaaa         4  
aaaaa        5  
aaaaaa       6  
aaaaaaa      7  
aaaaaaaa     8  
aaaaaaaaa    9  
aaaaaaaaaa   10

$ echo -ne "aa\naaa\naaaa\naaaaa\naaaaaa\naaaaaaa\naaaaaaaa\naaaaaaaaa\naaaaaaaaaa\n" | csvtk mutate -Ht | seqkit tab2fx \
    | seqkit stats -a
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len  Q1  Q2  Q3  sum_gap  N50  N50_num  Q20(%)  Q30(%)  AvgQual  GC(%)
-     FASTA   DNA          9       54        2        6       10   4   6   8        0    8        3       0       0        0      0

Yutang-ETH · 2024-02-22T16:23:40Z

Wow, what a fast reply @shenwei356. Thank you very much.

I know I am asking too much, but it would be great to also support -L just like -N so that we can calculate -L 50, 90. What do you think? I really appreciate your work!

Best wishes,
Yutang
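
To make the request concrete, here is a rough Python sketch of what Lx values for several thresholds would look like. It is an illustration only: `-L` is a requested option in this thread, not an existing seqkit flag, and `nx_lx` is a hypothetical helper. The input is the nine sequence lengths (2 to 10) from the example above, so the 50% entry should agree with the N50 = 8 and N50_num = 3 reported there.

```python
def nx_lx(lengths, thresholds=(50, 90)):
    """Map each integer threshold x (0 < x <= 100) to (Nx length, Lx count).

    Hypothetical helper for illustration only; not part of seqkit.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    result = {}
    for x in thresholds:
        running = 0
        for count, length in enumerate(ordered, start=1):
            running += length
            # Integer comparison avoids floating-point rounding at the cutoff.
            if 100 * running >= x * total:
                result[x] = (length, count)
                break
    return result


lengths = list(range(2, 11))  # 2, 3, ..., 10 as in the example above
print(nx_lx(lengths))  # {50: (8, 3), 90: (4, 7)}
```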

johnomics closed this as completed Jun 8, 2017

shenwei356 added a commit that referenced this issue Jun 8, 2017

#15

e1f324d

shenwei356 added a commit that referenced this issue Feb 22, 2024

stats: add N50_num (L50). #15

2ea68c8

shenwei356 added the new feature label Feb 22, 2024

shenwei356 reopened this Feb 22, 2024

shenwei356 mentioned this issue Mar 11, 2024

Update SeqKit to v2.8.0 bioconda/bioconda-recipes#46317

Merged

BrewTestBot mentioned this issue Mar 11, 2024

seqkit 2.8.0 Homebrew/homebrew-core#165818

Merged
