N50 and L50 jargon is confusing #15

Open
johnomics opened this issue Jun 8, 2017 · 9 comments

@johnomics

Prerequisites

  • [ X ] make sure you are using the latest version by running seqkit version
  • [ X ] read the usage

Describe your issue

Thanks for building seqkit, it is an extremely useful tool that I use every day.

seqkit stats -a produces N50 and L50 statistics. These labels are very confusing. 'N50' is the 'N50 length': the read length such that at least 50% of the bases are in reads of this length or longer. 'L50' is the 'N50 number': the number of reads in that set. The term L50 has no connection with its meaning and in fact suggests a length, which it is not. It would be much better to use the terms 'N50 length' and 'N50 number' (or similar) to make the meaning of these statistics clear. I realise other tools use the same jargon, but it is unclear and would be better replaced.
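
For readers who want the two definitions spelled out, here is a minimal Python sketch. It is not seqkit's code: the function name `n50_stats` and the example lengths are made up for illustration, and it assumes the common "at least half of the bases" convention, which may differ from seqkit's exact handling of ties.

```python
def n50_stats(lengths):
    """Return (N50 length, N50 number) for a list of sequence lengths.

    Hypothetical helper for illustration only, not part of seqkit.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        # Stop at the first sequence that brings the running total to at
        # least half of all bases (integer comparison avoids float rounding).
        if 2 * running >= total:
            # length is the N50 length; count is the N50 number ("L50").
            return length, count


n50_len, n50_num = n50_stats([100, 80, 50, 30, 20, 10, 10])
print(n50_len, n50_num)  # prints: 80 2
```

The example totals 300 bases; the two longest sequences (100 + 80 = 180) already cover half, so the N50 length is 80 and the N50 number is 2. The second value is a count of sequences, not a length, which is the source of the confusion around the label L50.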

@shenwei356
Owner

... min_len  avg_len  max_len  sum_gap  N50     L50
...      39      103    2,354        0  101  10,075

... min_len  avg_len  max_len  sum_gap  N50_len  N50_num
...      39      103    2,354        0      101   10,075

... min_len  avg_len  max_len  sum_gap  N50  L50(N50_num)
...      39      103    2,354        0  101        10,075

Do these read better?

I'm afraid we can only add some explanation; anything more risks making it more confusing. :)

@johnomics
Author

Thanks for looking at this so quickly. I think the second version (N50_len and N50_num) works well - clear and compact. It would be better not to use L50 to refer to the N50 number at all - I think this usage should be avoided, even if it is found elsewhere.

Just my opinion though - some context and debate here and here.

@shenwei356
Owner

Thanks John, let's just discard L50, which causes the confusion.

@johnomics
Author

Great, thank you.

shenwei356 added a commit that referenced this issue Jun 8, 2017
@RhettRautsaw

I feel like the conclusion of this thread was that you should use N50_num and N50_len (rather than L50), but the implementation just removed N50_num altogether. I agree with @johnomics that L50 is confusing and N50_num is more appropriate, but I disagree with its removal entirely. I would recommend putting N50_num back into seqkit stats.

@shenwei356
Owner

Just checked the code. L50 (N50_num) is computed but hidden. 😄

@Yutang-ETH

Hi Shenwei @shenwei356,

Sorry to jump in here, but I think this thread might be the best place to discuss my request. I guess N50 or L50 is not confusing to people anymore since high-throughput sequencing technologies are so common today (compared to 2017). I completely agree with @RhettRautsaw, I think it is time now to bring Lx stats back to seqkit. This would be very cool for large pangenome projects using only seqkit to calculate all the stats wanted. What do you think?

By the way, I really like seqkit! Thank you very much for providing this efficient and versatile tool for the world.

Best wishes,
Yutang

shenwei356 added a commit that referenced this issue Feb 22, 2024
@shenwei356
Owner

Just added a new column N50_num (L50).

$ echo -ne "aa\naaa\naaaa\naaaaa\naaaaaa\naaaaaaa\naaaaaaaa\naaaaaaaaa\naaaaaaaaaa\n" | csvtk mutate -Ht | seqkit tab2fx \
    | seqkit fx2tab -l -n | csvtk add-header -t -n seq,len | csvtk pretty -t
seq          len
----------   ---
aa           2  
aaa          3  
aaaa         4  
aaaaa        5  
aaaaaa       6  
aaaaaaa      7  
aaaaaaaa     8  
aaaaaaaaa    9  
aaaaaaaaaa   10

$ echo -ne "aa\naaa\naaaa\naaaaa\naaaaaa\naaaaaaa\naaaaaaaa\naaaaaaaaa\naaaaaaaaaa\n" | csvtk mutate -Ht | seqkit tab2fx \
    | seqkit stats -a
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len  Q1  Q2  Q3  sum_gap  N50  N50_num  Q20(%)  Q30(%)  AvgQual  GC(%)
-     FASTA   DNA          9       54        2        6       10   4   6   8        0    8        3       0       0        0      0

@Yutang-ETH

Wow, what a fast reply @shenwei356. Thank you very much.

I know I am asking too much, but it would be great to also support -L just like -N so that we can calculate -L 50, 90. What do you think? I really appreciate your work!

Best wishes,
Yutang
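
To make the request concrete, here is a rough Python sketch of what a generalised "Nx length / Nx number" calculation could look like for several thresholds at once (for example 50 and 90). It is only an illustration under stated assumptions: `nx_stats` is a hypothetical name, this is not seqkit's implementation, no `-L` flag is implied to exist, and the "at least x% of bases" convention is assumed.

```python
def nx_stats(lengths, thresholds=(50, 90)):
    """Map each threshold x to (Nx length, Nx number) in a single pass.

    Hypothetical helper for illustration only, not part of seqkit.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if any(x < 0 or x > 100 for x in thresholds):
        raise ValueError("thresholds must be within [0, 100]")
    total = sum(lengths)
    pending = sorted(set(thresholds))  # smaller thresholds are reached first
    results = {}
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        # Resolve every threshold reached by the running total so far.
        # Integer comparison avoids floating-point rounding at the boundary.
        while pending and 100 * running >= pending[0] * total:
            results[pending.pop(0)] = (length, count)
    return results


# Same nine sequences (lengths 2..10) as in the example earlier in the thread.
print(nx_stats(list(range(2, 11))))  # prints: {50: (8, 3), 90: (4, 7)}
```

For these lengths the total is 54 bases. The three longest sequences (10 + 9 + 8 = 27) reach 50%, which gives 8 and 3 and agrees with the N50 and N50_num columns shown above. Reaching 90% takes the seven longest sequences, the shortest of which has length 4.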
