The coverage problem and (maybe) wrong cluster problem #1

Open
liaoherui opened this issue Feb 18, 2020 · 4 comments

@liaoherui

Thanks for your wonderful tool!

My questions are:

  1. Are there any parameters related to alignment coverage?
    For example:
    [Image: 微信图片_20200218223602]
    As the picture shows, the query genome and the target genome share 99.46% identity but only 84% coverage. When I set "-memiden" to 99, they are assigned to the same cluster.

    So, is there a parameter for filtering by a "coverage" threshold?

  2. In my experiment, there are 2 highly similar genomes; their identity and coverage are shown in the picture below:
    [Image: 微信图片_20200218224234]
    However, when I set "-memiden" to 99, they are assigned to different clusters, which really confuses me. I am not sure what is going on.

(All the alignments in the pictures were done with the online megablast alignment tool.)

@sclirl
Contributor

sclirl commented Feb 19, 2020

Thank you very much for your interest in our software. The possible causes and solutions for your problems are as follows:

  1. The way our software calculates identity may differ slightly from blastn. Our definition is as follows: the extended MEMs identity (eMEMi) is calculated using the formula
    eMEMi = Nmatch / Lquery, where "Nmatch" is the number of matched nucleotides within extended MEMs and "Lquery" is the length of the shorter sequence.

    Therefore, the calculation assumes that the query sequence is shorter than the representative sequence (what you call the target sequence). You can check the identity of the two sequences under our software to confirm the clustering result.

  2. The two alignments you mentioned are ZKV_420 vs. Query_57683 and ZKV184 vs. Query_8205. Are these public accession numbers? We could not find them online; if convenient, could you provide the accession numbers or the sequences?
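As a minimal sketch of the eMEMi definition given in point 1, the ratio can be computed as below. The function name `emem_identity` and the example numbers are hypothetical and only illustrate the formula; this is not Gclust's actual code, and how Gclust finds and extends MEMs is not shown.

```python
def emem_identity(n_match: int, len_a: int, len_b: int) -> float:
    """eMEMi = Nmatch / Lquery, where Lquery is the length of the shorter sequence."""
    l_query = min(len_a, len_b)
    if l_query <= 0:
        raise ValueError("sequence lengths must be positive")
    return n_match / l_query


# Hypothetical numbers, for illustration only.
identity = emem_identity(n_match=9900, len_a=10000, len_b=12000)

# Compare against "-memiden 99", expressed here as a fraction.
passes = identity >= 0.99
```

Because the denominator is the full length of the shorter sequence, this value need not agree with a BLAST percent identity, which is reported over the aligned region only.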

I hope this helps. If you still have questions, please feel free to email lirl@sccas.cn or niubf@cnic.cn; we are happy to discuss this with you. Thank you very much!

@liaoherui
Author

Hi, sclirl, thanks for your fast reply!

For Problem 1, I think I understand what you mean. So in theory, if I set "-memiden" to 99, then every genome in a cluster should have eMEMi >= 99% to the representative genome (the longest one), right?

About Problem 2, I have uploaded the FASTA file I used for the experiment. It contains 648 complete viral genomes; ZKV_2 and ZKV_184 are the pair shown in the Problem 2 picture. They share high similarity but are assigned to different clusters by Gclust.
The command I used is:
gclust -minlen 41 -both -nuc -threads 16 -chunk 400 -loadall -memiden 99 -rebuild -ext 1 -sparse 4 ZKV_rebuild_gclust_remove100.fasta > ZKV_rebuild.gclust.cutoff_99.cls

You can download the data and see what's going on in this case.
ZKV_rebuild_gclust_remove100.zip

@liaoherui
Author

liaoherui commented Feb 19, 2020

Hi, sclirl

Sorry, I think I found the likely reason for Problem 2: I forgot to sort the genomes before running Gclust. After the sorting step, ZKV_2 and ZKV_184 end up in the correct cluster.

However, for ZKV_26 and ZKV_184, the problem still exists even after I sort the genomes. They are very similar (>99% query coverage and >99% identity with online megablast), but they are assigned to different clusters, which confuses me.
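For readers who hit the same issue, here is a rough sketch of the sorting step mentioned above. It assumes that Gclust expects genomes ordered from longest to shortest, which is what this thread suggests but does not state outright. The helper names (`read_fasta`, `sort_longest_first`, `format_fasta`) are hypothetical and are not part of Gclust; check the Gclust documentation for the supported way to sort input.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sort_longest_first(records: Iterable[Record]) -> List[Record]:
    """Return records ordered by sequence length, longest first (stable for ties)."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)


def format_fasta(records: Iterable[Record], width: int = 70) -> str:
    """Render records as FASTA text, wrapping sequences at `width` columns."""
    out: List[str] = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)


# Tiny made-up example; real input would be read from the FASTA file.
example = ">a\nAC\n>b\nACGTAC\nGT\n>c\nACGT\n"
sorted_records = sort_longest_first(read_fasta(example.splitlines()))
sorted_text = format_fasta(sorted_records)
```

With a real file, the same functions can be applied to an open file handle in place of `example.splitlines()`, and `sorted_text` written to a new FASTA file that is then passed to gclust.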

@sclirl
Contributor

sclirl commented Feb 19, 2020

Hi, liaoherui,
Yes, your understanding of Problem 1 is correct: with "-memiden 99", all genomes in a cluster should have eMEMi >= 99% to the representative genome (the longest one).

Problem 2: You can set the parameters -minlen and -sparse to smaller values. These two parameters have a large impact on the clustering result; the recommended values are -minlen 21 and -sparse 1 (or 2).

If you have other questions, feel free to contact us at any time. Thank you!
