Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: Add graph-based clustering #234

Merged
merged 36 commits into from
Feb 27, 2024
Merged

MRG: Add graph-based clustering #234

merged 36 commits into from
Feb 27, 2024

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented Feb 22, 2024

This PR adds a new command, cluster, that can be used to cluster the output from pairwise and multisearch.

clusteruses rustworkx-core (which internally uses petgraph) to build a graph, adding edges between nodes when the similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by pairwise or multisearch, and will add all nodes to the graph to preserve singleton 'clusters' in the output.

cluster outputs two files:

  1. cluster identities file: Component_X, name1;name2;name3...
  2. cluster size histogram cluster_size, count

context for some things I tried:

  • try using petgraph directly and removing rustworkx dependency

nope,rustworkx-core adds connected_components that returns the connected components, rather than just the number of connected components. Could reimplement if rustworkx-core brings in a lot of deps

  • try using 'extend_with_edges' instead of add_edge logic.

nope, only in petgraph

Punted Issues:

pairwise files can be millions of lines long. Would it be faster to parallel read them, store them in an edges vector, and then add nodes/edges sequentially? Note that we would probably need to either 1. store all edges, including those that do not pass threshold) or 2. After building the graph from edges, add nodes from names_to_node that are not already in the graph to preserve singletons.

Related issues:

@bluegenes bluegenes changed the title WIP: Add rustworkx-based clustering WIP: Add graph-based clustering Feb 24, 2024
@bluegenes bluegenes changed the base branch from main to add-ani February 24, 2024 17:48
Base automatically changed from add-ani to main February 26, 2024 16:46
bluegenes added a commit that referenced this pull request Feb 26, 2024
Adds ANI columns to pairwise and multisearch output, building off of @mr-eyes's ANI translations (#188) which got streamlined + added into sourmash core in sourmash-bio/sourmash#2943

Split from #234 to make things more concise/simpler.

## benchmark summary

## 12k ICTV viral genomes, scaled=200

79.8m comparisons

| version | experiment | time |
| -------- | -------- | -------- |
| PR | no ANI     | 21s     | 
| PR | with ANI     | 20s     |
| v0.9.0 | no ANI | 21s |


## 12k ICTV viral genomes, scaled=10

79.8m comparisons

| version | experiment | time |
| -------- | -------- | -------- |
| PR | no ANI     | 14m 0s     | 
| PR | with ANI     | 14m 6s     | 
| main branch (~v0.9.0) | no ANI | 14m 47s |
@bluegenes
Copy link
Contributor Author

ok, so just to be explicit: our intent is that the same column names mean the same thing as in sourmash now, and that's how this PR does things?

yes.

There is still one difference -- in branchwater we are reporting ANI as percent (*100), whereas in sourmash we report it as a fraction. This is something I'd like to propose changing in sourmash at some point, but I could go back to fraction to keep it standardized.

please do 😆

done, sigh.

Is there a specific reason for representing ANI with 0-100% instead of 0-1 fractions?

more biologist friendly and similar to what many other tools report. e.g. when outputting kraken report format in sourmash, i convert to a percent to keep as close to possible as the original format.

@ctb
Copy link
Collaborator

ctb commented Feb 26, 2024

There is still one difference -- in branchwater we are reporting ANI as percent (*100), whereas in sourmash we report it as a fraction. This is something I'd like to propose changing in sourmash at some point, but I could go back to fraction to keep it standardized.

please do 😆

done, sigh.

😭 much appreciated

Is there a specific reason for representing ANI with 0-100% instead of 0-1 fractions?

more biologist friendly and similar to what many other tools report. e.g. when outputting kraken report format in sourmash, i convert to a percent to keep as close to possible as the original format.

I don't think we need to be particularly biologist friendly in our raw CSVs. Syntactic and semantic mismatches are a huge problem for our own internal toolset tho!

@mr-eyes
Copy link
Member

mr-eyes commented Feb 26, 2024

more biologist friendly and similar to what many other tools report. e.g. when outputting kraken report format in sourmash, i convert to a percent to keep as close to possible as the original format.

I am afraid this will introduce inconsistencies/confusion in handling the output files. We will need to take care of this information when doing any downstream processing.
In this PR, you are using ANI as fractions, not percentages. Even if reported as percentages, they are still required to follow the fraction format when doing the user-defined CLI cutoff on creating the edges.

@bluegenes
Copy link
Contributor Author

bluegenes commented Feb 26, 2024

In this PR, you are using ANI as fractions, not percentages. Even if reported as percentages, they are still required to follow the fraction format when doing the user-defined CLI cutoff on creating the edges.

I previously (in ANI PR) had it all in percentage, including the CLI for cluster. No matter, it is all fractions now :).

minor regret about my initial decision when I added ANI to sourmash 🤷🏻‍♀️😅

@ctb
Copy link
Collaborator

ctb commented Feb 26, 2024

minor regret about my initial decision when I added ANI to sourmash 🤷🏻‍♀️

lol well if we're going to talk about regrets for early design decisions I'm sure I have a list somewhere...

@ctb
Copy link
Collaborator

ctb commented Feb 27, 2024

I ran pairwise and cluster just now. Ran into a few hiccups, fixed them here: #245

A few comments/questions/requests:

  • could you add a test for -h?
  • could you add a test that uses the default column name (so, tests for the other problem I found 😉 )
  • could you add some minimal benchmarks (time/memory) for a standard-ish comparison, e.g. gtdb-reps, so that users know what to expect from both pairwise and cluster for a real-ish analysis? ISTR it's pretty fast against gtdb-reps. (It's OK to punt this one to an issue for later.)

Other thoughts:

  • it surprised me that pairwise does not output self-matches, but maybe that's ok?
  • the downstream consequence of that was that the cluster output only contained the vast minority of sketch names! which was the real surprise. Might be mentioned in the documentation.
  • it is surprising to me that --cluster-sizes is required. What do you think about making it optional?

What I ran

sourmash scripts pairwise podar-ref.zip -o podar-ref-cmp.csv --ani
sourmash scripts cluster podar-ref-cmp.csv -o podar-ref-cmp.cluster.csv --cluster-sizes podar-ref-cmp.cluster.hist.csv

@ctb
Copy link
Collaborator

ctb commented Feb 27, 2024

(and, I mean, y'know - nice work! :)

@bluegenes
Copy link
Contributor Author

the downstream consequence of that was that the cluster output only contained the vast minority of sketch names! which was the real surprise. Might be mentioned in the documentation.

This was initially confusing to me, as cluster keeps all nodes it sees (only doesn't add edges if they don't meet threshold). But of course, sketches with no similarity at all to other sketches will not appear in pairwise at all. It is probably worth evaluating the performance of writing self similarity to make cluster output robust. Otherwise I would probably default to using multisearch for clustering instead -- ideal scenario is to have all sketch names in the output!

@ctb
Copy link
Collaborator

ctb commented Feb 27, 2024

the downstream consequence of that was that the cluster output only contained the vast minority of sketch names! which was the real surprise. Might be mentioned in the documentation.

This was initially confusing to me, as cluster keeps all nodes it sees (only doesn't add edges if they don't meet threshold). But of course, sketches with no similarity at all to other sketches will not appear in pairwise at all. It is probably worth evaluating the performance of writing self similarity to make cluster output robust. Otherwise I would probably default to using multisearch for clustering instead -- ideal scenario is to have all sketch names in the output!

exactly! 😄

This is why I think a note in the docs (either in pairwise, or in cluster, or both?) is a good idea. If it is a persistent confusion we can always add an option to pairwise.

@ctb
Copy link
Collaborator

ctb commented Feb 27, 2024

I think this is ready to go, right?

Copy link
Collaborator

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bluegenes
Copy link
Contributor Author

just need to run the benchmarking!

@ctb
Copy link
Collaborator

ctb commented Feb 27, 2024

just need to run the benchmarking!

you can punt to an issue... ;)

@bluegenes
Copy link
Contributor Author

you can punt to an issue... ;)

will do!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants