MRG: Add graph-based clustering #234

bluegenes · 2024-02-22T01:00:21Z

This PR adds a new command, cluster, that can be used to cluster the output from pairwise and multisearch.

cluster uses rustworkx-core (which internally uses petgraph) to build a graph, adding edges between nodes when the similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by pairwise or multisearch, and will add all nodes to the graph to preserve singleton 'clusters' in the output.
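For readers who want to see the idea without the Rust code, here is a minimal pure-Python sketch of threshold-based connected-component clustering. It is not the implementation in this PR (which uses rustworkx-core); it swaps in a small union-find instead of a graph library. The function name `cluster_pairs`, the `query_name`/`match_name` column names, and the use of `>=` for the threshold comparison are assumptions for illustration.

```python
def cluster_pairs(rows, column, threshold):
    """Group names into connected components (illustrative sketch).

    Every name seen in `rows` is registered as a node, so names whose
    comparisons all fall below the threshold still come out as singletons.
    An edge (union) is only added when row[column] meets the threshold.
    """
    parent = {}

    def find(name):
        # Register the node on first sight, then walk up to its root.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for row in rows:
        a = find(row["query_name"])   # assumed column name
        b = find(row["match_name"])   # assumed column name
        # Assumption: ">=" is used here; the real comparison may be strict.
        if float(row[column]) >= threshold and a != b:
            parent[b] = a

    components = {}
    for name in parent:
        components.setdefault(find(name), []).append(name)
    # Largest clusters first, ties broken alphabetically, for stable output.
    return sorted((sorted(c) for c in components.values()),
                  key=lambda c: (-len(c), c))
```

With rows a-b (0.9), b-c (0.8) and c-d (0.1) and a threshold of 0.5, this groups a, b and c together and keeps d as a singleton.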

cluster outputs two files:

cluster identities file: Component_X, name1;name2;name3...
cluster size histogram: cluster_size, count
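As a rough illustration of the two outputs listed above, the sketch below turns a list of clusters into identity rows and a size histogram. The helper name `format_cluster_outputs` and the choice to number components from 0 are assumptions, not something this PR description specifies.

```python
from collections import Counter


def format_cluster_outputs(clusters):
    """Return (identity_rows, histogram_rows) for a list of clusters.

    identity_rows:  (Component_X, "name1;name2;...") tuples
    histogram_rows: (cluster_size, count) tuples, sorted by size
    """
    identity_rows = [
        # Assumption: components are numbered from 0.
        (f"Component_{i}", ";".join(names))
        for i, names in enumerate(clusters)
    ]
    size_counts = Counter(len(names) for names in clusters)
    histogram_rows = sorted(size_counts.items())
    return identity_rows, histogram_rows
```

Each returned list can be handed to `csv.writer(...).writerows(...)` to produce the corresponding file.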

context for some things I tried:

try using petgraph directly and removing rustworkx dependency

nope, rustworkx-core adds connected_components, which returns the connected components themselves rather than just the number of connected components. Could reimplement if rustworkx-core brings in a lot of deps.

try using 'extend_with_edges' instead of add_edge logic.

nope, only in petgraph

Punted Issues:

develop clustering visualizations (ref @mr-eyes kSpider/dbretina work). Optionally output dot file of graph? (cluster: develop downstream usage and visualization #248)
enable updating clusters, rather than always regenerating from scratch (cluster: enable updating clusters #249)
benchmark cluster (benchmark pairwise --> cluster #247)

pairwise files can be millions of lines long. Would it be faster to read them in parallel, store them in an edges vector, and then add nodes/edges sequentially? Note that we would probably need to either (1) store all edges, including those that do not pass the threshold, or (2) after building the graph from edges, add nodes from names_to_node that are not already in the graph to preserve singletons.
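A sketch of what the second option could look like, in Python rather than the Rust used by this PR: parse chunks of lines concurrently, keep only the edges that pass the threshold plus the set of all names seen, then recover singletons afterwards. The simplified `name1,name2,similarity` line format, the function names, and the chunking parameters are all assumptions for illustration. Python threads will not speed up CPU-bound parsing, so this only shows the structure, not the expected performance.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_chunk(lines, threshold):
    """Parse 'name1,name2,similarity' lines (assumed format).

    Returns the set of all names seen and the list of edges whose
    similarity meets the threshold.
    """
    names, edges = set(), []
    for line in lines:
        a, b, sim = line.strip().split(",")
        names.update((a, b))
        if float(sim) >= threshold:
            edges.append((a, b))
    return names, edges


def collect_edges(lines, threshold, chunk_size=2, workers=2):
    """Parse chunks concurrently, then merge the results sequentially."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    all_names, all_edges = set(), []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in chunk order, so edge order is stable.
        for names, edges in pool.map(lambda c: parse_chunk(c, threshold), chunks):
            all_names |= names
            all_edges.extend(edges)
    # Names that never appear in a passing edge must be added back
    # as singleton nodes when the graph is built.
    connected = {name for edge in all_edges for name in edge}
    singletons = sorted(all_names - connected)
    return all_edges, singletons
```

This keeps memory proportional to the passing edges plus the name set, instead of storing every edge.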

Related issues:

@mr-eyes

Adds ANI columns to pairwise and multisearch output, building off of @mr-eyes's ANI translations (#188) which got streamlined + added into sourmash core in sourmash-bio/sourmash#2943

Split from #234 to make things more concise/simpler.

## benchmark summary

## 12k ICTV viral genomes, scaled=200

79.8m comparisons

| version | experiment | time |
| -------- | -------- | -------- |
| PR | no ANI | 21s |
| PR | with ANI | 20s |
| v0.9.0 | no ANI | 21s |

## 12k ICTV viral genomes, scaled=10

79.8m comparisons

| version | experiment | time |
| -------- | -------- | -------- |
| PR | no ANI | 14m 0s |
| PR | with ANI | 14m 6s |
| main branch (~v0.9.0) | no ANI | 14m 47s |

bluegenes · 2024-02-26T19:26:00Z

ok, so just to be explicit: our intent is that the same column names mean the same thing as in sourmash now, and that's how this PR does things?

yes.

There is still one difference -- in branchwater we are reporting ANI as percent (*100), whereas in sourmash we report it as a fraction. This is something I'd like to propose changing in sourmash at some point, but I could go back to fraction to keep it standardized.

please do 😆

done, sigh.

Is there a specific reason for representing ANI with 0-100% instead of 0-1 fractions?

more biologist friendly and similar to what many other tools report. e.g. when outputting kraken report format in sourmash, i convert to a percent to keep as close as possible to the original format.

ctb · 2024-02-26T19:28:20Z

There is still one difference -- in branchwater we are reporting ANI as percent (*100), whereas in sourmash we report it as a fraction. This is something I'd like to propose changing in sourmash at some point, but I could go back to fraction to keep it standardized.

please do 😆

done, sigh.

😭 much appreciated

Is there a specific reason for representing ANI with 0-100% instead of 0-1 fractions?

more biologist friendly and similar to what many other tools report. e.g. when outputting kraken report format in sourmash, i convert to a percent to keep as close as possible to the original format.

I don't think we need to be particularly biologist friendly in our raw CSVs. Syntactic and semantic mismatches are a huge problem for our own internal toolset tho!

mr-eyes · 2024-02-26T19:32:03Z

more biologist friendly and similar to what many other tools report. e.g. when outputting kraken report format in sourmash, i convert to a percent to keep as close as possible to the original format.

I am afraid this will introduce inconsistencies/confusion in handling the output files. We will need to take care of this information when doing any downstream processing.
In this PR, you are using ANI as fractions, not percentages. Even if reported as percentages, they are still required to follow the fraction format when doing the user-defined CLI cutoff on creating the edges.
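To make the mismatch concrete, here is a small hypothetical helper (not part of this PR or of sourmash) that normalizes a value to the 0-1 scale and refuses anything that still looks like a percentage. The name `as_fraction` and the `is_percent` flag are assumptions for illustration only.

```python
def as_fraction(value, is_percent=False):
    """Return a similarity value on the 0-1 scale.

    Raises ValueError if the result falls outside [0, 1], which is the
    typical symptom of passing a percent where a fraction is expected.
    """
    fraction = value / 100.0 if is_percent else value
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"expected a value in [0, 1], got {fraction}")
    return fraction
```

A threshold of 95 passed without `is_percent=True` raises instead of silently producing a graph with no edges.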

bluegenes · 2024-02-26T19:38:10Z

In this PR, you are using ANI as fractions, not percentages. Even if reported as percentages, they are still required to follow the fraction format when doing the user-defined CLI cutoff on creating the edges.

I previously (in ANI PR) had it all in percentage, including the CLI for cluster. No matter, it is all fractions now :).

minor regret about my initial decision when I added ANI to sourmash 🤷🏻‍♀️😅

ctb · 2024-02-26T20:04:26Z

minor regret about my initial decision when I added ANI to sourmash 🤷🏻‍♀️

lol well if we're going to talk about regrets for early design decisions I'm sure I have a list somewhere...

ctb · 2024-02-27T14:35:39Z

I ran pairwise and cluster just now. Ran into a few hiccups, fixed them here: #245

A few comments/questions/requests:

could you add a test for -h?
could you add a test that uses the default column name (so, tests for the other problem I found 😉 )
could you add some minimal benchmarks (time/memory) for a standard-ish comparison, e.g. gtdb-reps, so that users know what to expect from both pairwise and cluster for a real-ish analysis? ISTR it's pretty fast against gtdb-reps. (It's OK to punt this one to an issue for later.)

Other thoughts:

it surprised me that pairwise does not output self-matches, but maybe that's ok?
the downstream consequence of that was that the cluster output only contained the vast minority of sketch names! which was the real surprise. Might be mentioned in the documentation.
it is surprising to me that --cluster-sizes is required. What do you think about making it optional?

What I ran

sourmash scripts pairwise podar-ref.zip -o podar-ref-cmp.csv --ani
sourmash scripts cluster podar-ref-cmp.csv -o podar-ref-cmp.cluster.csv --cluster-sizes podar-ref-cmp.cluster.hist.csv

ctb · 2024-02-27T14:35:54Z

(and, I mean, y'know - nice work! :)

bluegenes · 2024-02-27T16:01:29Z

the downstream consequence of that was that the cluster output only contained the vast minority of sketch names! which was the real surprise. Might be mentioned in the documentation.

This was initially confusing to me, as cluster keeps all nodes it sees (only doesn't add edges if they don't meet threshold). But of course, sketches with no similarity at all to other sketches will not appear in pairwise at all. It is probably worth evaluating the performance of writing self similarity to make cluster output robust. Otherwise I would probably default to using multisearch for clustering instead -- ideal scenario is to have all sketch names in the output!
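One way to check for this in practice is to compare the full list of sketch names against the names that actually appear in the pairwise CSV. The sketch below assumes the rows expose `query_name` and `match_name` columns and that the full name list is available separately; the function name is hypothetical.

```python
def missing_from_pairwise(all_names, rows):
    """Return sketch names that never appear in any pairwise row.

    These are the sketches that would be absent from the cluster output
    unless they are added back as singletons.
    """
    seen = set()
    for row in rows:
        seen.add(row["query_name"])   # assumed column name
        seen.add(row["match_name"])   # assumed column name
    return sorted(set(all_names) - seen)
```

The returned names could be appended to the cluster output as one-member components, or used to decide whether multisearch is the better input.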

ctb · 2024-02-27T16:06:51Z

the downstream consequence of that was that the cluster output only contained the vast minority of sketch names! which was the real surprise. Might be mentioned in the documentation.

This was initially confusing to me, as cluster keeps all nodes it sees (only doesn't add edges if they don't meet threshold). But of course, sketches with no similarity at all to other sketches will not appear in pairwise at all. It is probably worth evaluating the performance of writing self similarity to make cluster output robust. Otherwise I would probably default to using multisearch for clustering instead -- ideal scenario is to have all sketch names in the output!

exactly! 😄

This is why I think a note in the docs (either in pairwise, or in cluster, or both?) is a good idea. If it is a persistent confusion we can always add an option to pairwise.

ctb · 2024-02-27T17:46:14Z

I think this is ready to go, right?

ctb

⭐

bluegenes · 2024-02-27T17:51:24Z

just need to run the benchmarking!

ctb · 2024-02-27T17:53:10Z

just need to run the benchmarking!

you can punt to an issue... ;)

bluegenes · 2024-02-27T21:02:56Z

you can punt to an issue... ;)

will do!

bluegenes added 3 commits February 21, 2024 16:31

compiling cmds

c57550f

working test

9ebef77

rm rust tests in favor of python tests

c69a6f0

bluegenes mentioned this pull request Feb 22, 2024

Adapt the kSpider's algorithm in pairwise comparisons #219

Open

rm comment

5628845

bluegenes force-pushed the add-cluster branch from ee19b06 to 5628845 Compare February 22, 2024 01:10

bluegenes added 13 commits February 22, 2024 17:21

add ani to pairwise and multisearch

9eb62ae

add ani testing for multisearch

23b35be

add ani to test csv

99b02a2

Merge branch 'main' into add-cluster

5efa56d

upd tests; also output cluster size histogram

53dc312

add tests for max ani, max contain

6929f50

keep singletons!

4424f41

multithreaded read records

b9926d1

upd ani

c96c15a

test bad,empty input

e880a6b

rustfmt, clippy

9eab7ea

fix for percent ani

1b64a53

make ani default

98cbc27

bluegenes changed the title ~~WIP: Add rustworkx-based clustering~~ WIP: Add graph-based clustering Feb 24, 2024

actually, keep read sequential for now

ba2084d

bluegenes mentioned this pull request Feb 24, 2024

MRG: Add ANI output to pairwise, multisearch #236

Merged

bluegenes changed the base branch from main to add-ani February 24, 2024 17:48

Base automatically changed from add-ani to main February 26, 2024 16:46

bluegenes added 5 commits February 26, 2024 08:48

replace ani calc with split br

3a80062

Merge branch 'main' into add-cluster

170b901

adjust+test for optional ANI; use cANI terminology

1c8bd63

print underlying errors from graph building

49c249c

Merge branch 'main' into add-cluster

e497cdf

ctb mentioned this pull request Feb 27, 2024

MRG: fix argparse stuff in cluster command #245

Merged

ctb and others added 2 commits February 27, 2024 06:57

fix misc argparse things (#245)

3fa24da

make sizes optional; test help and defaults

a5df3cd

add multisearch to help

3b4e0e8

add option to write all results

894237b

ctb approved these changes Feb 27, 2024

bluegenes and others added 2 commits February 27, 2024 09:57

Merge branch 'main' into add-cluster

2dea5ab

fix pw --write-all for no ani

f79a35b

bluegenes mentioned this pull request Feb 27, 2024

benchmark pairwise --> cluster #247

Open

fix help for similarity, not distance

af88b59

bluegenes merged commit 6d754b5 into main Feb 27, 2024
1 check passed

bluegenes deleted the add-cluster branch February 27, 2024 21:14

ctb mentioned this pull request Feb 27, 2024

MRG: PR to release v0.9.1 #250

Merged

mr-eyes mentioned this pull request Feb 27, 2024

pairwise/clustering downstream-analysis research-driven thoughts #252

Open

bluegenes mentioned this pull request Feb 29, 2024

WIP: use identical column names as sourmash gather #259

Closed

mr-eyes mentioned this pull request Mar 7, 2024

updated usage example dib-lab/kSpider#39

Open

ctb mentioned this pull request May 14, 2024

sourmash compare runs out of memory on large comparisons sourmash-bio/sourmash#3134

Open
