`cluster`: enable updating clusters #249

bluegenes · 2024-02-27T21:06:21Z

In `cluster`, we build the graph from scratch each time. It would be great to accept another set of clusters, or an existing graph, as input and update it with new comparisons.
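To make the request concrete, here is a minimal sketch of one way an update could work. It is not the plugin's implementation. It seeds a union-find structure from previously written clusters and then merges in new comparisons. The names `UnionFind` and `update_clusters` are hypothetical, the existing clusters are assumed to be already parsed into lists of names (for example from the `name1;name2;name3` column of the cluster identities file), and new comparisons are assumed to be `(name_a, name_b, similarity)` tuples.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set forest keyed by signature name (hypothetical helper)."""

    def __init__(self):
        self.parent = {}

    def add(self, name):
        # Register a name as its own singleton set if it is new.
        self.parent.setdefault(name, name)

    def find(self, name):
        self.add(name)
        root = name
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walk directly at the root.
        while self.parent[name] != root:
            self.parent[name], name = root, self.parent[name]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def update_clusters(existing_clusters, new_pairs, threshold):
    """Merge new comparisons into existing clusters without rebuilding them.

    existing_clusters: list of lists of names (assumed already parsed).
    new_pairs: iterable of (name_a, name_b, similarity) tuples (assumed shape).
    """
    uf = UnionFind()
    # Seed the structure with the old clusters; no old edges are needed.
    for members in existing_clusters:
        for name in members:
            uf.union(members[0], name)
    for a, b, similarity in new_pairs:
        # Register both names so new singletons are preserved.
        uf.add(a)
        uf.add(b)
        # "Exceeds the threshold" is read here as strictly greater (assumption).
        if a != b and similarity > threshold:
            uf.union(a, b)
    groups = defaultdict(list)
    for name in list(uf.parent):
        groups[uf.find(name)].append(name)
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))
```

This only covers growth: adding signatures or comparisons can merge clusters but never split them. Removing signatures, or re-clustering at a higher threshold, would still need the original edges or a rebuild.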

ctb · 2024-02-27T21:10:05Z

references:

can we update clustering results with new signatures? sourmash#2272

@mr-eyes

This PR adds a new command, `cluster`, that can be used to cluster the output from `pairwise` and `multisearch`.

`cluster` uses `rustworkx-core` (which internally uses `petgraph`) to build a graph, adding edges between nodes when the similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by `pairwise` or `multisearch`, and will add all nodes to the graph to preserve singleton 'clusters' in the output.

`cluster` outputs two files:

1. cluster identities file: `Component_X, name1;name2;name3...`
2. cluster size histogram `cluster_size, count`

context for some things I tried:

- try using petgraph directly and removing rustworkx dependency
  > nope, `rustworkx-core` adds `connected_components` that returns the connected components, rather than just the number of connected components. Could reimplement if `rustworkx-core` brings in a lot of deps
- try using 'extend_with_edges' instead of add_edge logic.
  > nope, only in `petgraph`

**Punted Issues:**

- develop clustering visualizations (ref @mr-eyes kSpider/dbretina work). Optionally output dot file of graph? (#248)
- enable updating clusters, rather than always regenerating from scratch (#249)
- benchmark `cluster` (#247)
  > `pairwise` files can be millions of lines long. Would it be faster to parallel read them, store them in an `edges` vector, and then add nodes/edges sequentially? Note that we would probably need to either 1. store all edges (including those that do not pass threshold) or 2. after building the graph from edges, add nodes from `names_to_node` that are not already in the graph to preserve singletons.

Related issues:

* #219
* sourmash-bio/sourmash#2271
* sourmash-bio/sourmash#700
* sourmash-bio/sourmash#225
* sourmash-bio/sourmash#274

---------

Co-authored-by: C. Titus Brown <titus@idyll.org>
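The "collect edges first, then add singletons" idea from the benchmarking note can be illustrated with a small standard-library sketch. It does not use `rustworkx-core` or `petgraph`, and it is not the code from the PR. The function names `cluster_rows` and `size_histogram` are hypothetical, and the column names `query_name` and `match_name` are an assumption about the comparison CSV. Only rows whose similarity exceeds the threshold become edges, while every name seen is recorded so that singletons survive.

```python
import csv
import io
from collections import Counter, defaultdict, deque


def cluster_rows(rows, similarity_column, threshold):
    """Group names into connected components of the thresholded graph."""
    names = {}  # dict used as an insertion-ordered set of every name seen
    edges = []  # only pairs that pass the threshold are stored
    for row in rows:
        query, match = row["query_name"], row["match_name"]  # assumed columns
        names[query] = None
        names[match] = None
        if query != match and float(row[similarity_column]) > threshold:
            edges.append((query, match))

    adjacency = defaultdict(set)
    for query, match in edges:
        adjacency[query].add(match)
        adjacency[match].add(query)

    seen = set()
    components = []
    # Walking over all names, not just edge endpoints, keeps singletons.
    for start in names:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        members = []
        while queue:
            node = queue.popleft()
            members.append(node)
            for neighbor in adjacency.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(members))
    return components


def size_histogram(components):
    """Return sorted (cluster_size, count) pairs."""
    return sorted(Counter(len(c) for c in components).items())


# Tiny made-up input, for illustration only.
demo = "query_name,match_name,jaccard\na,b,0.9\nb,c,0.8\nc,d,0.1\ne,e,1.0\n"
components = cluster_rows(csv.DictReader(io.StringIO(demo)), "jaccard", 0.5)
for index, members in enumerate(components, start=1):
    print(f"Component_{index},{';'.join(members)}")
for size, count in size_histogram(components):
    print(f"{size},{count}")
```

Because names are tracked separately from edges, this corresponds to option 2 in the note: edges below the threshold never need to be stored.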

bluegenes mentioned this issue Feb 27, 2024

MRG: Add graph-based clustering #234

Merged
