add 'cluster' and 'cocluster' to sourmash #700

ctb · 2019-07-20T19:05:05Z

the cocluster script may be useful for people comparing the output of binning.

see also #459

ctb · 2019-07-20T19:06:56Z

```
% python cocluster.py --first podar-ref/63.fa.sig --second podar-ref/63.fa.sig podar-ref/2.fa.sig  -k 31 --cut-point=1.0
first list contains 1 files; second list contains 2 files.
... loading file 0 of 1 for first list
... loading file 1 of 2 for second list
ksize: 31 / moltype: DNA
downsampling to scaled value of 1000
first list contains 1 signatures; second list contains 2 signatures.
...comparing 3 signatures, all by all

0-NC_011663.1 She...    [1. 1. 0.]
1-NC_011663.1 She...    [1. 1. 0.]
2-CP001071.1 Akke...    [0. 0. 1.]
min similarity in matrix: 0.000
** wrote coclust dendrogram to sourmash.coclust.dendro.pdf
cluster 2 is 1 in size
         CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome
cluster 1 is 2 in size
         NC_011663.1 Shewanella baltica OS223, complete genome
         NC_011663.1 Shewanella baltica OS223, complete genome
** wrote coclust assignments spreadsheet to sourmash.coclust.csv
```
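
For readers who want to see how a grouping like the one above can fall out of the all-by-all matrix, here is a minimal sketch. It is not the code in `cocluster.py`: the thread does not show how `--cut-point` is applied to the dendrogram, so this sketch assumes a plain similarity threshold with single-linkage grouping instead. The function name `group_by_similarity`, the `min_similarity` parameter, and the "(first list)"/"(second list)" labels are all hypothetical.

```python
from collections import defaultdict


def group_by_similarity(names, matrix, min_similarity):
    """Single-linkage grouping: items end up together if a chain of
    pairwise similarities >= min_similarity connects them.
    (Assumption: a threshold stands in for the script's dendrogram cut.)"""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent pointers to the group representative,
        # shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if matrix[i][j] >= min_similarity:
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i, name in enumerate(names):
        clusters[find(i)].append(name)
    # Largest clusters first.
    return sorted(clusters.values(), key=len, reverse=True)


# The 3x3 similarity matrix printed in the run above.
names = [
    "NC_011663.1 (first list)",
    "NC_011663.1 (second list)",
    "CP001071.1 (second list)",
]
matrix = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

clusters = group_by_similarity(names, matrix, min_similarity=0.5)
for members in clusters:
    print(f"cluster of size {len(members)}: {members}")
```

With this matrix the two Shewanella signatures land in one group and the Akkermansia signature stays on its own, matching the cluster sizes reported in the run.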

ctb · 2021-01-10T14:58:22Z

see also #1265, uniqify script, which I think is nice and simple.

ctb · 2021-03-04T14:19:07Z

may be good as a plugin test #1353

@mr-eyes

This PR adds a new command, `cluster`, that can be used to cluster the output from `pairwise` and `multisearch`.

`cluster` uses `rustworkx-core` (which internally uses `petgraph`) to build a graph, adding edges between nodes when the similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by `pairwise` or `multisearch`, and will add all nodes to the graph to preserve singleton 'clusters' in the output.

`cluster` outputs two files:

1. cluster identities file: `Component_X, name1;name2;name3...`
2. cluster size histogram `cluster_size, count`

context for some things I tried:

- try using petgraph directly and removing rustworkx dependency
  > nope, `rustworkx-core` adds `connected_components` that returns the connected components, rather than just the number of connected components. Could reimplement if `rustworkx-core` brings in a lot of deps
- try using 'extend_with_edges' instead of add_edge logic.
  > nope, only in `petgraph`

**Punted Issues:**

- develop clustering visualizations (ref @mr-eyes kSpider/dbretina work). Optionally output dot file of graph? (#248)
- enable updating clusters, rather than always regenerating from scratch (#249)
- benchmark `cluster` (#247)
  > `pairwise` files can be millions of lines long. Would it be faster to parallel read them, store them in an `edges` vector, and then add nodes/edges sequentially? Note that we would probably need to either 1. store all edges (including those that do not pass threshold) or 2. after building the graph from edges, add nodes from `names_to_node` that are not already in the graph to preserve singletons.

Related issues:

* #219
* sourmash-bio/sourmash#2271
* sourmash-bio/sourmash#700
* sourmash-bio/sourmash#225
* sourmash-bio/sourmash#274

---------

Co-authored-by: C. Titus Brown <titus@idyll.org>
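
The approach in the PR description above (add every name as a node, add an edge when similarity exceeds the threshold, report connected components and a size histogram) can be sketched in plain Python. This is an illustration only, not the plugin's Rust code: it swaps in a union-find structure for the `rustworkx-core` graph, and the column names `query_name`, `match_name`, and `jaccard`, the function name `cluster_pairs`, and the sample rows are assumptions made for the example.

```python
import csv
import io
from collections import Counter, defaultdict


def cluster_pairs(rows, column, threshold):
    """Return (components, size_histogram) for an iterable of dict rows.
    Every name seen becomes a node, so singletons are preserved."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)  # registers the node on first sight
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for row in rows:
        # Look up both ends first so below-threshold names are still nodes.
        a, b = find(row["query_name"]), find(row["match_name"])
        # "exceeds the threshold": strictly greater, following the PR wording.
        if float(row[column]) > threshold:
            parent[a] = b

    groups = defaultdict(list)
    for name in list(parent):
        groups[find(name)].append(name)

    components = sorted(
        (sorted(members) for members in groups.values()),
        key=lambda members: (-len(members), members),
    )
    histogram = Counter(len(members) for members in components)
    return components, histogram


# Hypothetical pairwise-style input; real files have more columns.
sample = io.StringIO(
    "query_name,match_name,jaccard\n"
    "a,b,0.9\n"
    "b,c,0.8\n"
    "c,d,0.1\n"
    "e,e,1.0\n"
)
components, histogram = cluster_pairs(csv.DictReader(sample), "jaccard", 0.5)

for i, members in enumerate(components, start=1):
    print(f"Component_{i},{';'.join(members)}")
for size, count in sorted(histogram.items()):
    print(f"{size},{count}")
```

Because rows are consumed one at a time, this shape also covers option 2 from the benchmarking note: names whose edges all fall below the threshold still appear as singleton components.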

luizirber added idea enhancement and removed idea labels Aug 23, 2019

This was referenced Mar 4, 2021

consider adding dendrogram clustering/cut code into sourmash #459

Open

Look into clustering and clustering summarization #225

Open

ctb mentioned this issue Jul 31, 2023

sourmash plugins - ideas dumping ground #2453

Open

ctb added the plugin_todo Write a plugin for this! label Sep 23, 2023

ctb mentioned this issue Feb 26, 2024

MRG: Add graph-based clustering sourmash-bio/sourmash_plugin_branchwater#234

Merged
