benchmark `pairwise` --> `cluster` #247

bluegenes · 2024-02-27T21:04:14Z

could you add some minimal benchmarks (time/memory) for a standard-ish comparison, e.g. gtdb-reps, so that users know what to expect from both pairwise and cluster for a real-ish analysis? ISTR it's pretty fast against gtdb-reps.

If benchmark is slow, consider parallelizing reading. It was originally done in #234 but removed for simplicity.

`pairwise` files can be millions of lines long. Would it be faster to read them in parallel, store the results in an `edges` vector, and then add nodes/edges sequentially? Note that we would probably need to either (1) store all edges, including those that do not pass the threshold, or (2) after building the graph from edges, add nodes from `names_to_node` that are not already in the graph, to preserve singletons.
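A minimal Python sketch of option 2, to make the idea concrete. It is not the plugin's Rust code: the column names `query_name` and `match_name`, the helper names, and the chunking scheme are all assumptions for illustration, and it assumes no quoted field spans multiple lines. Chunks are parsed by a thread pool, names are collected from every row (including rows below threshold), and the graph is then built sequentially with a small union-find.

```python
import csv
from concurrent.futures import ThreadPoolExecutor


def parse_chunk(lines, header, column, threshold):
    """Parse one chunk of CSV lines; return (names seen, edges passing threshold)."""
    names, edges = set(), []
    for row in csv.DictReader(lines, fieldnames=header):
        q, m = row["query_name"], row["match_name"]  # assumed column names
        names.update((q, m))  # remember every name, even below threshold
        if float(row[column]) >= threshold:
            edges.append((q, m))
    return names, edges


def cluster_from_lines(lines, column="average_containment_ani",
                       threshold=0.95, chunk_size=2, workers=4):
    header = next(csv.reader([lines[0]]))
    body = lines[1:]
    chunks = [body[i:i + chunk_size] for i in range(0, len(body), chunk_size)]

    all_names, all_edges = set(), []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda c: parse_chunk(c, header, column, threshold), chunks)
        for names, edges in results:
            all_names |= names
            all_edges.extend(edges)

    # Sequential graph build. Every name starts as its own node,
    # so names without a passing edge survive as singletons.
    parent = {n: n for n in all_names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in all_edges:
        parent[find(a)] = find(b)

    clusters = {}
    for n in all_names:
        clusters.setdefault(find(n), set()).add(n)
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


lines = [
    "query_name,match_name,average_containment_ani",
    "a,b,0.99",
    "b,c,0.96",
    "c,d,0.50",
    "e,e,1.0",
]
print(cluster_from_lines(lines))
```

Python threads will not speed up CSV parsing much because of the GIL, so this only shows the shape of the approach: parallel parse into an edges list, then a single-threaded graph build that keeps singletons.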


@mr-eyes

This PR adds a new command, `cluster`, that can be used to cluster the output from `pairwise` and `multisearch`.

`cluster` uses `rustworkx-core` (which internally uses `petgraph`) to build a graph, adding edges between nodes when the similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by `pairwise` or `multisearch`, and will add all nodes to the graph to preserve singleton 'clusters' in the output.

`cluster` outputs two files:

1. cluster identities file: `Component_X, name1;name2;name3...`
2. cluster size histogram `cluster_size, count`

context for some things I tried:

- try using petgraph directly and removing rustworkx dependency
  > nope, `rustworkx-core` adds `connected_components` that returns the connected components, rather than just the number of connected components. Could reimplement if `rustworkx-core` brings in a lot of deps
- try using 'extend_with_edges' instead of add_edge logic.
  > nope, only in `petgraph`

**Punted Issues:**

- develop clustering visualizations (ref @mr-eyes kSpider/dbretina work). Optionally output dot file of graph? (#248)
- enable updating clusters, rather than always regenerating from scratch (#249)
- benchmark `cluster` (#247)
  > `pairwise` files can be millions of lines long. Would it be faster to parallel read them, store them in an `edges` vector, and then add nodes/edges sequentially? Note that we would probably need to either 1. store all edges, including those that do not pass threshold, or 2. after building the graph from edges, add nodes from `names_to_node` that are not already in the graph to preserve singletons.

Related issues:

* #219
* sourmash-bio/sourmash#2271
* sourmash-bio/sourmash#700
* sourmash-bio/sourmash#225
* sourmash-bio/sourmash#274

---------

Co-authored-by: C. Titus Brown <titus@idyll.org>
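For readers unfamiliar with the approach, here is a rough Python sketch of threshold clustering via connected components, plus the two output shapes described above. It is an illustration only: the real command is written in Rust on top of `rustworkx-core`, and the function names, the zero-based `Component_` numbering, and the exact row formatting here are assumptions.

```python
from collections import Counter, deque


def connected_components(names, edges):
    """Breadth-first search over an adjacency map; singletons are kept."""
    adj = {n: set() for n in names}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    seen, comps = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nb in sorted(adj[node]):
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        comps.append(sorted(comp))
    return comps


def format_outputs(comps):
    """Return (cluster identity rows, cluster size histogram rows)."""
    cluster_rows = [f"Component_{i},{';'.join(c)}" for i, c in enumerate(comps)]
    hist = Counter(len(c) for c in comps)
    size_rows = [f"{size},{count}" for size, count in sorted(hist.items())]
    return cluster_rows, size_rows


names = ["g1", "g2", "g3", "g4"]
comparisons = [("g1", "g2", 0.97), ("g2", "g3", 0.90)]
threshold = 0.95
edges = [(a, b) for a, b, sim in comparisons if sim >= threshold]

comps = connected_components(names, edges)
cluster_rows, size_rows = format_outputs(comps)
print(cluster_rows)
print(size_rows)
```

Because every name is added as a node before any edge, `g3` and `g4` come out as one-member components instead of disappearing from the output.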

bluegenes · 2024-02-28T00:48:41Z

🚀 5 seconds on gtdb-rs214-reps with average_containment_ani default threshold (0.95)

I used 16 threads but %CPU was 123% (which makes sense, since cluster is not actually parallelized)

generating clusters for comparisons in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv' using 16 threads
...clustering is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.clusters.csv'
                       cluster counts in 'gtdb-rs214-reps.k31.pairwise-ani-all.clusters.sizes.csv'
        Command being timed: "sourmash scripts cluster gtdb-rs214-reps.k31.pairwise-ani-all.csv -o gtdb-rs214-reps.k31.pairwise-ani-all.clusters.csv --similarity-column average_containment_ani --cluster-sizes gtdb-rs214-reps.k31.pairwise-ani-all.clusters.sizes.csv"
        User time (seconds): 4.03
        System time (seconds): 2.07
        Percent of CPU this job got: 123%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.95
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 109292
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1
        Minor (reclaiming a frame) page faults: 36690
        Voluntary context switches: 3028
        Involuntary context switches: 424
        Swaps: 0
        File system inputs: 0
        File system outputs: 3264
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
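
As a sanity check on the 123% figure, the "Percent of CPU" line can be reproduced from the other numbers in the report as (user + system) / elapsed. The sketch below assumes the percentage is truncated rather than rounded, which matches the values reported in this thread; the helper names are made up for illustration.

```python
def elapsed_to_seconds(text):
    """Convert an 'h:mm:ss' or 'm:ss' elapsed string to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def cpu_percent(user, system, elapsed):
    """CPU time used divided by wall time, as a truncated percentage."""
    return int(100 * (user + system) / elapsed_to_seconds(elapsed))


# Values copied from the report above.
print(cpu_percent(4.03, 2.07, "0:04.95"))  # 123
```

A value just above 100% means roughly one core was busy for the whole run, which is what you would expect from a step that is not parallelized, regardless of the 16 threads requested.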

doesn't change much for a lowered threshold (0.8)

generating clusters for comparisons in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv' using 16 threads
...clustering is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.clusters0.8.csv'
                       cluster counts in 'None'
        Command being timed: "sourmash scripts cluster gtdb-rs214-reps.k31.pairwise-ani-all.csv -o gtdb-rs214-reps.k31.pairwise-ani-all.clusters0.8.csv --similarity-column average_containment_ani --threshold 0.8"
        User time (seconds): 3.76
        System time (seconds): 1.87
        Percent of CPU this job got: 125%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.47
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 126164
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1
        Minor (reclaiming a frame) page faults: 41457
        Voluntary context switches: 3204
        Involuntary context switches: 716
        Swaps: 0
        File system inputs: 0
        File system outputs: 1968
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

bluegenes · 2024-02-28T00:56:07Z

running `pairwise` to build the `cluster` input file took much longer, of course: roughly 2 to 2.5 hours for gtdb-rs214-reps using 16 threads

No ANI, no write-all:

DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise.csv'
        Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip -o gtdb-rs214-reps.k31.pairwise.csv"
        User time (seconds): 143454.56
        System time (seconds): 136.08
        Percent of CPU this job got: 1562%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:33:08
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4573808
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 17555
        Minor (reclaiming a frame) page faults: 79509579
        Voluntary context switches: 1134599
        Involuntary context switches: 1262188
        Swaps: 0
        File system inputs: 4486144
        File system outputs: 412944
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

ANI, no write-all:

DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise-ani.csv'
        Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip --ani -o gtdb-rs214-reps.k31.pairwise-ani.csv"
        User time (seconds): 143272.02
        System time (seconds): 80.51
        Percent of CPU this job got: 1562%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:32:51
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4573456
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 32007457
        Voluntary context switches: 1181205
        Involuntary context switches: 1298635
        Swaps: 0
        File system inputs: 0
        File system outputs: 528008
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

ANI + write-all:

DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv'
        Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip --write-all --ani -o gtdb-rs214-reps.k31.pairwise-ani-all.csv"
        User time (seconds): 107618.34
        System time (seconds): 245.74
        Percent of CPU this job got: 1551%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:55:51
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4575736
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 89
        Minor (reclaiming a frame) page faults: 63129113
        Voluntary context switches: 1118873
        Involuntary context switches: 1699826
        Swaps: 0
        File system inputs: 13792
        File system outputs: 547384
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

ctb · 2024-02-28T13:51:39Z

it seems a little weird that ANI + write-all took half an hour less wall time, right? But that could just be fluctuations on the computer running things.

bluegenes · 2024-03-21T00:12:17Z

benchmarking pairwise using GTDB-rs214 reps on 64 threads for comparison with multisearch (#89)

85205 x 85205 pairwise comparisons (3.6 billion non-self, non-redundant comparisons) in 44m with 64 threads (and 4.56 GB RAM).

DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv'
        Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip --write-all --ani -o gtdb-rs214-reps.k31.pairwise-ani-all.csv"
        User time (seconds): 149275.64
        System time (seconds): 54.49
        Percent of CPU this job got: 5612%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 44:20.68
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4566188
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 18246
        Minor (reclaiming a frame) page faults: 6700145
        Voluntary context switches: 1193450
        Involuntary context switches: 1579877
        Swaps: 0
        File system inputs: 4610752
        File system outputs: 547336
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
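
The comparison count follows from the number of unordered, non-self pairs among n sketches, n(n-1)/2. A short check of that arithmetic, along with the implied throughput for this 64-thread run (variable names are illustrative; the inputs are copied from the report above):

```python
def n_pairs(n):
    """Number of unordered pairs of distinct items among n."""
    return n * (n - 1) // 2


n_sketches = 85205
pairs = n_pairs(n_sketches)
print(pairs)  # 3629903410, matching "Processed 3629903410 comparisons"

elapsed_seconds = 44 * 60 + 20.68  # wall clock time 44:20.68
rate = pairs / elapsed_seconds
print(f"{rate / 1e6:.2f} million comparisons per second")
```

That works out to about 1.36 million comparisons per second across all threads for this particular run and machine.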

bluegenes mentioned this issue Feb 27, 2024

MRG: Add graph-based clustering #234

Merged

bluegenes changed the title ~~benchmark cluster~~ benchmark pairwise --> cluster Feb 28, 2024
