some simple benchmarking of `sourmash gather` on GTDB zipfiles/SBTs #1530

ctb · 2021-05-17T13:30:24Z

While writing a blog post about the sourmash v4.1 release, I got curious about the practical implications of --linear/--no-linear and --prefetch/--no-prefetch, so I ran the following benchmark script and recorded the output. The benchmark script and raw output are at the bottom.

The query signature here was a merge of four signatures that were present in the database, so gather would do four iterations.

Summary:

Zipfile collection

| | Time (s) | Memory (MB) |
| --- | --- | --- |
| no-linear/prefetch | 207 | 81 |
| linear/prefetch | 205 | 81 |
| no-linear/no-prefetch | 811 | 87 |
| linear/no-prefetch | 802 | 86 |

Indexed zipfile (SBT):

| | Time (s) | Memory (MB) |
| --- | --- | --- |
| no-linear/prefetch | 10 | 215 |
| linear/prefetch | 177 | 1502 |
| no-linear/no-prefetch | 22 | 214 |
| linear/no-prefetch | 187 | 1505 |

conclusions

so I think I understand almost everything here, which is good, since I wrote a lot of the code 😆

for the zipfile collection, four passes were needed if prefetch wasn't used, so the time was ~4x for --no-prefetch;
for the zipfile collection, --no-linear and --linear take essentially the same time and memory;
for the SBT zip, --linear is way slower than using the index, of course!
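
As a quick sanity check on the ratios claimed above, here is a minimal sketch that recomputes them from the summary tables. The `ratio` helper is a name made up for this example; the numbers are copied from the tables.

```shell
# ratio A B: print A/B to one decimal place (hypothetical helper).
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Zipfile collection: --no-prefetch vs --prefetch (seconds from the table).
echo "zipfile, no-linear: $(ratio 811 207)x"
echo "zipfile, linear:    $(ratio 802 205)x"

# SBT: --linear vs --no-linear (seconds from the table).
echo "SBT, prefetch:      $(ratio 177 10)x"
echo "SBT, no-prefetch:   $(ratio 187 22)x"
```

Both zipfile rows come out at about 3.9x, which matches the four-pass explanation, and the SBT rows come out at 17.7x and 8.5x.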

but the two weird results are for the SBT:

linear/no-prefetch and linear/prefetch are almost the same? no-prefetch should require multiple passes... is it maybe that all the signatures are being loaded in the first time, so the later passes don't need to reload them?
and why is there so much more memory usage with --linear than with --no-linear?

so my hypothesis (a theory we can test! :dora:) is that the SBT .signatures() method is keeping all the sigs in memory. The puzzling thing is that the memory usage is so high for that - maybe it's keeping the tree in memory, too, or something?

Anyway, the big conclusions are the obvious ones and also reflect the defaults for sourmash:

--no-linear --prefetch is generally best;
use --prefetch by default;
if you want low memory, use zipfile collections. if you want speed, use an indexed database.

script and raw output

# bench.sh
set -x
set -e
# all four combinations with a zipfile (no index)
/usr/bin/time -v sourmash gather --no-linear --prefetch out.sig gtdb-rs202.genomic-reps.k31.zip
/usr/bin/time -v sourmash gather --linear --prefetch out.sig gtdb-rs202.genomic-reps.k31.zip
/usr/bin/time -v sourmash gather --no-linear --no-prefetch out.sig gtdb-rs202.genomic-reps.k31.zip
/usr/bin/time -v sourmash gather --linear --no-prefetch out.sig gtdb-rs202.genomic-reps.k31.zip

# all four combinations with an SBT (indexed)
/usr/bin/time -v sourmash gather --no-linear --prefetch out.sig gtdb-rs202.genomic-reps.k31.sbt.zip
/usr/bin/time -v sourmash gather --linear --prefetch out.sig gtdb-rs202.genomic-reps.k31.sbt.zip
/usr/bin/time -v sourmash gather --no-linear --no-prefetch out.sig gtdb-rs202.genomic-reps.k31.sbt.zip
/usr/bin/time -v sourmash gather --linear --no-prefetch out.sig gtdb-rs202.genomic-reps.k31.sbt.zip
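
The eight commands above follow a regular pattern, so they could also be generated with a loop. This is only a sketch: `bench_commands` is a hypothetical helper name, and it prints the commands rather than running them, so the list can be checked first. The flags and filenames are the ones from bench.sh.

```shell
# bench_commands DB: print the four gather invocations for one database,
# in the same order as bench.sh (hypothetical helper, dry run only).
bench_commands() {
    db="$1"
    for prefetch in --prefetch --no-prefetch; do
        for linear in --no-linear --linear; do
            echo "/usr/bin/time -v sourmash gather $linear $prefetch out.sig $db"
        done
    done
}

# all four combinations with a zipfile (no index)
bench_commands gtdb-rs202.genomic-reps.k31.zip
# all four combinations with an SBT (indexed)
bench_commands gtdb-rs202.genomic-reps.k31.sbt.zip
```

To actually run the benchmark, the printed lines could be saved to a file and executed with `sh -ex`, which gives the same behaviour as `set -x` and `set -e` in the original script.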

Raw output attached.

bench.txt


ctb · 2021-05-18T04:33:21Z

on the full SBT:

% /usr/bin/time -v sourmash gather --no-linear --prefetch out.sig gtdb-rs202.genomic.k31.sbt.zip
        User time (seconds): 63.27
        System time (seconds): 15.34
        Percent of CPU this job got: 20%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 6:14.66
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1020764
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 21
        Minor (reclaiming a frame) page faults: 614443
        Voluntary context switches: 45296
        Involuntary context switches: 735935
        Swaps: 0
        File system inputs: 5477584
        File system outputs: 129016
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

so ~60 seconds of user CPU time (but over 6 minutes of wall-clock time) and 1 GB to search 300k signatures?!
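
For pulling the headline numbers out of `/usr/bin/time -v` output, something like the following could work. It is a rough sketch: `summarize_time` is a made-up helper name, and it assumes the field labels look like the GNU time output shown above (with the resident set size reported in kbytes).

```shell
# summarize_time: read `/usr/bin/time -v` output on stdin and print
# CPU seconds (user + system), wall-clock seconds, and peak memory in GiB.
summarize_time() {
    awk '
        /User time \(seconds\)/   { user = $NF }
        /System time \(seconds\)/ { sys = $NF }
        /Elapsed \(wall clock\) time/ {
            # The value is the last field, formatted h:mm:ss or m:ss.
            n = split($NF, t, ":")
            wall = 0
            for (i = 1; i <= n; i++) wall = wall * 60 + t[i]
        }
        /Maximum resident set size/ { rss_kb = $NF }
        END {
            printf "cpu=%.0fs wall=%.0fs peak_rss=%.2fGiB\n",
                   user + sys, wall, rss_kb / 1048576
        }
    '
}

# Example with the relevant lines from the run above.
summarize_time <<'EOF'
        User time (seconds): 63.27
        System time (seconds): 15.34
        Elapsed (wall clock) time (h:mm:ss or m:ss): 6:14.66
        Maximum resident set size (kbytes): 1020764
EOF
# prints: cpu=79s wall=375s peak_rss=0.97GiB
```

Note that `/usr/bin/time` writes its report to stderr, so a real run would need something like `2> run.log` before feeding the log to the helper. The gap between CPU time and wall-clock time here is consistent with the 20% CPU figure in the report.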

ctb · 2021-05-18T04:34:28Z

full genomic SBT is 15 GB for the 300k sigs.

ctb · 2021-05-18T13:39:40Z

ran

/usr/bin/time -v sourmash gather --no-linear --prefetch out.sig gtdb-rs202.genomic.k31.sbt.zip
/usr/bin/time -v sourmash gather --linear --prefetch out.sig gtdb-rs202.genomic.k31.sbt.zip
/usr/bin/time -v sourmash gather --no-linear --no-prefetch out.sig gtdb-rs202.genomic.k31.sbt.zip
/usr/bin/time -v sourmash gather --linear --no-prefetch out.sig gtdb-rs202.genomic.k31.sbt.zip

this is on ~280k sigs, GTDB all.

ctb · 2021-05-18T13:41:01Z

Results on Really Big files (15 GB .sbt.zip for ~280k all-GTDB)

| | Time | Memory |
| --- | --- | --- |
| no-linear/prefetch | 4m 56s | 1 GB |
| linear/prefetch | 20m | 8.8 GB |
| no-linear/no-prefetch | 1h 16m | 1 GB |
| linear/no-prefetch | 20m 33s | 8.8 GB |
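
To compare these rows directly, the durations can be converted to seconds first. A small sketch, where `to_seconds` is a hypothetical helper and the durations are copied from the table:

```shell
# to_seconds "1h 16m": convert an h/m/s duration string to seconds
# (hypothetical helper; handles the formats used in the table).
to_seconds() {
    echo "$1" | awk '{
        total = 0
        for (i = 1; i <= NF; i++) {
            value = $i + 0                  # numeric prefix, e.g. "16m" -> 16
            unit = substr($i, length($i))   # last character: h, m, or s
            if (unit == "h") total += value * 3600
            else if (unit == "m") total += value * 60
            else total += value
        }
        print total
    }'
}

# Slowdown of each configuration relative to no-linear/prefetch.
fast=$(to_seconds "4m 56s")
for t in "20m" "1h 16m" "20m 33s"; do
    awk -v label="$t" -v s="$(to_seconds "$t")" -v f="$fast" \
        'BEGIN { printf "%-8s %.1fx\n", label, s / f }'
done
```

On these numbers, no-linear/no-prefetch is roughly 15x slower than no-linear/prefetch, and both --linear runs are roughly 4x slower.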

ctb · 2021-05-21T14:26:12Z

I'll label this with FAQ and leave this here.

ctb · 2022-05-04T14:14:51Z

integrated into docs with #2025.

ctb mentioned this issue May 18, 2021

[MRG] unload data on iteration over SBT leaves #1534

Merged

bluegenes added the benchmarking label Jun 11, 2021

This was referenced Apr 29, 2022

benchmarks for different database formats. #1958

Closed

provide more comprehensive/useful database benchmarks #2014

Open

ctb closed this as completed May 4, 2022
