
[MRG] Oxidize ZipStorage #1909

Merged on Apr 8, 2022 (12 commits)
Conversation

@luizirber (Member) commented Mar 30, 2022

Expose the (read-only for now) Rust ZipStorage and use it instead of the regular (read-write) ZipStorage.

If writing is needed, fall back to the current ZipStorage (a stopgap while the Rust one doesn't support writing).

This PR also removes Python 3.7 support (which we actually dropped in #1839), and adds Python 3.9 and Python 3.10 to the setup.cfg classifiers.

codecov bot commented Mar 30, 2022

Codecov Report

Merging #1909 (1784ffb) into latest (1229dc1) will decrease coverage by 0.29%.
The diff coverage is 65.34%.

@@            Coverage Diff             @@
##           latest    #1909      +/-   ##
==========================================
- Coverage   83.23%   82.93%   -0.30%     
==========================================
  Files         124      125       +1     
  Lines       13606    13755     +149     
  Branches     1863     1877      +14     
==========================================
+ Hits        11325    11408      +83     
- Misses       2014     2075      +61     
- Partials      267      272       +5     
Flag Coverage Δ
python 90.96% <88.29%> (-0.06%) ⬇️
rust 65.13% <45.37%> (-0.68%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/core/src/ffi/mod.rs 0.00% <ø> (ø)
src/core/src/ffi/storage.rs 0.00% <0.00%> (ø)
src/core/src/storage.rs 83.01% <71.15%> (-6.86%) ⬇️
src/sourmash/sbt_storage.py 89.15% <86.25%> (-1.80%) ⬇️
src/core/tests/storage.rs 100.00% <100.00%> (ø)
src/sourmash/index/__init__.py 96.72% <100.00%> (-0.01%) ⬇️
src/sourmash/sbt.py 85.37% <100.00%> (+0.03%) ⬆️
src/sourmash/sourmash_args.py 93.40% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@luizirber changed the title from [WIP] Oxidize ZipStorage to [MRG] Oxidize ZipStorage on Apr 2, 2022
@luizirber (Member, Author) commented:

I was going to start adding more changes to this PR, but it's probably better to merge this now and improve later, especially because:

  • It has an asv benchmark for ZipStorage loading, which makes it easier to check perf in other PRs. It has a particularly egregious pattern for the Rust ZipStorage (the path being loaded from storage is the last one, and in Rust it is doing a linear search to find the entry to load from the zip file), so optimizations in that area will be very visible =]
  • It is already faster in the benchmark? I was a bit surprised about this one
Benchmark                                                   This PR   latest
benchmarks.TimeZipStorageSuite.time_load_from_zipstorage   258±7ms   611±0.9ms

Before merging:

  • Can someone take it for a ride with large zipfiles? I did sourmash search GCA_016787465.1.sig genbank-2022.03-protozoa-k31.zip for latest and for this PR, and they take about the same time.

Ready for review, @sourmash-bio/devs

@mr-eyes self-requested a review on April 3, 2022 01:26
@mr-eyes self-assigned this on Apr 3, 2022
@ctb (Contributor) commented Apr 3, 2022

I did a sourmash gather with a cut-down version of SRR606249 against GTDB RS202 genomic-reps; it worked great on my laptop!

I will see if I can do some benchmarking tomorrow. But overall this PR looks very good, at least on the Python side.

@ctb (Contributor) commented Apr 3, 2022

Notes to maybe add to the top description:

  • this removes Python 3.7 support, and adds Python 3.9 and Python 3.10 to the setup.cfg classifiers;

Notes from a review of the Python code:

  • a lot of the Python code for ZipStorage is no longer covered by the tests. Can we adjust the code and/or tests to still test it, perhaps using a test fixture to do both?

Everything looks good to me, other than that! I'd appreciate a review of the Rust code tho, @mr-eyes :)

One other thought: I'm pretty sure that with manifests this:

in Rust it is doing a linear search to find the entry to load from the zip file

will not be a problem :). But in any case the benchmark suggests that it's already not a problem even when there are no manifests 🎉

@mr-eyes (Member) commented Apr 3, 2022

Everything looks good to me, other than that! I'd appreciate a review of the Rust code tho, @mr-eyes :)

Working on it =]

@ctb (Contributor) commented Apr 3, 2022

Everything looks good to me, other than that! I'd appreciate a review of the Rust code tho, @mr-eyes :)

Working on it =]

No hurry! And please wait to approve until the issues I noted are addressed ;)

@luizirber (Member, Author) commented Apr 3, 2022

a lot of the Python code for ZipStorage is no longer covered by the tests. Can we adjust the code and/or tests to still test it, perhaps using a test fixture to do both?

Indeed. I will make a fixture for the zipstorage tests that sets up the mode (to r and w), so at least it checks both impls for now.

One other thought: I'm pretty sure that with manifests this:

in Rust it is doing a linear search to find the entry to load from the zip file

will not be a problem :). But in any case the benchmark suggests that it's already not a problem even when there are no manifests 🎉

So... it is unrelated to manifests.

For gory details:

piz, the library I'm using for reading zip files, has some lifetime annotations that make it pretty hard to store the archive in a struct (ZipStorage, in this case) while also keeping track of the source of the zip data. To explain, here is the definition:

pub struct ZipStorage<'a> {
    mapping: Option<memmap2::Mmap>,
    archive: Option<piz::ZipArchive<'a>>,
}

mapping is a mmap-ed file, and archive is built from this mapping.

In the from_slice constructor that is fine, because ownership of the data comes from outside the function (before calling the function, the caller needs to prepare the slice, be it from mmap or any other valid Rust slice).
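(Roughly, a sketch of what from_slice can look like — simplified error handling following the snippet below, not necessarily the exact code in this PR:)

impl<'a> ZipStorage<'a> {
    pub fn from_slice(mapping: &'a [u8]) -> Result<ZipStorage<'a>, Error> {
        // the borrow checker is happy here: `archive` borrows from the
        // caller-owned slice, not from another field of the struct being built
        let archive = piz::ZipArchive::new(mapping)
            .map_err(|_| StorageError::EmptyPathError)?;
        Ok(ZipStorage {
            mapping: None,
            archive: Some(archive),
        })
    }
}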

For new, which takes a file path, I'm opening the file with mmap and creating an archive, but I can't really create a proper ZipStorage because there is self-referential ownership going on: the archive's lifetime is tied to the lifetime of the mapping field in the same struct.

The consequence is a lot of checking whether mapping or archive is None. For example, in list_sbts we have

    pub fn list_sbts(&self) -> Result<Vec<String>, Error> {
        let sbts_in_archive = |archive: &'_ piz::ZipArchive<'_>| {
            // <snip> do something interesting with archive
        };

        if let Some(archive) = &self.archive {
            sbts_in_archive(archive)
        } else {
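            // slow path: rebuild the archive from the mmap on every call (see below)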
            let archive = piz::ZipArchive::new((&self.mapping.as_ref()).unwrap())
                .map_err(|_| StorageError::EmptyPathError)?;
            sbts_in_archive(&archive)
        }
    }

and that else branch is the one that kills a lot of performance: for every single operation it needs to parse the zip file header to get a valid archive, which is then discarded (only to be rebuilt the next time the method is called).

To make things worse, again due to lifetime shenanigans, it is hard to keep the highly useful file tree that piz builds for quick lookup, so it also needs to be rebuilt (or, what I'm doing now, I just do a linear search over all entries of the zipfile).

What I'm saying is: the current implementation is INCREDIBLY inefficient, but it is fixable (probably with ouroboros or a better internal API for initializing fields) without changing the public API or FFI. That's why I set up benchmarks on the Python side, and I was surprised that this wildly inefficient impl is faster.

(Why choose piz then? It is the best crate for doing parallel reads from the zipfile with Rayon, so even if it is an ugly impl now, it is the best chance of having it work well with greyhound.)
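(For the curious, a rough sketch of the ouroboros route mentioned above. This is not the code in this PR: the struct name SelfRefZipStorage is hypothetical, and the exact attributes and generated builder names follow the ouroboros docs and may differ between crate versions.)

use ouroboros::self_referencing;

#[self_referencing]
struct SelfRefZipStorage {
    mapping: memmap2::Mmap,
    // `archive` is allowed to borrow from `mapping`, which lives in the same struct
    #[borrows(mapping)]
    #[covariant]
    archive: piz::ZipArchive<'this>,
}

fn open(path: &std::path::Path) -> Result<SelfRefZipStorage, Box<dyn std::error::Error>> {
    let file = std::fs::File::open(path)?;
    let mapping = unsafe { memmap2::Mmap::map(&file)? };
    // the generated TryBuilder hands the closure a reference to the `mapping`
    // field that already lives inside the struct, so the archive can be built
    // once and kept around instead of being re-parsed on every call
    Ok(SelfRefZipStorageTryBuilder {
        mapping,
        archive_builder: |mapping| piz::ZipArchive::new(mapping),
    }
    .try_build()?)
}

Methods can then use the generated accessors (e.g. borrow_archive) instead of rebuilding the archive on each call.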

@mr-eyes (Member) commented Apr 4, 2022

  • Can someone take it for a ride with large zipfiles? I did sourmash search GCA_016787465.1.sig genbank-2022.03-protozoa-k31.zip for latest and for this PR, and they take about the same time.

I benchmarked latest vs this PR by searching GCF_000005845.2_ASM584v2_genomic.fna.gz.sig on gtdb-rs202.genomic-reps.k31.zip

Latest

Wall time: 2:16.25
Memory: 197 MB

PR

Wall time: 9:33.98
Memory: 1.4 GB

@luizirber (Member, Author) commented:

What I'm saying is: the current implementation is INCREDIBLY inefficient, but it is fixable (probably with ouroboros or a better internal API for initializing fields) without changing the public API or FFI. That's why I set up benchmarks on the Python side, and I was surprised that this wildly inefficient impl is faster.

self-referential crimes successfully committed with ouroboros!

I benchmarked latest vs this PR by searching GCF_000005845.2_ASM584v2_genomic.fna.gz.sig on gtdb-rs202.genomic-reps.k31.zip

Wall time: 9:33.98
Memory: 1.4 GB

Can you please rerun this, @mr-eyes ?

A note on memory: since this is using mmap, memory reporting can be a bit wonky. Are you tracking RSS or virtual memory?

@mr-eyes (Member) commented Apr 5, 2022

@luizirber That's quite an improvement in run time!

Elapsed (wall clock) time (h:mm:ss or m:ss): 2:18.26
Maximum resident set size (kbytes): 1474388 (Same)

The memory here is the maximum actual physical memory assigned to the whole process.

(Review thread on the setup.cfg classifiers:)

Programming Language :: Python :: 3.8
Programming Language :: Python :: 3.9

@mr-eyes (Member) commented Apr 6, 2022

Does that mean sourmash is installable on a py3.10/py3.9 env but sourmash itself does not support >py3.8 in the dev code?

A Contributor replied:

The tags are inclusive, so it means it is being advertised as supporting 3.8, 3.9, and 3.10.


(Review thread on the Rust FFI code:)

let files = storage.filenames()?;

// FIXME: use the ForeignObject trait, maybe define new method there...

A Member commented:

I would really appreciate the FIXME/TODO messages elaborating a bit more — like why, or a little hint :)

@luizirber (Member, Author) replied:

This function is pretty much copy-and-pasted from https://github.com/sourmash-bio/sourmash/blob/1229dc1d6664004e580e3b80d4c13a8ae12a19be/src/core/src/ffi/signature.rs#L284L292

ForeignObject is defined here, and collects some common functionality for working with the FFI layer (going from a Rust object to a raw pointer, or from a raw pointer back to a Rust object). What is missing there is working with arrays of objects (like this function here, which returns a vector of SourmashStr).

Most of the decisions/designs in the FFI layer in sourmash come from symbolic, and you can see other ways to use ForeignObject over there.
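(For context, a minimal sketch of what a ForeignObject-style trait can look like. This is modeled on symbolic's version, not a copy of the trait in src/core/src/ffi, which may differ in method names and details:)

pub trait ForeignObject: Sized {
    type RustObject;

    /// Move a Rust value to the heap and hand an opaque pointer to the FFI caller.
    fn from_rust(object: Self::RustObject) -> *mut Self {
        Box::into_raw(Box::new(object)) as *mut Self
    }

    /// Borrow the Rust value behind a pointer received over FFI.
    ///
    /// # Safety
    /// `pointer` must come from `from_rust` and still be live.
    unsafe fn as_rust<'a>(pointer: *const Self) -> &'a Self::RustObject {
        &*(pointer as *const Self::RustObject)
    }

    /// Take ownership back from the FFI caller (used by `*_free` functions).
    ///
    /// # Safety
    /// `pointer` must come from `from_rust` and not be used afterwards.
    unsafe fn into_rust(pointer: *mut Self) -> Box<Self::RustObject> {
        Box::from_raw(pointer as *mut Self::RustObject)
    }
}

The missing piece mentioned above would be a similar helper for handing arrays/vectors of such objects across the FFI boundary.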

@mr-eyes (Member) left a review comment:

It definitely looks so great and amazing to me!

@ctb (Contributor) commented Apr 6, 2022

@luizirber That's quite an improvement in run time!

Elapsed (wall clock) time (h:mm:ss or m:ss): 2:18.26
Maximum resident set size (kbytes): 1474388 (Same)

The memory here is the maximum actual physical memory assigned to the whole process.

With latest it was -

Wall time: 2:16.25
Memory: 197 MB

Does that mean that the main performance-related outcome of this PR is to keep wall time the same and increase memory usage by a factor of 7?

(That's fine 'cause I understand this PR enables other things, just want to make sure I'm getting it right.)

@luizirber (Member, Author) commented:

Does that mean that the main performance-related outcome of this PR is to keep wall time the same and increase memory usage by a factor of 7?

Yes. Sort of. mmap makes memory complicated to measure, because now the OS is deciding how much to pull from disk. It also works with less memory (the mmap-ed pages are backed by the file on disk and can be reloaded, so the OS can discard them if it decides to), but when you measure it, usage looks like way more (just because the OS will use more memory if it is available and otherwise unused). Concretely, if you measure the heap size (the amount of memory that the process actually allocated for its own use), this PR and latest are the same.

(This is also something to keep in mind any time you compare benchmarks between software that uses mmap and software that doesn't =P)

I checked this behavior on Linux; I'm curious how it works on macOS (especially on M1).

(That's fine 'cause I understand this PR enables other things, just want to make sure I'm getting it right.)

There is another approach that keeps the current ZipStorage in Python but still allows it to be used from Rust. In #1526 I started adding ._as_rust() methods that can convert the Python object into an equivalent Rust object:
https://github.com/sourmash-bio/sourmash/pull/1526/files#diff-51443303a0e6c38023d8977d891df4ae18dc06a2431d29338fa8648a042132bfR156-R170
For ZipStorage it would have to flush the file (to guarantee any in-memory content is in the on-disk file) and then pass the path to the Rust ZipStorage and open it there. I can do that now, since I understand where zip file writing is done on the Python side (which needs this flushing), and it would allow passing the ZipStorage into Rust when needed.

@luizirber (Member, Author) commented:

Yes. Sort of. mmap makes memory complicated to measure, because now the OS is deciding how much to pull from disk. It also works with less memory (the mmap-ed pages are backed by the file on disk and can be reloaded, so the OS can discard them if it decides to), but when you measure it, usage looks like way more (just because the OS will use more memory if it is available and otherwise unused). Concretely, if you measure the heap size (the amount of memory that the process actually allocated for its own use), this PR and latest are the same.

I tested this hypothesis by creating a docker image from this branch and running it in a memory-limited container:

$ nix build .#docker
$ docker load < result
$ docker run -it -m 250mb -v /data:/data:ro -v /scratch:/scratch:ro sourmash:4.3.0 /bin/sourmash search /data/wort/wort-genomes/sigs/GCA_016787465.1.sig /scratch/databases/genbank-2022.03-protozoa-k31.zip

and it finished in about the same time (58s) as previous tests. RSS outside the container was 1.7GB.

Meaning: yes, the OS will avoid purging pages from memory in case they are needed again, but if you have the memory available that doesn't matter much in practice. The looming question is how this interacts with job schedulers in HPC clusters, because they might try to kill the job.

@ctb (Contributor) commented Apr 8, 2022

cool - all sounds good to me, if you want to merge ;)
