
parallel/pipelined extraction #208

Closed · wants to merge 3 commits

Conversation

@cosmicexplorer (Contributor) commented Jul 6, 2024

Problem

ZipArchive::extract() corresponds to the way most zip implementations perform the task, but it's single-threaded. This is appropriate under the assumptions imposed by Rust's Read and Seek traits: mutable access is necessary, so only one reader can extract file contents at a time. However, most Unix-like operating systems offer a pread() operation that reads at an explicit offset without mutating OS state like the file offset, so multiple threads can read from a single file handle at once. The Go programming language codifies this ability in its stdlib as io.ReaderAt.
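
The contract pread() provides looks roughly like this (a minimal sketch using the libc crate this PR adds; the file name and sizes are placeholders, and this is not the PR's actual src/read/split.rs code). Each call carries its own offset, so no shared cursor is mutated and several threads can read one handle concurrently:

    use std::fs::File;
    use std::os::unix::io::AsRawFd;
    use std::thread;

    /// Read exactly `buf.len()` bytes at `offset` without touching the fd's cursor.
    fn pread_exact(fd: libc::c_int, offset: u64, buf: &mut [u8]) -> std::io::Result<()> {
        let mut filled = 0;
        while filled < buf.len() {
            let n = unsafe {
                libc::pread(
                    fd,
                    buf[filled..].as_mut_ptr() as *mut libc::c_void,
                    buf.len() - filled,
                    (offset + filled as u64) as libc::off_t,
                )
            };
            if n < 0 {
                return Err(std::io::Error::last_os_error());
            } else if n == 0 {
                break; // EOF before the buffer was filled
            }
            filled += n as usize;
        }
        Ok(())
    }

    fn main() -> std::io::Result<()> {
        let file = File::open("archive.zip")?;
        let fd = file.as_raw_fd();
        // Two threads reading disjoint ranges of the same handle at once.
        thread::scope(|s| {
            for offset in [0u64, 4096] {
                s.spawn(move || {
                    let mut buf = vec![0u8; 4096];
                    pread_exact(fd, offset, &mut buf).unwrap();
                });
            }
        });
        Ok(())
    }

(std already wraps the same syscall as std::os::unix::fs::FileExt::read_at(); the raw libc form above is only meant to make the offset-per-call contract explicit.)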

Solution

This is a rework of #72 which avoids introducing unnecessary thread pools and creates all output file handles and containing directories up front. For large zips, we want to:

  • create output handles and containing directories up front,
  • split the input file handle into chunks to process the constituent file entries in parallel,
  • for large compressed entries, pipe their contents into a dedicated stream, so that i/o and decompression don't intermix and quick small entries later in the file aren't blocked behind them.

src/read/split.rs was created to cover pread() and related operations, while src/read/pipelining.rs holds the high-level logic that splits up entries and performs the pipelined extraction.
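
As a hypothetical sketch of the second bullet above (the Entry struct here is invented for illustration, and the PR itself still marks this step with a FIXME): split the entries, in file order, into contiguous runs of roughly equal compressed size, one run per reader thread:

    /// Invented for illustration: only the field the chunking step needs.
    struct Entry {
        compressed_size: u64,
    }

    /// Split `entries` (in file order) into at most `workers` contiguous runs
    /// of roughly equal total compressed size, so each thread pread()s a
    /// similar amount of input.
    fn chunk_entries(entries: &[Entry], workers: usize) -> Vec<&[Entry]> {
        assert!(workers > 0);
        let total: u64 = entries.iter().map(|e| e.compressed_size).sum();
        let target = total / workers as u64 + 1;
        let mut chunks = Vec::with_capacity(workers);
        let (mut start, mut acc) = (0usize, 0u64);
        for (i, e) in entries.iter().enumerate() {
            acc += e.compressed_size;
            if acc >= target && chunks.len() + 1 < workers {
                chunks.push(&entries[start..=i]);
                start = i + 1;
                acc = 0;
            }
        }
        if start < entries.len() {
            chunks.push(&entries[start..]);
        }
        chunks
    }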

Result

  • The parallelism feature was added to the crate to gate the newly added code + API.
  • A dependency on the libc crate was added for #[cfg(all(unix, feature = "parallelism"))] in order to make use of OS-specific functionality.
  • zip::read::split_extract() was added as a new external API to extract from a &ZipArchive<fs::File> when #[cfg(all(unix, feature = "parallelism"))].

Note that this does not handle symlinks yet, which I plan to add in a followup PR.
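
A hedged usage sketch of the new API (the exact signature, import path, and error type are my guesses from this description, and I'm assuming ExtractionParameters implements Default with its documented values):

    #[cfg(all(unix, feature = "parallelism"))]
    fn extract_in_parallel() -> Result<(), Box<dyn std::error::Error>> {
        use zip::read::{split_extract, ExtractionParameters};

        let file = std::fs::File::open("large.zip")?;
        let archive = zip::ZipArchive::new(file)?;
        // Takes a shared reference: no &mut needed, unlike extract().
        split_extract(
            &archive,
            std::path::Path::new("out/"),
            ExtractionParameters {
                decompression_threads: 12,
                ..Default::default()
            },
        )?;
        Ok(())
    }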

CURRENT BENCHMARK STATUS

On a Linux host (with splice() and optionally copy_file_range()), we get about a 6.5x speedup with 12 decompression threads:

> cargo bench --features parallelism -- extract
running 2 tests
test extract_basic           ... bench: 104,389,978 ns/iter (+/- 5,715,453) = 85 MB/s
test extract_split           ... bench:  16,274,974 ns/iter (+/- 1,530,257) = 546 MB/s

The performance should keep increasing with thread count, up to the number of available CPU cores (this run used a parallelism of 12 on my 16-core laptop). It also works on macOS, the BSDs, and other #[cfg(unix)] platforms.

github-advanced-security[bot] posted a series of code-scanning (devskim) notices, each flagging a TODO/FIXME comment left in the diff (in src/read.rs, src/unstable/read.rs, src/read/pipelining.rs, and src/types.rs); all were dismissed.
@cosmicexplorer force-pushed the pipelining branch 12 times, most recently from 5c99c65 to 94e6940 (July 11, 2024)
Further devskim notices (likewise dismissed) flagged similar TODO/FIXME markers in src/read/split.rs, around the FixedFile/InputFile/OutputFile traits, their pread()/pwrite() methods, and the MaybeUninit buffer helpers, as well as in the pipelining tests.
@cosmicexplorer force-pushed the pipelining branch 2 times, most recently from 6e730d0 to 2beba90 (July 12, 2024)
@cosmicexplorer changed the title from "pipelined extraction" to "parallel/pipelined extraction" Jul 14, 2024
@cosmicexplorer added the enhancement and rust labels Jul 14, 2024
@cosmicexplorer (Contributor, Author) commented:

@Pr0methean this is obviously quite a large change, but it's completely separate from any existing APIs and should hopefully remain that way. I was able to avoid adding any new dependencies except for libc, to make use of pread() and pipe(), and the external API changes are all gated behind #[cfg(feature = "parallelism")]. Please let me know if there are further types of benchmarks or test cases you would like me to add before you take a further look, or if you'd prefer to put it under crate::read::unstable for now to encourage experimentation. I'm quite enthusiastic that this seems to work on any #[cfg(unix)] platform and can be applied to a &ZipArchive<fs::File>, instead of requiring users to go through any really complex new API.

@Pr0methean (Member) commented:

I'll take a look at this once we're caught up on the smaller PRs.

- initial sketch of lexicographic trie for pipelining
- move path splitting into a submodule
- lex trie can now propagate entry data
- outline handle allocation
- mostly handle files
- mostly handle dirs
- clarify symlink FIXMEs
- do symlink validation
- extract writable dir setting to helper method
- modify args to handle allocation method
- handle allocation test passes
- simplify perms a lot
- outline evaluation
- handle symlinks
- BIGGER CHANGE! add EntryReader/etc
- make initial pipelined extract work
- fix file perms by writing them after finishing the file write
- support directory entries by unix mode as well
- impl split extraction
- remove dependency on reader refactoring
@cosmicexplorer (Contributor, Author) commented:

cc @matthewgapp: let me know if this addresses your needs for parallel extraction, and please feel free to ping others who might be interested in this as well. I know @sluongng was interested in functionality to modify entry names, e.g. stripping a prefix before extraction; I'm hoping to add functionality like that plus symlink support in followup changes, but first I would like to hear perf feedback and any further benchmarks I should add. Please feel free to leave comments here or on #193 for further work.

@sluongng commented:
Without much context on this PR, here is what I am trying to build:

In advanced build tools such as Bazel and Buck2, there is a common pattern: fetch a source archive from GitHub, unpack the archive, and use the source inside to build things. Example archives could be declared like this:

http_archive(
    name = "platforms",
    sha256 = "5308fc1d8865406a49427ba24a9ab53087f17f5266a7aabbfc28823f3916e1ca",
    urls = [
        "https://mirror.bazel.build/github.com/bazelbuild/platforms/releases/download/0.0.6/platforms-0.0.6.tar.gz",
        "https://github.com/bazelbuild/platforms/releases/download/0.0.6/platforms-0.0.6.tar.gz",
    ],
)

http_archive(
    name = "com_google_absl",
    sha256 = "987ce98f02eefbaf930d6e38ab16aa05737234d7afbab2d5c4ea7adbe50c28ed",
    strip_prefix = "abseil-cpp-20230802.1",
    urls = ["https://github.com/abseil/abseil-cpp/archive/refs/tags/20230802.1.tar.gz"],
)

Note that the archive could be in multiple different formats: zip, tar.gz, tar.zst, etc.

The build tool is then supposed to fetch the archive from one of the URLs, verify the archive's checksum, and extract it with the prefix trimmed. Traditionally this is done with the host machine's tar command.

I am trying to replicate this feature in https://github.com/sluongng/downloader/ with a few additional improvements:

  • Check if upstream URLs support multi-part download and fetch multiple parts in parallel.
  • Hedge the download request between multiple URLs.
  • Parallel archive extraction with prefix trimming.
  • Do not depend on the host's tar command.

I have been playing with the zip2 API for the last few days to get prefix trimming working. I hope that this PR, or part of it, can help me solve my problem more easily 🤗
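
For reference, prefix trimming can be sketched against the existing single-threaded zip2 API (as I understand it, independent of this PR; a pipelined version would presumably apply the same name rewrite while allocating output handles):

    use std::{fs, io, path::Path};
    use zip::ZipArchive;

    /// Extract `archive` into `dest`, dropping the leading `prefix`
    /// directory from every entry name (like Bazel's strip_prefix).
    fn extract_strip_prefix(
        archive: &mut ZipArchive<fs::File>,
        prefix: &str,
        dest: &Path,
    ) -> zip::result::ZipResult<()> {
        for i in 0..archive.len() {
            let mut entry = archive.by_index(i)?;
            // enclosed_name() rejects names that would escape the target dir.
            let Some(name) = entry.enclosed_name() else { continue };
            // Skip entries outside the prefix instead of erroring.
            let Ok(trimmed) = name.strip_prefix(prefix) else { continue };
            let out = dest.join(trimmed);
            if entry.is_dir() {
                fs::create_dir_all(&out)?;
            } else {
                if let Some(parent) = out.parent() {
                    fs::create_dir_all(parent)?;
                }
                io::copy(&mut entry, &mut fs::File::create(&out)?)?;
            }
        }
        Ok(())
    }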

@cosmicexplorer (Contributor, Author) commented:

@sluongng thanks so much! I asked for your input here specifically because it might be useful in determining the most appropriate way to introduce functionality like prefix trimming. I have introduced a struct ExtractionParameters for the newly added split_extract() method:

zip2/src/read/pipelining.rs, lines 700 to 738 at bf43e53:

/// Parameters to control the degree of parallelism used for extraction.
#[derive(Debug, Clone)]
pub struct ExtractionParameters {
    /// Number of threads used for decompression.
    ///
    /// Default value: 4.
    ///
    /// Note that multiple times this many threads will be spawned by [`split_extract()`] as
    /// part of the pipelined process. Only this many threads will be used to perform
    /// decompression in rust code, but other threads will be used to wait on I/O from
    /// the kernel.
    pub decompression_threads: usize,
    /// Size of buffer used to copy a decompressed entry into the corresponding output pipe.
    ///
    /// Default value: 1MB.
    pub decompression_copy_buffer_length: usize,
    /// Size of buffer used to copy stored entries into the output file.
    ///
    /// Used on non-Linux platforms without
    /// [`copy_file_range()`](https://www.gnu.org/software/libc/manual/html_node/Copying-File-Data.html),
    /// as well as on Linux when the input and output file handles are on separate devices.
    ///
    /// Default value: 1MB.
    pub file_range_copy_buffer_length: usize,
    /// Size of buffer used to splice contents from a pipe into an output file handle.
    ///
    /// Used on non-Linux platforms without [`splice()`](https://en.wikipedia.org/wiki/Splice_(system_call)).
    ///
    /// Default value: 1MB.
    #[cfg(not(target_os = "linux"))]
    pub splice_read_buffer_length: usize,
    /// Size of buffer used to splice contents from an input file handle into a pipe.
    ///
    /// Used on non-Linux platforms without [`splice()`](https://en.wikipedia.org/wiki/Splice_(system_call)).
    ///
    /// Default value: 1MB.
    #[cfg(not(target_os = "linux"))]
    pub splice_write_buffer_length: usize,
}

It's used for perf knobs right now, but could probably be extended to support other configuration, still without messing with the more stable API of the existing zip crate.
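
For example (again assuming a Default impl with the documented values; these particular numbers are invented), tuning the knobs looks like:

    let params = ExtractionParameters {
        decompression_threads: 8,
        file_range_copy_buffer_length: 256 * 1024,
        ..Default::default()
    };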

Regarding tar archives: I have been tossing around with @jonjohnsonjr the idea of a higher-level library for manipulating zip and tar archives, using some of the indexing tools he's been developing for tars/tgzs (https://github.com/jonjohnsonjr/targz) that make them seekable/splittable, like we can do with zip files in this PR. I suspect @Pr0methean is interested in keeping the zip crate focused on zip files, so it might make sense for me to take everything he isn't interested in maintaining and pull it out into that separate library for your purposes.

While I was able to avoid introducing too many extra dependencies (zero except for libc) to achieve split extraction in this PR, I am also totally ready for @Pr0methean to say this is too much new code to feel comfortable adding. That would be especially reasonable since one of his goals for this crate is safety and support for more features of the zip format, whereas I'm more interested in finding a common subset of zips and tars which allow us to do high-performance splitting and merging (see #193). If so, I would then work to figure out the minimal set of APIs we would want to expose from the zip crate to enable this sort of parallelized work in a separate crate. I created medusa-zip for this purpose previously, when the zip crate was unmaintained, and while I've been able to make a lot of changes to the zip crate itself already, this might be the point where we find it's time to break off.

@Pr0methean is a fantastic maintainer and is super focused on making this crate very robust, so I'm saying this out loud so we can figure out where to split responsibility for maintenance so nobody gets overwhelmed.

@cosmicexplorer force-pushed the pipelining branch 2 times, most recently from 9067c37 to 0081371 (July 17, 2024)
@cosmicexplorer (Contributor, Author) commented:

Requirements

Thinking about how to extract this kind of splitting (and all the libc stuff with pipe() and pread()) into a separate crate, I've sketched below the parts we'd need to expose.

In general, we would also need to expose methods of ZipFileData without actually exposing the ZipFileData struct itself, so that we could perform the work needed for lexicographic_entry_trie(): https://github.com/zip-rs/zip2/blob/9067c37078922578a8be395310feadab4045afe7/src/read/pipelining.rs#L126-L196

Implementation

I think this could all be satisfied with the following:

  1. Expose crate::spec::is_dir().
  2. Expose a method ZipArchive::get_reader(&self) -> &R which provides a shared reference to the internal Read handle.
  3. Expose a new method ZipArchive::iter_data_entries(), which:
    • returns an iterator of opaque ZipFileData handles (using the semi-hidden ArchiveEntry trait from #207, "refactor readers to use type parameters and not concrete vtables", to avoid leaking ZipFileData),
    • populates ZipFileData::data_start() for each entry as it's handed out by the iterator (or exposes an interface to let the user calculate this; the latter would be preferable, since it would avoid taking an exclusive lock on the ZipArchive file handle).
  4. Expose zip::unstable::read::construct_decompressing_reader() from #207, so that an external crate can decompress the contents of a zip entry, e.g. via a pipe, as is done in this PR:
    pub(crate) fn construct_decompressing_reader<R>(

Result

  • We wouldn't need to modify any of the existing ZipArchive API.
  • We would be able to demonstrate the performance improvements of split extraction in a separate crate, without introducing unsafe code or a libc dependency into the base zip crate.
  • We would be able to kick the tires on the unstable API from #207 and get user feedback before making the breaking change from ZipFile to ZipEntry.

@Pr0methean I have no problem putting in the work to make this change acceptable for the zip crate itself, but after the brainstorming above, I think it's likely to be more effective if we instead expose just enough functionality to implement this from a separate crate, especially given the reliance of this PR's approach on platform-specific unsafe syscalls from libc. If you agree on those broad strokes, I will instead begin to rework this change to expose just the functionality needed to implement split extraction from a separate crate. Does that sound like a good idea?
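
Purely as an illustration of how an external crate might combine items 2 through 4 above (every call here is a guess at a proposed API, not shipped code; std's FileExt stands in for the pread() machinery):

    use std::os::unix::fs::FileExt; // read_exact_at() is pread(2) under the hood

    // get_reader() and iter_data_entries() are the proposals above;
    // the entry handle and its methods follow the #207 sketch.
    fn read_compressed_in_parallel(archive: &zip::ZipArchive<std::fs::File>) -> std::io::Result<()> {
        let file = archive.get_reader(); // proposed: &fs::File
        std::thread::scope(|s| {
            for entry in archive.iter_data_entries() { // proposed iterator
                s.spawn(move || {
                    let mut buf = vec![0u8; entry.compressed_size() as usize];
                    file.read_exact_at(&mut buf, entry.data_start()).unwrap();
                    // hand `buf` to construct_decompressing_reader() here
                });
            }
        });
        Ok(())
    }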

@Pr0methean (Member) left a comment:

Some of the assumptions this PR makes about filesystems are incorrect. (It's possible that one or two of these bugs are pre-existing, but that's no excuse not to fix them.) This is just a partial review.

Comment on lines +1086 to +1088
/* TODO: do we want to automatically make the directory writable? Wouldn't we prefer to
 * respect the write permissions of the extraction dir? Pipelined extraction does not
 * mutate permissions like this. */
@Pr0methean (Member) suggested a change removing this TODO comment, with the explanation:

We need to make the directory temporarily writable, in case it contains files that we need to extract. https://github.com/zip-rs/zip2/blob/master/tests/repro_old423.rs would break otherwise, because it contains a non-empty and non-writable folder.

@cosmicexplorer (Contributor, Author) replied:

Ok, that makes sense. I was being lazy here 😅. The perms_todo permissions mechanism totally works to solve this (it's patterned after the existing .extract() code), but I was hoping to avoid handling perms like that. However, since this explicitly only supports #[cfg(unix)] targets for now (I'm not sure how to achieve something like pread() on Windows), it might not be as hard as I thought. This should be easy to integrate, thanks.

@cosmicexplorer (Contributor, Author) replied:

Ok, so I just added a pipelined version of that test, and it seems to work: 4bbc351. In both the current .extract() method and pipelined extraction, we only apply perms after all the files and directories are created and written. I'm leaning towards not circumventing the permissions of existing directories on disk, since I think it would be very surprising for a non-writable directory to become writable just because we (perhaps accidentally) extracted a zip file into it.

@Pr0methean (Member) replied:

Makes sense; I agree that only new directories should be temporarily writable.
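
A sketch of that shared "apply permissions last" pattern (the perms_todo name comes from this PR, but the types and this helper are my assumption):

    use std::fs;
    use std::os::unix::fs::PermissionsExt;
    use std::path::PathBuf;

    /// Called only after every entry has been created and written:
    /// applying modes last means a read-only directory mode can't
    /// block file creation mid-extraction.
    fn apply_perms(perms_todo: &[(PathBuf, u32)]) -> std::io::Result<()> {
        for (path, mode) in perms_todo {
            fs::set_permissions(path, fs::Permissions::from_mode(*mode))?;
        }
        Ok(())
    }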

        }
        if ret.is_empty() {
            return Err(PathSplitError::PathFormat(format!(
                "path {:?} resolves to the top-level directory",
@Pr0methean (Member) commented Jul 19, 2024:

This is valid if the destination is the root directory -- we just shouldn't update properties in that case.

@cosmicexplorer (Contributor, Author) replied:

Ah ok, that makes sense. I'll also need to update this code anyway to address making directories temporarily writable as you mentioned in the other comment.

@cosmicexplorer (Contributor, Author) replied:

Hmm, actually, clarification: when you say "destination", you're referring to the extraction dir? It's probably not clear, but this method normalize_parent_dirs() accepts the entire path string (including final filename component), and uses the is_dir boolean tuple argument to indicate whether to split off the final component as a filename. So if we have ret.is_empty(), what that means is that the entire path resolves to the equivalent of ./. Are you saying that ./ directory entries are valid, but we should just ignore them instead of erroring here? That would make sense, just want to make sure.

@Pr0methean (Member) replied:

Yes, that's right.

"." => (),
/* If ".." is present, pop off the last element or return an error. */
".." => {
if ret.pop().is_none() {
@Pr0methean (Member) commented Jul 19, 2024:

This is valid if the destination is the root directory, because /.. is the same as / -- although again, we shouldn't update properties.

@cosmicexplorer (Contributor, Author) replied:

Ah ok, so this is literally handling the case where, instead of a path relative to the extraction dir, the entry is supposed to expand into an absolute path? From the current .extract() code, where we call .enclosed_name(), it looks like we don't currently support absolute extraction paths. I think my intent here was to mimic that logic, but are you saying instead that we should treat absolute entry names beginning with / as relative to the extraction dir?

This code is purely processing entry names right now (not yet conjoined to the extraction dir), so I was under the impression they should all be relative paths, and that (like .enclosed_name()) we should error out if they use too many ..s. I'm not sure how to square that with "if the destination is the root directory". This is probably pretty simple, but I would appreciate further clarification here (thanks!).

@Pr0methean (Member) replied:

Well, the POSIX standard states that /.. and / are the same directory. Unless the ZIP APPNOTE says otherwise, I want to follow that standard.
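
That reading could be sketched like this (my interpretation of the thread, not the PR's final code): ".." past the top level is an error for ordinary destinations, but a no-op when the destination is the filesystem root, since POSIX says /.. is /:

    /// Normalize entry-name components relative to the extraction dir.
    fn normalize<'a>(
        components: impl Iterator<Item = &'a str>,
        dest_is_root: bool,
    ) -> Result<Vec<&'a str>, String> {
        let mut ret = Vec::new();
        for c in components {
            match c {
                "" | "." => {} // ignore empty and current-dir components
                ".." => {
                    // Popping past the top is only legal when the
                    // destination is "/": "/.." is "/" per POSIX.
                    if ret.pop().is_none() && !dest_is_root {
                        return Err("path escapes the extraction dir".to_string());
                    }
                }
                c => ret.push(c),
            }
        }
        Ok(ret)
    }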

#[derive(PartialEq, Eq, Debug, Clone)]
pub(crate) enum FSEntry<'a, Data> {
    Dir(DirEntry<'a, Data>),
    File(Data),
@Pr0methean (Member) commented:

ZIP files can also contain symlinks.

@cosmicexplorer (Contributor, Author) replied:

I wanted to avoid that at first (because it would require reading the symlink targets from the ZipArchive), but I think I'm ready to implement that pass now.

@Pr0methean added the "Please fix failing tests" and "Please address review comments" labels and removed the "review in progress" label Jul 19, 2024
@cosmicexplorer mentioned this pull request Aug 12, 2024
@Pr0methean (Member) commented Aug 17, 2024:

[Screenshot: GitHub's merge box showing the rebase attempt failing]

Looks like we've managed to break GitHub's rebase logic. (I've already tried clicking the "Try again" button several times.)

@cosmicexplorer mentioned this pull request Aug 17, 2024
@cosmicexplorer (Contributor, Author) commented:

I think that might have been because I started this from my fork of zip-old (https://github.com/cosmicexplorer/zip), instead of zip2 (https://github.com/cosmicexplorer/zip2) (or at least, that caused an error in a different PR, and I'm pretty sure it's true for this one, since I don't think I've touched it since I switched my origin remote to my zip2 fork).

I'm also going to focus on this PR now (I've just converted my other two open PRs to drafts), so I'm going to try opening a new version of it against my zip2 fork. I will not miss any of your comments; I will open new inline review comments wherever I haven't been able to unambiguously resolve them.

@cosmicexplorer (Contributor, Author) commented:

This is now available at #236! I will ping when this is again ready for review (hoping to make progress on it today!).
