-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove BinaryReader and BinaryWriter references from ZipArchive #103153
base: main
Are you sure you want to change the base?
Conversation
We now read the data into a (sometimes stack-allocated) byte array and slice it up with BinaryPrimitives. This reduces the number of reads and writes required to read and write a ZipArchive. It also makes future work to enable async APIs easier, since BinaryReader and BinaryWriter lack this support. Also changed approach to reading central file directory headers. Rather than performing X reads per header, we read 4KB of data at a time and look for all applicable headers in that data. This should improve performance when dealing with many small files.
This allowed the removal of the ArchiveReader property from ZipArchive.
Now pooling the file IO buffers and the temporary buffers for extra fields of the CD file header (which would otherwise be allocated and deallocated in a loop.)
Tagging subscribers to this area: @dotnet/area-system-io-compression |
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs
Outdated
Show resolved
Hide resolved
This handles 64x entries with 19-character filenames (and thus, 65-byte file headers.) As a result, it straddles two 4KB read buffers. Also corrected the seek logic while reading the central directory header
@edwardneal the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick partial review. Also needs resolving the merge conflict so we get proper CI results.
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs
Show resolved
Hide resolved
src/libraries/System.IO.Compression/tests/ZipArchive/zip_ReadTests.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/tests/ZipArchive/zip_ReadTests.cs
Outdated
Show resolved
Hide resolved
Thanks @carlossanlop - I've addressed the merge conflict and made those test changes. I've got a number of post-merge test failures to deal with - I'll look at those shortly, ready for review post-14th. |
The buffer returned from the ArrayPool contained older data (including a ZIP header.) When reading the last chunk of the file (i.e a chunk which was less than BackwardsSeekingBufferSize) the buffer's Span wasn't resized to account for this. SeekBackwardsToSignature would thus find the older data, try to seek beyond the end of the stream and fail to read the file.
Relates to #83909, #31460.
This removes all references to BinaryReader and BinaryWriter from ZipArchive and ZipArchiveEntry. It also changes the way that the ZIP central directory header is read from the source stream and makes one tweak to the way that the EOCD header is detected.
I've removed BinaryReader and BinaryWriter for three reasons:
The second change is to adjust the way that the ZIP central directory header is read. Previously, this was read from the source stream file-by-file. This PR now reads from the source in 4KB blocks and tries to read the headers from there. This is much faster. I've chosen not to implement it when writing the CD headers because they contain dynamic data and I didn't want to copy buffers around; I'm open to doing so.
The detection of the end-of-central-directory header is very similar too. It was already doing something similar, but with only 16 bytes at a time. I've tweaked this to read 4KB block instead, and changed the way it searches for the EOCD signature to use an intrinsic rather than byte-by-byte bit shuffling.
In both cases, I've picked 4KB because it feels like a small enough buffer to not make a massive difference to wait times, and it aligns with the 4KB buffer on FileStream (which I imagine would be the most common use case.)
There are performance improvements across the board. To benchmark this, I used a test wrapping stream which simulates the worst case - an Xms Thread.Sleep on every Read and Write call. Results are below, but in short:
Reads
Creation
Benchmark results - Reads
Benchmark results - Reads (scaling up by number of ZipArchiveEntry children)
Benchmark results - Creation
Benchmark header
NB: because this changes the number of reads/writes to a stream, it'll have an impact on the tests for #102704. I'll change these depending on the order the PRs are merged in.