Remove BinaryReader and BinaryWriter references from ZipArchive #103153

edwardneal · 2024-06-07T06:45:55Z

This removes all references to BinaryReader and BinaryWriter from ZipArchive and ZipArchiveEntry. It also changes the way that the ZIP central directory header is read from the source stream and makes one tweak to the way that the EOCD header is detected.

I've removed BinaryReader and BinaryWriter for three reasons:

They were being used to read and write to the stream field-by-field - so every file could generate nearly two dozen individual writes to the backing stream.
There are no async implementations on these two objects, so these need to be implemented (or their usage removed) in order to add any future async support for ZipArchive.
Removes some allocations & GC pressure.

The second change is to adjust the way that the ZIP central directory header is read. Previously, this was read from the source stream file-by-file. This PR now reads from the source in 4KB blocks and tries to read the headers from there. This is much faster. I've chosen not to implement it when writing the CD headers because they contain dynamic data and I didn't want to copy buffers around; I'm open to doing so.

The detection of the end-of-central-directory header is very similar too. It was already doing something similar, but with only 16 bytes at a time. I've tweaked this to read 4KB block instead, and changed the way it searches for the EOCD signature to use an intrinsic rather than byte-by-byte bit shuffling.

In both cases, I've picked 4KB because it feels like a small enough buffer to not make a massive difference to wait times, and it aligns with the 4KB buffer on FileStream (which I imagine would be the most common use case.)

There are performance improvements across the board. To benchmark this, I used a test wrapping stream which simulates the worst case - an Xms Thread.Sleep on every Read and Write call. Results are below, but in short:

Reads

12-13% reduction in execution time as a baseline, rising to 18-19% as the number of entries in the archive increases.
33-36% reduction in memory usage
When latency is introduced, as the number of entries in the archive increases, the reduction in execution time becomes more pronounced - 99.6%

Creation

Execution time is almost identical assuming no latency. As latency increases, the execution time reduces by around 82%
10% reduction in memory usage

Benchmark results - Reads

Method	Job	Runtime	NumberOfFiles	LatencyMS	Mean	Error	StdDev	Median	Ratio	RatioSD	Gen0	Gen1	Allocated	Alloc Ratio
Benchmark	Baseline	.NET 8.0	2	-1	1.280 μs	0.0178 μs	0.0338 μs	1.269 μs	1.00	0.00	0.4482	-	1.84 KB	1.00
Benchmark	CoreRun	.NET 9.0	2	-1	1.131 μs	0.0104 μs	0.0081 μs	1.133 μs	0.87	0.03	0.2861	-	1.17 KB	0.64

Benchmark	Baseline	.NET 8.0	2	0	21.412 μs	0.5187 μs	1.5211 μs	21.083 μs	1.00	0.00	0.4578	-	1.88 KB	1.00
Benchmark	CoreRun	.NET 9.0	2	0	2.871 μs	0.0318 μs	0.0265 μs	2.871 μs	0.13	0.01	0.2937	-	1.21 KB	0.65

Benchmark	Baseline	.NET 8.0	2	1	716,595.380 μs	2,212.8744 μs	2,069.9241 μs	716,359.000 μs	1.00	0.00	-	-	2.64 KB	1.00
Benchmark	CoreRun	.NET 9.0	2	1	62,065.810 μs	280.6085 μs	248.7521 μs	62,120.450 μs	0.09	0.00	-	-	1.29 KB	0.49

Benchmark	Baseline	.NET 8.0	2	5	716,582.567 μs	2,484.5321 μs	2,324.0329 μs	716,860.600 μs	1.00	0.00	-	-	2.64 KB	1.00
Benchmark	CoreRun	.NET 9.0	2	5	62,190.336 μs	212.1850 μs	198.4780 μs	62,244.822 μs	0.09	0.00	-	-	1.29 KB	0.49

Benchmark	Baseline	.NET 8.0	10	-1	3.925 μs	0.0296 μs	0.0247 μs	3.922 μs	1.00	0.00	1.7929	-	7.33 KB	1.00
Benchmark	CoreRun	.NET 9.0	10	-1	3.452 μs	0.0683 μs	0.0786 μs	3.424 μs	0.88	0.03	1.1864	-	4.85 KB	0.66

Benchmark	Baseline	.NET 8.0	10	0	98.469 μs	5.5739 μs	16.0820 μs	92.865 μs	1.00	0.00	1.7090	-	7.37 KB	1.00
Benchmark	CoreRun	.NET 9.0	10	0	5.902 μs	0.1166 μs	0.2011 μs	5.830 μs	0.07	0.01	1.1902	-	4.89 KB	0.66

Benchmark	Baseline	.NET 8.0	10	1	2,966,581.267 μs	5,120.4172 μs	4,789.6414 μs	2,967,184.400 μs	1.00	0.00	-	-	8.09 KB	1.00
Benchmark	CoreRun	.NET 9.0	10	1	62,189.399 μs	283.9015 μs	265.5616 μs	62,213.289 μs	0.02	0.00	-	-	4.97 KB	0.61

Benchmark	Baseline	.NET 8.0	10	5	2,965,424.293 μs	5,345.5963 μs	5,000.2741 μs	2,965,234.300 μs	1.00	0.00	-	-	8.09 KB	1.00
Benchmark	CoreRun	.NET 9.0	10	5	62,184.896 μs	320.7364 μs	300.0170 μs	62,224.356 μs	0.02	0.00	-	-	4.97 KB	0.61

Benchmark	Baseline	.NET 8.0	25	-1	9.367 μs	0.1231 μs	0.1028 μs	9.356 μs	1.00	0.00	4.2114	-	17.22 KB	1.00
Benchmark	CoreRun	.NET 9.0	25	-1	7.661 μs	0.0472 μs	0.0394 μs	7.654 μs	0.82	0.01	2.7771	-	11.34 KB	0.66

Benchmark	Baseline	.NET 8.0	25	0	189.773 μs	1.5383 μs	1.2010 μs	190.032 μs	1.00	0.00	4.1504	-	17.26 KB	1.00
Benchmark	CoreRun	.NET 9.0	25	0	9.558 μs	0.1294 μs	0.1210 μs	9.529 μs	0.05	0.00	2.7771	-	11.38 KB	0.66

Benchmark	Baseline	.NET 8.0	25	1	7,184,576.887 μs	6,870.6351 μs	6,426.7964 μs	7,183,295.200 μs	1.000	0.00	-	-	17.98 KB	1.00
Benchmark	CoreRun	.NET 9.0	25	1	62,115.901 μs	231.8962 μs	216.9159 μs	62,113.444 μs	0.009	0.00	-	-	11.46 KB	0.64

Benchmark	Baseline	.NET 8.0	25	5	7,176,659.979 μs	7,168.4172 μs	6,354.6151 μs	7,177,295.900 μs	1.000	0.00	-	-	17.98 KB	1.00
Benchmark	CoreRun	.NET 9.0	25	5	62,154.444 μs	243.2540 μs	227.5399 μs	62,233.989 μs	0.009	0.00	-	-	11.46 KB	0.64

Benchmark	Baseline	.NET 8.0	50	-1	18.166 μs	0.2379 μs	0.2225 μs	18.171 μs	1.00	0.00	8.4229	-	34.48 KB	1.00
Benchmark	CoreRun	.NET 9.0	50	-1	14.800 μs	0.1155 μs	0.1080 μs	14.796 μs	0.81	0.01	5.6152	-	22.95 KB	0.67

Benchmark	Baseline	.NET 8.0	50	0	368.723 μs	3.1490 μs	2.9455 μs	367.986 μs	1.00	0.00	8.3008	-	34.52 KB	1.00
Benchmark	CoreRun	.NET 9.0	50	0	17.192 μs	0.2263 μs	0.2117 μs	17.291 μs	0.05	0.00	5.6152	0.0305	22.98 KB	0.67

Benchmark	Baseline	.NET 8.0	50	1	14,205,316.860 μs	11,442.8220 μs	10,703.6228 μs	14,203,486.100 μs	1.000	0.00	-	-	35.24 KB	1.00
Benchmark	CoreRun	.NET 9.0	50	1	62,229.530 μs	282.4781 μs	264.2302 μs	62,229.311 μs	0.004	0.00	-	-	23.06 KB	0.65

Benchmark	Baseline	.NET 8.0	50	5	14,211,658.147 μs	9,950.6110 μs	9,307.8077 μs	14,212,125.500 μs	1.000	0.00	-	-	35.24 KB	1.00
Benchmark	CoreRun	.NET 9.0	50	5	62,237.983 μs	341.3432 μs	319.2927 μs	62,250.889 μs	0.004	0.00	-	-	23.06 KB	0.65

Benchmark results - Reads (scaling up by number of ZipArchiveEntry children)

Method	Job	Runtime	NumberOfFiles	LatencyMS	Mean	Error	StdDev	Ratio	RatioSD	Gen0	Gen1	Gen2	Allocated	Alloc Ratio
Benchmark	Baseline	.NET 8.0	2	-1	1.276 μs	0.0162 μs	0.0152 μs	1.00	0.00	0.4482	-	-	1.84 KB	1.00
Benchmark	CoreRun	.NET 9.0	2	-1	1.120 μs	0.0116 μs	0.0129 μs	0.88	0.01	0.2861	-	-	1.17 KB	0.64

Benchmark	Baseline	.NET 8.0	10	-1	4.527 μs	0.0259 μs	0.0229 μs	1.00	0.00	1.7929	-	-	7.33 KB	1.00
Benchmark	CoreRun	.NET 9.0	10	-1	3.964 μs	0.0315 μs	0.0295 μs	0.88	0.01	1.1826	-	-	4.85 KB	0.66

Benchmark	Baseline	.NET 8.0	25	-1	10.897 μs	0.0787 μs	0.0698 μs	1.00	0.00	4.2114	-	-	17.22 KB	1.00
Benchmark	CoreRun	.NET 9.0	25	-1	8.824 μs	0.0499 μs	0.0466 μs	0.81	0.01	2.7771	-	-	11.34 KB	0.66

Benchmark	Baseline	.NET 8.0	50	-1	21.540 μs	0.1920 μs	0.1604 μs	1.00	0.00	8.4229	-	-	34.48 KB	1.00
Benchmark	CoreRun	.NET 9.0	50	-1	17.341 μs	0.1089 μs	0.1019 μs	0.81	0.01	5.6152	-	-	22.95 KB	0.67

Benchmark	Baseline	.NET 8.0	100	-1	43.626 μs	0.4626 μs	0.6016 μs	1.00	0.00	16.9678	0.1221	-	69.46 KB	1.00
Benchmark	CoreRun	.NET 9.0	100	-1	33.674 μs	0.1899 μs	0.1683 μs	0.77	0.01	11.3525	-	-	46.59 KB	0.67

Benchmark	Baseline	.NET 8.0	250	-1	109.428 μs	1.3621 μs	1.2074 μs	1.00	0.00	41.2598	0.1221	-	168.87 KB	1.00
Benchmark	CoreRun	.NET 9.0	250	-1	86.518 μs	1.4735 μs	1.2305 μs	0.79	0.01	27.3438	6.7139	-	112.02 KB	0.66

Benchmark	Baseline	.NET 8.0	500	-1	222.703 μs	1.4561 μs	1.3621 μs	1.00	0.00	83.2520	0.2441	-	340.65 KB	1.00
Benchmark	CoreRun	.NET 9.0	500	-1	171.378 μs	1.2795 μs	1.0684 μs	0.77	0.01	55.4199	0.2441	-	227.16 KB	0.67

Benchmark	Baseline	.NET 8.0	1000	-1	476.601 μs	8.2338 μs	6.8756 μs	1.00	0.00	133.7891	72.2656	-	686.68 KB	1.00
Benchmark	CoreRun	.NET 9.0	1000	-1	361.508 μs	1.1992 μs	1.0630 μs	0.76	0.01	90.8203	47.3633	-	459.91 KB	0.67

Benchmark	Baseline	.NET 8.0	10000	-1	12,294.500 μs	244.9925 μs	547.9609 μs	1.00	0.00	1171.8750	687.5000	265.6250	6879.98 KB	1.00
Benchmark	CoreRun	.NET 9.0	10000	-1	10,722.270 μs	212.4330 μs	504.8696 μs	0.87	0.06	843.7500	578.1250	234.3750	4614.13 KB	0.67

Benchmark results - Creation

Method	Job	Runtime	NumberOfFiles	LatencyMS	Mean	Error	StdDev	Median	Ratio	RatioSD	Gen0	Gen1	Allocated	Alloc Ratio
Benchmark	Baseline	.NET 8.0	2	-1	17.17 μs	0.289 μs	0.256 μs	17.09 μs	1.00	0.00	0.5798	-	2.44 KB	1.00
Benchmark	CoreRun	.NET 9.0	2	-1	14.20 μs	0.214 μs	0.179 μs	14.26 μs	0.83	0.01	0.5188	-	2.15 KB	0.88

Benchmark	Baseline	.NET 8.0	2	0	48.28 μs	0.883 μs	2.115 μs	47.69 μs	1.00	0.00	0.5493	-	2.48 KB	1.00
Benchmark	CoreRun	.NET 9.0	2	0	20.52 μs	0.390 μs	0.848 μs	20.22 μs	0.42	0.03	0.5188	-	2.19 KB	0.88

Benchmark	Baseline	.NET 8.0	2	1	1,153,370.77 μs	3,033.808 μs	2,837.826 μs	1,153,076.60 μs	1.00	0.00	-	-	3.24 KB	1.00
Benchmark	CoreRun	.NET 9.0	2	1	202,163.82 μs	810.531 μs	758.172 μs	202,433.70 μs	0.18	0.00	-	-	2.43 KB	0.75

Benchmark	Baseline	.NET 8.0	2	5	1,154,271.62 μs	2,629.954 μs	2,460.061 μs	1,153,689.70 μs	1.00	0.00	-	-	3.24 KB	1.00
Benchmark	CoreRun	.NET 9.0	2	5	202,071.53 μs	987.541 μs	923.747 μs	201,793.43 μs	0.18	0.00	-	-	2.43 KB	0.75

Benchmark	Baseline	.NET 8.0	10	-1	59.75 μs	0.906 μs	0.803 μs	59.52 μs	1.00	0.00	3.2959	-	13.5 KB	1.00
Benchmark	CoreRun	.NET 9.0	10	-1	59.42 μs	0.269 μs	0.238 μs	59.40 μs	0.99	0.01	2.9297	-	12.21 KB	0.90

Benchmark	Baseline	.NET 8.0	10	0	226.82 μs	5.396 μs	15.395 μs	227.63 μs	1.00	0.00	3.1738	-	13.54 KB	1.00
Benchmark	CoreRun	.NET 9.0	10	0	98.33 μs	1.223 μs	1.084 μs	98.45 μs	0.48	0.03	2.9297	-	12.25 KB	0.90

Benchmark	Baseline	.NET 8.0	10	1	5,274,439.79 μs	6,901.351 μs	6,455.528 μs	5,273,943.90 μs	1.00	0.00	-	-	14.26 KB	1.00
Benchmark	CoreRun	.NET 9.0	10	1	950,228.86 μs	2,438.419 μs	2,161.595 μs	950,750.30 μs	0.18	0.00	-	-	12.97 KB	0.91

Benchmark	Baseline	.NET 8.0	10	5	5,276,480.03 μs	5,844.084 μs	5,466.560 μs	5,279,086.70 μs	1.00	0.00	-	-	14.26 KB	1.00
Benchmark	CoreRun	.NET 9.0	10	5	949,421.14 μs	2,346.035 μs	2,194.483 μs	948,990.50 μs	0.18	0.00	-	-	12.97 KB	0.91

Benchmark	Baseline	.NET 8.0	25	-1	148.85 μs	2.834 μs	3.264 μs	149.14 μs	1.00	0.00	7.5684	-	31.16 KB	1.00
Benchmark	CoreRun	.NET 9.0	25	-1	147.24 μs	1.313 μs	1.229 μs	146.78 μs	0.99	0.03	6.8359	-	28 KB	0.90

Benchmark	Baseline	.NET 8.0	25	0	490.65 μs	5.097 μs	4.257 μs	491.34 μs	1.00	0.00	6.8359	-	31.2 KB	1.00
Benchmark	CoreRun	.NET 9.0	25	0	214.95 μs	1.275 μs	1.065 μs	214.92 μs	0.44	0.00	6.8359	-	28.04 KB	0.90

Benchmark	Baseline	.NET 8.0	25	1	13,010,754.08 μs	7,587.337 μs	7,097.199 μs	13,008,804.00 μs	1.00	0.00	-	-	31.92 KB	1.00
Benchmark	CoreRun	.NET 9.0	25	1	2,353,101.94 μs	4,536.628 μs	3,788.290 μs	2,352,594.90 μs	0.18	0.00	-	-	28.76 KB	0.90

Benchmark	Baseline	.NET 8.0	25	5	13,008,020.51 μs	8,158.693 μs	7,631.646 μs	13,008,868.30 μs	1.00	0.00	-	-	31.92 KB	1.00
Benchmark	CoreRun	.NET 9.0	25	5	2,356,024.47 μs	4,256.167 μs	3,981.221 μs	2,356,131.80 μs	0.18	0.00	-	-	28.76 KB	0.90

Benchmark	Baseline	.NET 8.0	50	-1	294.44 μs	1.924 μs	1.800 μs	294.02 μs	1.00	0.00	15.1367	-	62.7 KB	1.00
Benchmark	CoreRun	.NET 9.0	50	-1	297.19 μs	4.039 μs	3.580 μs	297.07 μs	1.01	0.01	13.6719	0.9766	56.41 KB	0.90

Benchmark	Baseline	.NET 8.0	50	0	981.38 μs	4.847 μs	4.534 μs	980.08 μs	1.00	0.00	13.6719	-	62.74 KB	1.00
Benchmark	CoreRun	.NET 9.0	50	0	428.33 μs	3.014 μs	2.672 μs	428.36 μs	0.44	0.00	13.6719	0.4883	56.45 KB	0.90

Benchmark	Baseline	.NET 8.0	50	1	25,900,673.22 μs	15,093.982 μs	14,118.920 μs	25,902,252.70 μs	1.00	0.00	-	-	63.46 KB	1.00
Benchmark	CoreRun	.NET 9.0	50	1	4,699,400.51 μs	5,271.756 μs	4,931.204 μs	4,697,546.80 μs	0.18	0.00	-	-	57.17 KB	0.90

Benchmark	Baseline	.NET 8.0	50	5	25,891,853.07 μs	12,417.690 μs	11,615.515 μs	25,891,875.90 μs	1.00	0.00	-	-	63.46 KB	1.00
Benchmark	CoreRun	.NET 9.0	50	5	4,698,669.15 μs	7,373.175 μs	6,896.872 μs	4,699,045.70 μs	0.18	0.00	-	-	57.17 KB	0.90

Benchmark header

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3593/23H2/2023Update/SunValley3)
Intel Core i7-8565U CPU 1.80GHz (Whiskey Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK 8.0.200
  [Host]   : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
  Baseline : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
  CoreRun  : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2

NB: because this changes the number of reads/writes to a stream, it'll have an impact on the tests for #102704. I'll change these depending on the order the PRs are merged in.

We now read the data into a (sometimes stack-allocated) byte array and slice it up with BinaryPrimitives. This reduces the number of reads and writes required to read and write a ZipArchive. It also makes future work to enable async APIs easier, since BinaryReader and BinaryWriter lack this support. Also changed approach to reading central file directory headers. Rather than performing X reads per header, we read 4KB of data at a time and look for all applicable headers in that data. This should improve performance when dealing with many small files.

This allowed the removal of the ArchiveReader property from ZipArchive.

Now pooling the file IO buffers and the temporary buffers for extra fields of the CD file header (which would otherwise be allocated and deallocated in a loop.)

dotnet-policy-service · 2024-06-07T06:46:21Z

Tagging subscribers to this area: @dotnet/area-system-io-compression
See info in area-owners.md if you want to be subscribed.

src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs

This handles 64x entries with 19-character filenames (and thus, 65-byte file headers.) As a result, it straddles two 4KB read buffers. Also corrected the seek logic while reading the central directory header

carlossanlop · 2024-08-05T17:58:40Z

@edwardneal the main branch is currently only taking changes for RC1 that will go into .NET 9. This change currently does not meet the bar for merging it now, but we would love to consider taking it for .NET 10. The RC1 branch will get snapped from main on August 14th, and we can consider merging this PR after that date, when main will start pointing to .NET 10.

carlossanlop

A quick partial review. Also needs resolving the merge conflict so we get proper CI results.

src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs

src/libraries/System.IO.Compression/tests/ZipArchive/zip_ReadTests.cs

…ead-write

edwardneal · 2024-08-08T06:46:46Z

Thanks @carlossanlop - I've addressed the merge conflict and made those test changes.

I've got a number of post-merge test failures to deal with - I'll look at those shortly, ready for review post-14th.
Edit: these should now be resolved, all tests are passing. It's ready for review.

The buffer returned from the ArrayPool contained older data (including a ZIP header.) When reading the last chunk of the file (i.e a chunk which was less than BackwardsSeekingBufferSize) the buffer's Span wasn't resized to account for this. SeekBackwardsToSignature would thus find the older data, try to seek beyond the end of the stream and fail to read the file.

edwardneal added 5 commits June 4, 2024 06:43

Removed BinaryReader from ZipArchiveEntry

13dee7a

This allowed the removal of the ArchiveReader property from ZipArchive.

Removed BinaryWriter from ZipArchiveEntry

650b4f8

Re-adding missing assertion

b0b8f25

Reduced memory usage with array pooling

8d1c807

Now pooling the file IO buffers and the temporary buffers for extra fields of the CD file header (which would otherwise be allocated and deallocated in a loop.)

dotnet-issue-labeler bot added the area-System.IO.Compression label Jun 7, 2024

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jun 7, 2024

bartonjs reviewed Jun 7, 2024

View reviewed changes

src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs Outdated Show resolved Hide resolved

edwardneal added 2 commits June 7, 2024 21:42

Corrected duplicate comment

c56ff3a

Added test for changed central directory read method

9691c2a

This handles 64x entries with 19-character filenames (and thus, 65-byte file headers.) As a result, it straddles two 4KB read buffers. Also corrected the seek logic while reading the central directory header

ericstj requested a review from carlossanlop July 22, 2024 21:10

carlossanlop reviewed Aug 5, 2024

View reviewed changes

carlossanlop added this to the 10.0.0 milestone Aug 5, 2024

carlossanlop assigned edwardneal Aug 5, 2024

edwardneal added 2 commits August 8, 2024 05:05

Merge remote-tracking branch 'upstream/main' into ziparchive-stream-r…

444b85e

…ead-write

Test changes: formatting, converted Theory to Fact

13fc2e8

This was referenced Aug 9, 2024

System.IO.Net5Compat.Tests and System.IO.Tests suddenly exiting with error 137 #100558

Open

SIGKILL (OOM?) while running LibraryImportGenerator.Tests w/o actionable log messages or artifacts dotnet/dnceng#2496

Open

ericstj assigned carlossanlop Sep 30, 2024

ericstj unassigned edwardneal Sep 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove BinaryReader and BinaryWriter references from ZipArchive #103153

Remove BinaryReader and BinaryWriter references from ZipArchive #103153

edwardneal commented Jun 7, 2024 •

edited

Loading

dotnet-policy-service bot commented Jun 7, 2024

carlossanlop commented Aug 5, 2024

carlossanlop left a comment

edwardneal commented Aug 8, 2024 •

edited

Loading

Remove BinaryReader and BinaryWriter references from ZipArchive #103153

Are you sure you want to change the base?

Remove BinaryReader and BinaryWriter references from ZipArchive #103153

Conversation

edwardneal commented Jun 7, 2024 • edited Loading

Reads

Creation

dotnet-policy-service bot commented Jun 7, 2024

carlossanlop commented Aug 5, 2024

carlossanlop left a comment

Choose a reason for hiding this comment

edwardneal commented Aug 8, 2024 • edited Loading

edwardneal commented Jun 7, 2024 •

edited

Loading

edwardneal commented Aug 8, 2024 •

edited

Loading