
Zstandard decompressor #14394

Merged 62 commits into ziglang:master from dweiller:zstandard on Feb 21, 2023

Conversation

dweiller
Contributor

@dweiller dweiller commented Jan 21, 2023

This is an implementation of a Zstandard decompressor, intended to close #14183.

There are some things that should be done before this is mergeable:

  • implement streaming decompression (allowing decompression of frames that don't specify content size)
  • add a function for decompressing multiple frames (not sure the reference implementation ever does this, but the RFC says multiple frames are permitted in a stream/file)
  • audit math operations for those with operands read from input and add safety checks (first pass)
  • audit integer casts for those with argument read from input and add safety checks (first pass)
  • audit slice access for those with access indices depending on input and add safety checks (first pass)
  • doc comments

If any of the above points shouldn't block merging an initial implementation and can be moved to follow-up issues, or if there are more issues I haven't listed above, let me know (e.g. the API might need to be cleaned up).

The current implementation of std.compress.zstandard.decode() avoids heap allocation by decompressing literals on the fly directly into the output buffer, but this requires frames to declare their decompressed content size. It might be the case that decompressing literals ahead of time is faster (this could use 4 threads in most cases), though that would require a separate buffer, which could be up to 128KB.

Notes on items in the checklist

  • The safety checks for math operations/casts/slice access are probably required for release-fast and release-small modes, as these disable the checks that make decompression safe (by crashing) in safe build modes; without manual checks or @setRuntimeSafety(true), malicious/malformed data might cause undefined behaviour. Edit: everything should have appropriate checks now (all math overflow and slice access issues fuzzing has found so far have been fixed).
  • I'm a little unsure at the moment of the best way to implement streaming decompression. My initial plan is to decompress sequences into a ring buffer with capacity equal to the frame window size, clobbering any existing data; this would require the user to read at least block_maximum_size bytes before continuing decompression (to be sure unread data won't be clobbered), which is the whole window unless the window is bigger than 128KB (see the sketch below). If anyone has ideas I'm all ears.
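
A minimal sketch of that ring-buffer idea (hypothetical names, not the code in this PR): decompressed bytes wrap around and clobber the oldest data, and match copies read from behind the write head.

const RingBuffer = struct {
    data: []u8, // capacity equal to the frame window size
    write_index: usize = 0,

    /// Append decompressed bytes, wrapping and clobbering the oldest data.
    fn writeSlice(self: *RingBuffer, bytes: []const u8) void {
        for (bytes) |b| {
            self.data[self.write_index] = b;
            self.write_index = (self.write_index + 1) % self.data.len;
        }
    }

    /// Copy `len` bytes starting `offset` bytes behind the write head, as
    /// needed for LZ77-style match copies in Zstandard sequences.
    fn copyMatch(self: *RingBuffer, offset: usize, len: usize) void {
        var read_index = (self.write_index + self.data.len - offset) % self.data.len;
        var i: usize = 0;
        while (i < len) : (i += 1) {
            self.data[self.write_index] = self.data[read_index];
            self.write_index = (self.write_index + 1) % self.data.len;
            read_index = (read_index + 1) % self.data.len;
        }
    }
};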

@dweiller
Contributor Author

The xxhash implementation could be split out into a separate PR if there is a desire to get that merged earlier or just separate the PRs/discussion.

@andrewrk
Member

Hi, thanks for working on this. Here is some early feedback:

  • Don't forget to update std.zig and std/crypto.zig to expose some of these new APIs.

  • The testdata files have to be excluded from zig's installation manually, sorry about that:

    zig/build.zig

    Lines 109 to 115 in c0284e2

    .exclude_extensions = &[_][]const u8{
    // exclude files from lib/std/compress/testdata
    ".gz",
    ".z.0",
    ".z.9",
    "rfc1951.txt",
    "rfc1952.txt",

  • I do recommend splitting up this work and trying to merge it as soon as possible, and then doing follow-up work (such as xxhash, as you mentioned, as well as your follow-up issues). I've seen a lot of projects like this get close to being done, and then get abandoned, and I'd hate to see that happen here.

  • Nice work on avoiding heap allocation 👨‍🍳 😘 👌

  • As for safety checks, I expect bugs in the decompression code to cause undefined behavior in ReleaseFast and ReleaseSmall modes. This is true in all of the standard library, and zstd is no exception. Fuzz testing and auditing are the tools we have available to reduce the amount of bugs. It is the choice of the application using the standard library to decide whether it wants safety checks in each dependency.

@dweiller
Contributor Author

  • Don't forget to update std.zig and std/crypto.zig to expose some of these new APIs.

I've exposed xxhash in std/hash.zig and put Zstandard in std/compress.zig.

  • As for safety checks, I expect bugs in the decompression code to cause undefined behavior in ReleaseFast and ReleaseSmall modes. This is true in all of the standard library, and zstd is no exception. Fuzz testing and auditing are the tools we have available to reduce the amount of bugs. It is the choice of the application using the standard library to decide whether it wants safety checks in each dependency.

Are you saying that the plan should be to have users wrap functions via something like

fn wrappedDecompress(dest: []u8, src: []const u8, verify: bool) !ReadWriteCount {
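    // force safety checks on, even in ReleaseFast/ReleaseSmall builds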
    @setRuntimeSafety(true);
    return decompressFrame(dest, src, verify);
}

if they want to compile in release-fast, but not be susceptible to problems caused by bad input? The compressed input contains instructions for copying sections of previously decompressed data, and bad/malicious input could cause math under/overflow or buffer overruns. I was thinking that there should be additional checks that return errors to catch this kind of malformed input - the current implementation relies on safe builds to crash on these malformed inputs.
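
To illustrate the kind of explicit check I have in mind (hypothetical names and error set, not the code currently in this PR): reject a match that reads before the start of the output or overruns it, returning an error instead of relying on build-mode safety.

fn validateSequence(offset: usize, match_length: usize, bytes_written: usize, dest_len: usize) !void {
    // a match may only reference data that has already been written
    if (offset > bytes_written) return error.MalformedSequence;
    // the copied match must fit in the remaining output space
    // (assumes bytes_written <= dest_len)
    if (match_length > dest_len - bytes_written) return error.DestTooSmall;
}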

I have added functionality to support streaming decompression (i.e. decoding blocks one at a time into a ring buffer) and a simple (allocating) wrapper utilising it that should support decompressing any Zstandard frame. The main building blocks are exposed, so users can design their own decompression functions.

@ifreund
Member

ifreund commented Jan 23, 2023

if they want to compile in release-fast, but not be susceptible to problems caused by bad input? The compressed input contains instructions for copying sections of previously decompressed data, and bad/malicious input could cause math under/overflow or buffer overruns. I was thinking that there should be additional checks that return errors to catch this kind of malformed input - the current implementation relies on safe builds to crash on these malformed inputs.

Failing to handle malformed/malicious input safely is a Bug. We absolutely need code to prevent overflow and memory errors caused by maliciously crafted input. One should never rely on Zig's safety checks triggering for the correctness of software. They exist to catch programming errors (such as failing to handle the full range of possible input while parsing). Perfect software should never trigger any safety checks no matter what input it is given.

@dweiller
Contributor Author

dweiller commented Jan 25, 2023

Okay, I think this is ready, barring any changes people would like to see to the API first - the error sets might want some cleanup or to be made explicit in function signatures. Not sure why the CI is failing, but I don't think it's related to this PR.

@dweiller dweiller marked this pull request as ready for review January 25, 2023 03:14
@andrewrk
Member

+ stage3-debug/bin/zig test ../lib/std/std.zig -femit-docs -fno-emit-bin --zig-lib-dir ../lib
thread 3239124 panic: reached unreachable code
/home/ci/actions-runner8/_work/zig/zig/lib/std/debug.zig:281:14: 0x40d3cac in assert (zig)
    if (!ok) unreachable; // assertion failure
             ^
/home/ci/actions-runner8/_work/zig/zig/src/Autodoc.zig:4177:35: 0x4463b07 in collectStructFieldInfo (zig)
            std.debug.assert(field.type_body_len != 0);
                                  ^
/home/ci/actions-runner8/_work/zig/zig/src/Autodoc.zig:2922:52: 0x41daabe in walkInstruction (zig)
                    try self.collectStructFieldInfo(
                                                   ^
/home/ci/actions-runner8/_work/zig/zig/src/Autodoc.zig:4325:36: 0x4462bf9 in walkRef (zig)
        return self.walkInstruction(file, parent_scope, parent_src, zir_index, need_type);
                                   ^
/home/ci/actions-runner8/_work/zig/zig/src/Autodoc.zig:1596:79: 0x41ce118 in walkInstruction (zig)
            const child = try self.walkRef(file, parent_scope, parent_src, bin.rhs, false);
                                                                              ^
/home/ci/actions-runner8/_work/zig/zig/src/Autodoc.zig:4325:36: 0x4462bf9 in walkRef (zig)
        return self.walkInstruction(file, parent_scope, parent_src, zir_index, need_type);
                                   ^

Your changes have caused autodocs to crash. While this is a flaw in the autodoc system and not your code, it does need to be fixed before this can be merged. Perhaps @kristoff-it or @der-teufel-programming would be willing to help by taking a look and offering a bug fix or workaround to unblock these changes.

@dweiller
Contributor Author

dweiller commented Jan 25, 2023

Hmm, odd - I guess there must have been a recent change to autodoc, as I think the CI passed a few days ago.

@der-teufel-programming
Contributor

der-teufel-programming commented Jan 25, 2023

I will take a look at this PR and see if I can find and fix the problem, either here or in Autodoc

Edit:
@dweiller @andrewrk I think I found the problem; it looks like missing functionality in Autodoc when it comes to analyzing things like struct { u32, u5 }. I will try to fix it as soon as I can

@kristoff-it
Member

kristoff-it commented Jan 25, 2023

#14456 was merged, rebase on latest master and everything should work smoothly.

Thank you for your patience, Autodoc is still a work in progress and occasionally we find that we don't support new parts of the language yet.

@dweiller dweiller force-pushed the zstandard branch 3 times, most recently from 0e61592 to 1501f5c on January 28, 2023 11:26
@dweiller
Contributor Author

dweiller commented Jan 28, 2023

I had a look in src/Package.zig to see how the package manager is decompressing xz and gzip tarballs, and it looks like they both use readers, without ever allocating a slice of all the tarball's bytes. The current Zstandard API doesn't fit this usage pattern so well, as it takes a slice of a whole compressed frame (or block, if you do more work yourself) in order to avoid internal allocation or copying, so a readAllAlloc() call would be required in fetchAndUnpack().

I could change the ZStandard API (or add a new one) to work with readers directly, but it won't be possible to avoid internal allocations/copying.
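
For example, bridging the current slice-based API to a reader would look something like the following sketch (the wrapper function and its max_size parameter are hypothetical; decodeZStandardFrame is the existing slice-based entry point, and std is assumed imported):

fn decompressFromReader(allocator: std.mem.Allocator, reader: anytype, dest: []u8, max_size: usize) !usize {
    // buffer the whole compressed stream, since the slice-based API needs a slice
    const src = try reader.readAllAlloc(allocator, max_size);
    defer allocator.free(src);
    const counts = try decodeZStandardFrame(dest, src, false);
    return counts.write_count;
}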

@squeek502
Collaborator

squeek502 commented Jan 31, 2023

I'd like to run some fuzz testing on this (a la #14500) but am a bit unsure how to implement the fuzzer. What should I be calling to do something like 'try to decompress some arbitrary bytes'? Would it be something like:

// return on error or null, we're just looking for illegal behavior
const decompressed_size = (getFrameDecompressedSize(bytes) catch return) orelse return;

// skippable frame, just return
if (decompressed_size == 0) return;

var buf = try allocator.alloc(u8, decompressed_size);

const verify_checksum = false; // fuzzer would almost never generate a valid checksum I assume
_ = decodeZStandardFrame(buf, bytes, verify_checksum) catch return;
// maybe verify something about the return?
// should ReadWriteCount.write_count always match `decompressed_size`?
// should ReadWriteCount.read_count always match `bytes.len`?

?

I see there's also decodeZStandardFrameAlloc--should that be used instead?

@dweiller
Contributor Author

dweiller commented Jan 31, 2023

I'd like to run some fuzz testing on this (a la #14500) but am a bit unsure how to implement the fuzzer.

Some fuzzing would be great! I was thinking about looking into it, but don't have experience fuzzing things.

I guess there are a few approaches you could take. decodeZStandardFrameAlloc is probably the easiest API to use, and the one that best suits the 'decompress some random bytes' criterion, as it should be able to decompress any valid frame. decodeZStandardFrame can only decompress frames that declare their decompressed size, so you'd need the fuzzing input to have some more specific header bytes to get far through the decoder, but you could use it as you outlined.

If it's hard to generate reasonable inputs for these major entry points, you could try fuzzing the inner parts of the API - I designed it so that the core functionality can be used to implement more bespoke decoding systems.

I actually think it might be good to tell it to verify checksums (you can always catch ChecksumFailure and treat it as success); without that, the XxHash code won't be run, though I guess you could fuzz that separately.
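
For example, the fuzzer could enable verification and whitelist the mismatch error (a sketch; it assumes ChecksumFailure is in the decode function's error set, as described above):

_ = decodeZStandardFrame(buf, bytes, true) catch |err| switch (err) {
    // a checksum mismatch still exercises the XxHash code path, so
    // treat it as success rather than disabling verification
    error.ChecksumFailure => return,
    // any other error just means malformed input, which is expected
    else => return,
};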

@squeek502
Collaborator

Sounds good, thanks for the insight--I'll probably write a few different fuzzers. So far I've written an xxhash-specific fuzzer that compares the 32-bit and 64-bit hashes to the C implementation's hashes, and it's running cleanly so far.

@squeek502
Collaborator

squeek502 commented Feb 1, 2023

Wrote a basic fuzzer implementation. First, some general comments:

  • Almost all entrypoints implicitly assume at least 4 bytes of src (they just get src[0..4] without checking len), which is an index-out-of-bounds when len is less than 4
  • decodeZStandardFrame and decodeZStandardFrameAlloc both assert that src[0..4] is the magic_number. This should either be mentioned in the doc comment or be made into an error depending on how those functions are intended to be used (i.e. if it's mostly meant for internal usage, then the assertion is fine but should still be mentioned in the doc comment)
  • It seems like there is a missing higher-level 'just decode this blob of bytes' function that can handle any input without the possibility of tripping any assertions or running into any illegal behavior. It also seems like this function might want to handle decoding multiple frames? I'm not too familiar with the zstandard format. EDIT: I see you've mentioned the possibility of a function that handles multiple frames in the OP, so ignore that part if you'd like; still, a slightly higher-level function that can handle any frame type would probably be useful, as decodeZStandardFrameAlloc will trip an assertion if the frame is not of type .zstandard.

That said, I've currently mitigated the above by only running inputs that start with the magic number. Here are some preliminary results (minimized, but not fully de-duplicated, so it's very possible that many of these might overlap in which crashes they trigger):

zstandard-fuzzing-crashes-20230131.zip

The crashes can be reproduced with the following test code:

test "fuzzed input" {
    const input = "(\xb5/\xfd"; // This is the contents of 'id:000000,sig:06,src:000001,time:12,execs:122,op:havoc,rep:2'
    const max_window_size = 256 * 1024 * 1024; // TODO: What is a reasonable value?
    const decompressed = try std.compress.zstandard.decompress.decodeZStandardFrameAlloc(std.testing.allocator, input, false, max_window_size);
    defer std.testing.allocator.free(decompressed);
}

(note: zigescape can be used to get Zig strings from the files (as in the example above), or @embedFile can be used instead)

Most of them seem to be caused by index-out-of-bounds and things like that, which as mentioned by @ifreund in #14394 (comment) should be handled gracefully without ever invoking illegal behavior for any inputs.

@dweiller dweiller force-pushed the zstandard branch 2 times, most recently from f0cdd61 to ea82ec2 on February 2, 2023 11:45
@dweiller
Contributor Author

dweiller commented Feb 2, 2023

Some API improvements have landed. There are now functions for handling input from readers, along with a high-level ZstandardStream that supports the same API as the gzip and xz streams.
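
Usage mirrors the other streams, roughly like the sketch below (the constructor name and whether init can fail are my assumptions about the shape, not necessarily the exact API):

var stream = std.compress.zstandard.zstandardStream(allocator, compressed_file.reader());
defer stream.deinit();
const decompressed = try stream.reader().readAllAlloc(allocator, max_output_size);
defer allocator.free(decompressed);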

@squeek502 Thanks for the feedback and crash cases, those crashes are now fixed.

  • decodeZStandardFrame and decodeZStandardFrameAlloc both assert that src[0..4] is the magic_number. This should either be mentioned in the doc comment or be made into an error depending on how those functions are intended to be used (i.e. if it's mostly meant for internal usage, then the assertion is fine but should still be mentioned in the doc comment)

Doc comments have been updated to indicate that the first four bytes of the input must be the Zstandard frame magic number.

  • It seems like there is a missing higher-level 'just decode this blob of bytes' function

There is now decodeFrameAlloc(), which should be able to handle a random blob or any supported frame (i.e. anything that doesn't require a dictionary).

@squeek502
Collaborator

Nice! Here's a new set of crashing inputs:

zstandard-fuzzing-crashes-20230202.zip

With this set, each input should trigger a unique crash.

@dweiller
Contributor Author

dweiller commented Feb 3, 2023

Nice! Here's a new set of crashing inputs:

Fixed - this fuzzing business feels like cheating for finding bugs.

@squeek502
Collaborator

squeek502 commented Feb 3, 2023

New set (the number in each set is getting smaller 👍):

zstandard-fuzzing-crashes-20230202.1.zip

this fuzzing business feels like cheating for finding bugs.

For things like this that it's well suited for, it pretty much is. At some point I might try setting up a fuzzer that compares the result of decompression to the zstandard reference implementation, and then the real cheat codes for finding bugs will be unlocked (since then we can find correctness bugs, too).

@dweiller
Contributor Author

dweiller commented Feb 3, 2023

Fixed those crashes.

At some point I might try setting up a fuzzer that compares the result of decompression to the zstandard reference implementation, and then the real cheat codes for finding bugs will be unlocked (since then we can find correctness bugs, too).

The reference implementation's repo has a tool tests/decodecorpus.c that might be useful for this. It can be used to generate random (valid) Zstandard files; I've been testing against a set of 1000 of these.

@dweiller
Contributor Author

One more thing -

The safety checks for math operations/casts/slice access are probably required for release-fast and release-small mode, as these disable checks that should make decompression safe (by crashing) in safe build modes; without manual checks or @setRuntimeSafety(true), malicious/malformed data might cause undefined behaviour

Can you please edit your original description to correct this, assuming it is now outdated? Otherwise, if this is still true, I will close the PR and not merge it.

Done

@squeek502
Collaborator

squeek502 commented Feb 21, 2023

Will do some last minute fuzzing just to confirm that we're good to go on that front, but here's a quick summary of what's happened with regards to the zstandard upstream:

@dweiller
Contributor Author

but here's a quick summary of what's happened with regards to the zstandard upstream:

I was just thinking it would be nice to have a summary like this.

@andrewrk
Member

Wow, great work, you two. I didn't realize you had discovered UB upstream.

@dweiller dweiller force-pushed the zstandard branch 2 times, most recently from c6a89d1 to cbb8066 on February 21, 2023 03:35
Comment on lines +16 to +19
pub fn DecompressStream(
comptime ReaderType: type,
comptime options: DecompressStreamOptions,
) type {
Member

The gzip, lzma, and xz implementations all expose the same std.compress.foo.Decompress signature:

pub fn Decompress(comptime ReaderType: type) type {

This is leveraged in commit d94613c for example to make some code generic over the decompression algorithm.

Is it strictly necessary to have the options comptime-known for the zstd implementation? If not I think it would be better to conform to the existing interface and instead pass the options at runtime to the init() function. This would also reduce generic code bloat in some cases.

Furthermore, std.compress.zstd feels more consistent with the current members of the std.compress namespace to me:

pub const deflate = @import("compress/deflate.zig");
pub const gzip = @import("compress/gzip.zig");
pub const lzma = @import("compress/lzma.zig");
pub const lzma2 = @import("compress/lzma2.zig");
pub const xz = @import("compress/xz.zig");
pub const zlib = @import("compress/zlib.zig");

Contributor Author

I agree that std.compress.zstd is probably a better name to go with. As for the std.compress.foo.Decompress signature, the package manager only uses the std.compress.foo.decompress() function, which isn't affected by the comptime parameters, but matching this API will probably require at least the window_size_max comptime parameter to DecompressStream.

It is not strictly necessary to have the options be comptime - the verify_checksum option being comptime just allows the checksum computation to be compiled out. The main thing is that not having them be comptime will either require us to hard-code the values (for max_window_size this is not a good solution, as someone may have Zstandard frames they cannot decode because of it), or some other part of the API will have to diverge from the others in std.compress, most likely the init() function as you mentioned; that is a more intrusive divergence from the other decompression algorithms, and I don't think it will really allow for code that is generic across them. In normal usage you can just use the default comptime parameters to get an API that matches that of the other decompression algorithms.

I have a branch that adds Zstandard support to the package manager generically, just like gzip and xz - the first commit on the branch renames all the Decompress/decompress functions to DecompressStream/decompressStream, which I think is a better name. As I mentioned earlier, if we really want to match the APIs we may need a separate issue to decide on one that works well for all of them (e.g. the Zstandard init() function has no need to return an error union, which the others currently do, though I don't know how necessary this is), and I don't think Decompress/decompress is a great name, especially when non-streaming decompression functions are also available.

Member

FWIW, this is how the lzma decompressor currently handles options at runtime:

pub const decode = @import("lzma/decode.zig");

pub fn decompress(
    allocator: Allocator,
    reader: anytype,
) !Decompress(@TypeOf(reader)) {
    return decompressWithOptions(allocator, reader, .{});
}

pub fn decompressWithOptions(
    allocator: Allocator,
    reader: anytype,
    options: decode.Options,
) !Decompress(@TypeOf(reader)) {
    const params = try decode.Params.readHeader(reader, options);
    return Decompress(@TypeOf(reader)).init(allocator, reader, params, options.memlimit);
}

pub fn Decompress(comptime ReaderType: type) type {
     ...

It seems to me that the zstd implementation could use the same pattern here if I understand correctly that the options aren't strictly required to be comptime-known.

const Allocator = @import("std").mem.Allocator;
const assert = @import("std").debug.assert;

const RingBuffer = @This();
Member

This API has quite a bit of overlap with std.fifo.LinearFifo(), I'm not sure we should have both in the standard library.
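
For reference, a minimal example of the overlapping functionality, using LinearFifo's .Slice flavor over a caller-provided buffer:

const std = @import("std");

test "LinearFifo as a byte ring buffer" {
    var storage: [16]u8 = undefined;
    var fifo = std.fifo.LinearFifo(u8, .Slice).init(&storage);
    try fifo.write("hello"); // enqueue at the head
    var out: [5]u8 = undefined;
    const n = fifo.read(&out); // dequeue from the tail
    try std.testing.expectEqual(@as(usize, 5), n);
    try std.testing.expectEqualSlices(u8, "hello", out[0..n]);
}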

Contributor Author

There is quite a bit of overlap (I wasn't aware of LinearFifo); however, if I'm reading it correctly, LinearFifo as it currently stands isn't designed to be used the way the ring buffer is used, so I'm not sure adapting/extending LinearFifo and making use of it would be a good option. If it would be confusing to have both, I can take it back out of the std namespace, or we can leave it pending the pre-1.0 stdlib review. Thoughts @andrewrk?

Member

To add to this discussion, it looks like the lzma implementation has its own LzCircularBuffer type as well which is also a somewhat specialized ring buffer.


Then maybe LzCircularBuffer should be replaced by this RingBuffer, too.

Member

Opened #19231 to track this

@dweiller dweiller requested review from andrewrk and squeek502 and removed request for andrewrk and squeek502 February 21, 2023 13:15
@andrewrk andrewrk merged commit b52be97 into ziglang:master Feb 21, 2023
@squeek502
Collaborator

Will do some last minute fuzzing just to confirm that we're good to go on that front

To follow up on this, nothing new was found. All the findings were things we've seen before, where we have reasons for the behavior differing from the reference implementation.
