Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spec: 0-sized compressed blocks in frames with window_size = 0 #3482

Closed
dweiller opened this issue Feb 8, 2023 · 8 comments
Closed

spec: 0-sized compressed blocks in frames with window_size = 0 #3482

dweiller opened this issue Feb 8, 2023 · 8 comments

Comments

@dweiller
Copy link

dweiller commented Feb 8, 2023

The spec (according to my interpretation) would suggest that it is not possible have a compressed block in a frame that has Window_Size equal to 0. A compressed block always needs at least two bytes of Block_Content (the literals and sequences headers).

The test case at test/golden-decompression/empty-block.zst is the bytes 28b5 2ffd 2000 1500 0000 00; the only bit set in the frame header is the single segment flag set - according to the RFC means that it's Window_Size is the content size, which is 0. The spec then says that the Block_Content is limited by min(1<<17, Window_Size) which is 0 in this case. However, empty-block.zst has a compressed block with block content of size 2. It seems to me that this frame isn't valid according to the spec and would require a window descriptor to increase the window size while keeping the content size at 0.

We came across this issue while implementing a decompressor for Zig and were unsure what the correct behaviour is supposed to be. I've just found #3090 and #3118 which make it look like the current behaviour (this frame decompressing successfully) is intended, but in this case I think the spec needs a tweak to cover this edge case.

@dweiller dweiller changed the title 0-sized compressed blocks in frames with window_size = 0 spec: 0-sized compressed blocks in frames with window_size = 0 Feb 8, 2023
@Cyan4973
Copy link
Contributor

Cyan4973 commented Feb 8, 2023

When you mention "The Spec" I presume you refer to https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md , current version (v0.3.7) ?

(Specification has evolved a bit over time, and while the main parts have remained stable, little edge details (like this one) could have moved or the exact wording could have been updated, it's an easy source of difference in interpretation.)

Now, according to the wording of this specification, you are right, an empty frame with a known decompressed size has a window size of 0. Therefore, its maximum block size is 0. Therefore, any block can only have a size of 0 (block header excluded), and therefore can only be sent uncompressed.

Note that this is different from sending an empty compressed block in any other context : as soon as the frame is non empty, windowSize is very likely > 2, and it becomes possible to send an empty block using the compressed format, effectively expanding its size in the process (which is obviously wasteful). But it's nonetheless allowed by the spec.
This is what #3118 deals with.

We shall have a look at the test case at test/golden-decompression/empty-block.zst, your explanation implies that this is an incorrect case to support by a conformant decoder.
A conformant decoder might still support such a case, by being more forgiving than the spec, but it certainly doesn't have to.

The golden sample name empty-block only implies an empty block, and it could be a valid scenario to support if the frame was anything else than an empty frame of known decompressed size. A fix could be to change the frame header of the golden sample to feature a dissociated window size (with therefore a minimum size of 1 KB) and no decompressed size. In which case, it would be a valid frame that a conformant decoder must be able to decode.

@dweiller
Copy link
Author

dweiller commented Feb 8, 2023

Yes, by spec I mean the compression format docs.

A conformant decoder might still support such a case, by being more forgiving than the spec, but it certainly doesn't have to.

I agree.

@squeek502
Copy link

squeek502 commented Feb 8, 2023

It's worth noting that the reference implementation is inconsistent on checking the block size against blockSizeMax, so depending on where exactly the block size is gotten, it can either fail or succeed. For example:

  • An example of a Block_Content check that seems to match what the compression format docs say can be found here
  • The relevant location of the missing check during ZSTD_decompress can be found here (this allows the simple_decompression example to successfully decompress empty-block.zst)
  • The relevant location of the missing check during ZSTD_decompressStream can be found here (this allows the streaming_decompression example to successfully decompress empty-block.zst)

Here's an example of a file that will succeed when run through examples/simple_decompression but fail with Data corruption detected via Decompressed Block Size Exceeds Maximum when run through examples/streaming_decompression (found via fuzzing so the file may be strange):

block-size-exceeds-max-input.zip

(it's a zip file with only the .zst input inside it)

@terrelln
Copy link
Contributor

terrelln commented Feb 8, 2023

Yeah, you are correct that the frame is invalid. It is only intending to test the empty compressed block example. I will update the frame header to use a Window_Descriptor instead of the Single_Segment_flag.

It's worth noting that the reference implementation is inconsistent on checking the block size against blockSizeMax, so depending on where exactly the block size is gotten, it can either fail or succeed. For example:

That is interesting, thanks for the example file, that is very useful!

I don't think I want to change that behavior right now, as we're about to make a release, and I'd rather make it right after the release, to make sure our fuzzers get a chance to test that change, and verify that we can still round trip successfully.

The libzstd decoder is definitely more forgiving than the spec (for example it allows any offset that is within its history buffer, even if it is beyond the window size), especially where we gain performance for being more lax. But we should strive to be consistent with the spec where it is easy to do so.

terrelln added a commit to terrelln/zstd that referenced this issue Feb 8, 2023
This frame is invalid because the `Window_Size = 0`, and the
`Block_Maximum_Size = min(128 KB, Window_Size) = 0`. But the empty
compressed block has a `Block_Content` size of 2, which is invalid.

The fix is to switch to using a `Window_Descriptor` instead of the
`Single_Segment_Flag`. This sets the `Window_Size = 1024`.

Hexdump before this PR: `28b5 2ffd 2000 1500 0000 00`

Hexdump after this PR: `28b5 2ffd 0000 1500 0000 00`

For issue facebook#3482.
@Cyan4973
Copy link
Contributor

Cyan4973 commented Feb 8, 2023

There is indeed a subtle distinction between a conformant decoder, and a validation tool.

A conformant decoder must be able to decode all combinations defined by the spec, but it might also be more permissive. This can be justified by other considerations, such as speed, binary size, memory usage or code complexity.
In contrast, a validation tool shall be very strict about what is or is not defined in the spec, but it wouldn't have to care much about speed or any kind of runtime optimization.

As a side note, the situation is reversed for the compression side : a conformant compressor must only send combinations of features allowed by the spec, but it doesn't have to support them all. This direction is typically easier to understand and accept.

But it follows that employing either a conformant decoder or encoder as validators is partially flawed, because the encoder may never generate some combinations, which are nonetheless allowed by the spec, and the decoder might permissively accept more combinations than what is strictly defined by the spec.

Bridging that gap requires dedicated tools, laser-focused on spec compliance. There are already a few tools of this kind in this repository, to help library implementers, and probably more could be added. But of course, the main issue is that such efforts cost time, and that's the scarcest resource there is.

terrelln added a commit that referenced this issue Feb 8, 2023
This frame is invalid because the `Window_Size = 0`, and the
`Block_Maximum_Size = min(128 KB, Window_Size) = 0`. But the empty
compressed block has a `Block_Content` size of 2, which is invalid.

The fix is to switch to using a `Window_Descriptor` instead of the
`Single_Segment_Flag`. This sets the `Window_Size = 1024`.

Hexdump before this PR: `28b5 2ffd 2000 1500 0000 00`

Hexdump after this PR: `28b5 2ffd 0000 1500 0000 00`

For issue #3482.
@Cyan4973
Copy link
Contributor

Fixed by @terrelln

@squeek502
Copy link

squeek502 commented Feb 16, 2023

Just for clarification, what's the status of this beyond the specific empty-block.zst file? For example, #3482 (comment) and the old empty-block.zst file that the reference decompressor still sees as valid? Would it be worth opening a new issue for either of those?

@Cyan4973
Copy link
Contributor

Cyan4973 commented Feb 16, 2023

Updating the golden file was important, because it represents a use case that a conformant decoder must support, so the older file was effectively requesting conformant decoders to support a behavior which was not in the spec.

Now, as mentioned earlier, the reference implementation is just that, an implementation. It must be able to decode what is defined in the spec, but it may also accept interpreting a few payloads which go beyond the spec, as long as they are not dangerous for system safety, and still seem to represent something valid (i.e. not grossly corrupted).

I haven't checked recently about the old empty-block.zst golden file, but if the reference decoder still supports it, then it wouldn't surprise me much, nor bother me much. It's still easily understandable and unambiguous as representing an "empty block", and yes, the spec stated that such block should not exist because it's larger than the window size (==0 in this case), but as long as these 2 bytes do not trigger any memory safety issue, I don't see that as critical to actively reject them.

Now, to be more complete, it's also not expensive to add such a filter, so if it's deemed preferable, that's certainly something that could be done.

More generally, if the point is to have a tool that would actively reject any zstd payload which goes even mildly beyond the spec, then I think we would need a dedicated tool for that, something separate from the reference decoder, and focused on this one mission.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants