DMG/`bzip2` file read error: `bzip2: corrupted input: invalid stream magic` #405

rgmz · 2024-06-07T19:59:06Z

What version of the package or command are you using?

v4.0.0-alpha.8

What are you trying to do?

Decompress a .dmg file detected as .bzip2.

$ file -i BitcoinTemplate.dmg
BitcoinTemplate.dmg: application/x-bzip2; charset=binary

Note that there is a warning about "trailing garbage" when decompressing it using the CLI, however, it still succeeds.

$ bzip2 -d BitcoinTemplate.dmg
bzip2: Can't guess original name for BitcoinTemplate.dmg -- using BitcoinTemplate.dmg.out

bzip2: BitcoinTemplate.dmg: trailing garbage after EOF ignored

What steps did you take?

Download https://github.com/cicorias/bitcoin/raw/a4f2c8419f5a8b47c4b48e962b4ff02885164c9b/contrib/BitcoinTemplate.dmg

Run the following test

func TestDmg(t *testing.T) {
    f, err := os.Open("/tmp/BitcoinTemplate.dmg")
    require.NoError(t, err)

    format, f2, err := archiver.Identify("/tmp/BitcoinTemplate.dmg", f)
    require.NoError(t, err)

    rc, err := format.(archiver.Decompressor).OpenReader(f2)
    require.NoError(t, err)

    _, err = io.ReadAll(rc)
    require.NoError(t, err)

    assert.NoError(t, rc.Close())
}

Receive error from io.ReadAll

 Error:      	Received unexpected error:
    	         bzip2: corrupted input: invalid stream magic

What did you expect to happen, and what actually happened instead?

I expected the file to be identified and decompressed successfully.

How do you think this should be fixed?

Unsure.

Please link to any related issues, pull requests, and/or discussion

trufflesecurity/trufflehog#2935

Bonus: What do you use archiver for, and do you find it useful?

The text was updated successfully, but these errors were encountered:

mholt · 2024-06-07T21:16:34Z

As the docs say:

If stream is non-nil then the returned io.Reader will always be non-nil and will read from the same point as the reader which was passed in; it should be used in place of the input stream after calling Identify() because it preserves and re-reads the bytes that were already read during the identification process.

You're passing in a stream that gets read, but ignoring the returned stream (_) causing corruption to the reader.

rgmz · 2024-06-07T21:26:54Z

Hi Matt,

That appears to be a typo on my part from trying to condense the real messy code into a neat reproducer.

The error still occurs when the returned stream isn't ignored. I've updated the example accordingly.

mholt · 2024-06-07T23:52:31Z

Oh, ok. I will take a look.

mholt · 2024-06-08T13:45:58Z

I get the same error with that file, but I also do when just using the underlying bzip2 lib directly. I would open the issue with @dsnet -- he's quite smart and can probably tell you whether it's a bug in bzip2 package or possibly this DMG is malformed.

rgmz · 2024-06-08T21:53:19Z

Will do. Thank you for the insight!

dsnet · 2024-06-10T07:42:04Z

Hey, thanks for the ping.

It seems to me that the Go stdlib compress/bzip2 package, github.com/dsnet/compress/bzip2 package, and bzip2 CLI tool all agree that the input is faulty, but they all seem to emit the same number of bytes before reporting an error.

This was my test program:

1447936 865689b5f5a000820a8d372f2e09f154 bzip2 data invalid: bad magic value in continuation file
1447936 865689b5f5a000820a8d372f2e09f154 bzip2: corrupted input: invalid stream magic
1447936 865689b5f5a000820a8d372f2e09f154 bzip2: (stdin): trailing garbage after EOF ignored

Strictly speaking, the CLI does not report an error (in terms of a non-zero status code, but does spit out a warning to stderr).

Given that data is at least produced, perhaps this requires archiver to ignore a corrupted input error? That said, I don't see you could possibly distinguish between an actual data corruption error versus some other semantic data that happens to be at the end.

mholt · 2024-06-10T14:09:15Z

Thanks Joe -- that's a smart idea for troubleshooting.

Hmm, yeah, that is tricky then. The file is kind of invalid then? I am not sure how we are going to distinguish that. For now I'm inclined to leave archiver unchanged in this regard and reject faulty inputs. Unless another possibility is introduced.

Thanks again for chiming in!

rgmz · 2024-06-10T14:09:29Z

That said, I don't see you could possibly distinguish between an actual data corruption error versus some other semantic data that happens to be at the end.

Funny you should say that. Taking a closer look at the file, it seems like the trailing data is a quirk of .dmg files embedding their PLIST at the end. I found the same thing in older bzip2-based DMG files like transmission-2.61.dmg and newer zlib-based ones like Rectangle0.80.dmg.

$ strings BitcoinTemplate.dmg
...
@@P@
BZh11AY&SYe7
B+"T
@<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
        <key>resource-fork</key>
        <dict>
                <key>blkx</key>
                <array>
                        <dict>
                                <key>Attributes</key>
                                <string>0x0050</string>
... (continues for about 100 more lines)

It seems like DMG files can be a variety of compression formats.

The first noteable fact about the DMG file format is, that there is no DMG file format. DMGs come in a variety of sub-formats, corresponding to the different tools which create them, and their compression schemes.
...
The common denominator of most of these is the existence of a 512-byte trailer at the end of the file. This trailer is identifiable by a magic 32-bit value, 0x6B6F6C79, which is "koly" in ASCII.

http://www.newosxbook.com/DMG.html

$ strings Rectangle0.80.dmg | tail -n 2
</plist>
koly

$ strings transmission-2.61.dmg | tail -n 2
</plist>
koly

$ strings BitcoinTemplate.dmg | tail -n 2
</plist>
koly

Edit: Interestingly, some DMG files are structured in a way where there's no compression/archive headers.

# https://www.jetbrains.com/idea/download/?section=mac
$ file -i ideaIU-2024.1.2.dmg
ideaIU-2024.1.2.dmg: application/octet-stream; charset=binary

rgmz · 2024-06-10T14:44:00Z

Hmm, yeah, that is tricky then. The file is kind of invalid then? I am not sure how we are going to distinguish that. For now I'm inclined to leave archiver unchanged in this regard and reject faulty inputs. Unless another possibility is introduced.

@mholt I agree with rejecting faulty inputs, it wouldn't make sense to change this behaviour in general.

It could make sense to add some special handling for for *.dmg files, such as special logic to ignore the trailing XML + koly during read, or an error saying "DMG files not supported" to prevent confusion. That's obviously up to you.

mholt · 2024-06-10T15:36:33Z

Sure; but we can only do that if we can seek to the end of the stream -- turns out only a few streams can do that (mainly files and buffers). So we'd only be able to offer a helpful error message in some cases I suppose, but ideally we'd also couple this with a check on the error type rather than relying on string comparison.

rgmz · 2024-06-10T18:16:02Z

Sure; but we can only do that if we can seek to the end of the stream -- turns out only a few streams can do that (mainly files and buffers).

So possible but not easy or reliable. The filename is probably the easiest/most reliable way to identify a *.dmg file. ¯\(ツ)/¯

but ideally we'd also couple this with a check on the error type rather than relying on string comparison.

Can you elaborate what you mean by this? I'm happy to open a PR addressing this, with a little guidance.

dsnet · 2024-06-10T18:22:53Z

Can you elaborate what you mean by this?

@mholt is referring to the ability to distinguishing error values without resorting to doing some form of string parsing on error.Error. The github.com/dsnet/compress package has an Error interface with an IsCorrupted boolean method. The error API was designed before the invention the more recent errors.Is APIs, but suspect that they are compatible.

mholt · 2024-06-10T19:06:12Z

Ah yeah that's what I mean. If there's already a specific error value/type then that's perfect, we can probably roll with that. And just seek when it is a Seeker.

mholt closed this as not planned Won't fix, can't repro, duplicate, stale Jun 7, 2024

mholt added the invalid label Jun 7, 2024

rgmz mentioned this issue Jun 13, 2024

Expose gzip.Reader.Multistream #407

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DMG/`bzip2` file read error: `bzip2: corrupted input: invalid stream magic` #405

DMG/`bzip2` file read error: `bzip2: corrupted input: invalid stream magic` #405

rgmz commented Jun 7, 2024 •

edited

Loading

mholt commented Jun 7, 2024

rgmz commented Jun 7, 2024

mholt commented Jun 7, 2024

mholt commented Jun 8, 2024

rgmz commented Jun 8, 2024

dsnet commented Jun 10, 2024

mholt commented Jun 10, 2024

rgmz commented Jun 10, 2024 •

edited

Loading

rgmz commented Jun 10, 2024

mholt commented Jun 10, 2024

rgmz commented Jun 10, 2024

dsnet commented Jun 10, 2024

mholt commented Jun 10, 2024

DMG/bzip2 file read error: bzip2: corrupted input: invalid stream magic #405

DMG/bzip2 file read error: bzip2: corrupted input: invalid stream magic #405

Comments

rgmz commented Jun 7, 2024 • edited Loading

What version of the package or command are you using?

What are you trying to do?

What steps did you take?

What did you expect to happen, and what actually happened instead?

How do you think this should be fixed?

Please link to any related issues, pull requests, and/or discussion

Bonus: What do you use archiver for, and do you find it useful?

mholt commented Jun 7, 2024

rgmz commented Jun 7, 2024

mholt commented Jun 7, 2024

mholt commented Jun 8, 2024

rgmz commented Jun 8, 2024

dsnet commented Jun 10, 2024

mholt commented Jun 10, 2024

rgmz commented Jun 10, 2024 • edited Loading

rgmz commented Jun 10, 2024

mholt commented Jun 10, 2024

rgmz commented Jun 10, 2024

dsnet commented Jun 10, 2024

mholt commented Jun 10, 2024

DMG/`bzip2` file read error: `bzip2: corrupted input: invalid stream magic` #405

DMG/`bzip2` file read error: `bzip2: corrupted input: invalid stream magic` #405

rgmz commented Jun 7, 2024 •

edited

Loading

rgmz commented Jun 10, 2024 •

edited

Loading