Skip to content
This repository has been archived by the owner on Nov 19, 2024. It is now read-only.

DMG/bzip2 file read error: bzip2: corrupted input: invalid stream magic #405

Closed
rgmz opened this issue Jun 7, 2024 · 13 comments
Closed
Labels

Comments

@rgmz
Copy link
Contributor

rgmz commented Jun 7, 2024

What version of the package or command are you using?

v4.0.0-alpha.8

What are you trying to do?

Decompress a .dmg file detected as .bzip2.

$ file -i BitcoinTemplate.dmg
BitcoinTemplate.dmg: application/x-bzip2; charset=binary

Note that there is a warning about "trailing garbage" when decompressing it using the CLI, however, it still succeeds.

$ bzip2 -d BitcoinTemplate.dmg
bzip2: Can't guess original name for BitcoinTemplate.dmg -- using BitcoinTemplate.dmg.out

bzip2: BitcoinTemplate.dmg: trailing garbage after EOF ignored

What steps did you take?

  1. Download https://github.com/cicorias/bitcoin/raw/a4f2c8419f5a8b47c4b48e962b4ff02885164c9b/contrib/BitcoinTemplate.dmg
  2. Run the following test
    func TestDmg(t *testing.T) {
        f, err := os.Open("/tmp/BitcoinTemplate.dmg")
        require.NoError(t, err)
    
        format, f2, err := archiver.Identify("/tmp/BitcoinTemplate.dmg", f)
        require.NoError(t, err)
    
        rc, err := format.(archiver.Decompressor).OpenReader(f2)
        require.NoError(t, err)
    
        _, err = io.ReadAll(rc)
        require.NoError(t, err)
    
        assert.NoError(t, rc.Close())
    }
  3. Receive error from io.ReadAll
     Error:      	Received unexpected error:
        	         bzip2: corrupted input: invalid stream magic
    

What did you expect to happen, and what actually happened instead?

I expected the file to be identified and decompressed successfully.

How do you think this should be fixed?

Unsure.

Please link to any related issues, pull requests, and/or discussion

trufflesecurity/trufflehog#2935

Bonus: What do you use archiver for, and do you find it useful?

@mholt
Copy link
Owner

mholt commented Jun 7, 2024

As the docs say:

If stream is non-nil then the returned io.Reader will always be non-nil and will read from the same point as the reader which was passed in; it should be used in place of the input stream after calling Identify() because it preserves and re-reads the bytes that were already read during the identification process.

You're passing in a stream that gets read, but ignoring the returned stream (_) causing corruption to the reader.

@mholt mholt closed this as not planned Won't fix, can't repro, duplicate, stale Jun 7, 2024
@mholt mholt added the invalid label Jun 7, 2024
@rgmz
Copy link
Contributor Author

rgmz commented Jun 7, 2024

Hi Matt,

That appears to be a typo on my part from trying to condense the real messy code into a neat reproducer.

The error still occurs when the returned stream isn't ignored. I've updated the example accordingly.

@mholt
Copy link
Owner

mholt commented Jun 7, 2024

Oh, ok. I will take a look.

@mholt
Copy link
Owner

mholt commented Jun 8, 2024

I get the same error with that file, but I also do when just using the underlying bzip2 lib directly. I would open the issue with @dsnet -- he's quite smart and can probably tell you whether it's a bug in bzip2 package or possibly this DMG is malformed.

@rgmz
Copy link
Contributor Author

rgmz commented Jun 8, 2024

Will do. Thank you for the insight!

@dsnet
Copy link
Contributor

dsnet commented Jun 10, 2024

Hey, thanks for the ping.

It seems to me that the Go stdlib compress/bzip2 package, github.com/dsnet/compress/bzip2 package, and bzip2 CLI tool all agree that the input is faulty, but they all seem to emit the same number of bytes before reporting an error.

This was my test program:

1447936 865689b5f5a000820a8d372f2e09f154 bzip2 data invalid: bad magic value in continuation file
1447936 865689b5f5a000820a8d372f2e09f154 bzip2: corrupted input: invalid stream magic
1447936 865689b5f5a000820a8d372f2e09f154 bzip2: (stdin): trailing garbage after EOF ignored

Strictly speaking, the CLI does not report an error (in terms of a non-zero status code, but does spit out a warning to stderr).

Given that data is at least produced, perhaps this requires archiver to ignore a corrupted input error? That said, I don't see you could possibly distinguish between an actual data corruption error versus some other semantic data that happens to be at the end.

@mholt
Copy link
Owner

mholt commented Jun 10, 2024

Thanks Joe -- that's a smart idea for troubleshooting.

Hmm, yeah, that is tricky then. The file is kind of invalid then? I am not sure how we are going to distinguish that. For now I'm inclined to leave archiver unchanged in this regard and reject faulty inputs. Unless another possibility is introduced.

Thanks again for chiming in!

@rgmz
Copy link
Contributor Author

rgmz commented Jun 10, 2024

That said, I don't see you could possibly distinguish between an actual data corruption error versus some other semantic data that happens to be at the end.

Funny you should say that. Taking a closer look at the file, it seems like the trailing data is a quirk of .dmg files embedding their PLIST at the end. I found the same thing in older bzip2-based DMG files like transmission-2.61.dmg and newer zlib-based ones like Rectangle0.80.dmg.

$ strings BitcoinTemplate.dmg
...
@@P@
BZh11AY&SYe7
B+"T
@<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
        <key>resource-fork</key>
        <dict>
                <key>blkx</key>
                <array>
                        <dict>
                                <key>Attributes</key>
                                <string>0x0050</string>
... (continues for about 100 more lines)

It seems like DMG files can be a variety of compression formats.

The first noteable fact about the DMG file format is, that there is no DMG file format. DMGs come in a variety of sub-formats, corresponding to the different tools which create them, and their compression schemes.
...
The common denominator of most of these is the existence of a 512-byte trailer at the end of the file. This trailer is identifiable by a magic 32-bit value, 0x6B6F6C79, which is "koly" in ASCII.

http://www.newosxbook.com/DMG.html

$ strings Rectangle0.80.dmg | tail -n 2
</plist>
koly

$ strings transmission-2.61.dmg | tail -n 2
</plist>
koly

$ strings BitcoinTemplate.dmg | tail -n 2
</plist>
koly

Edit: Interestingly, some DMG files are structured in a way where there's no compression/archive headers.

# https://www.jetbrains.com/idea/download/?section=mac
$ file -i ideaIU-2024.1.2.dmg
ideaIU-2024.1.2.dmg: application/octet-stream; charset=binary

@rgmz
Copy link
Contributor Author

rgmz commented Jun 10, 2024

Hmm, yeah, that is tricky then. The file is kind of invalid then? I am not sure how we are going to distinguish that. For now I'm inclined to leave archiver unchanged in this regard and reject faulty inputs. Unless another possibility is introduced.

@mholt I agree with rejecting faulty inputs, it wouldn't make sense to change this behaviour in general.

It could make sense to add some special handling for for *.dmg files, such as special logic to ignore the trailing XML + koly during read, or an error saying "DMG files not supported" to prevent confusion. That's obviously up to you.

@mholt
Copy link
Owner

mholt commented Jun 10, 2024

Sure; but we can only do that if we can seek to the end of the stream -- turns out only a few streams can do that (mainly files and buffers). So we'd only be able to offer a helpful error message in some cases I suppose, but ideally we'd also couple this with a check on the error type rather than relying on string comparison.

@rgmz
Copy link
Contributor Author

rgmz commented Jun 10, 2024

Sure; but we can only do that if we can seek to the end of the stream -- turns out only a few streams can do that (mainly files and buffers).

So possible but not easy or reliable. The filename is probably the easiest/most reliable way to identify a *.dmg file. ¯\(ツ)

but ideally we'd also couple this with a check on the error type rather than relying on string comparison.

Can you elaborate what you mean by this? I'm happy to open a PR addressing this, with a little guidance.

@dsnet
Copy link
Contributor

dsnet commented Jun 10, 2024

Can you elaborate what you mean by this?

@mholt is referring to the ability to distinguishing error values without resorting to doing some form of string parsing on error.Error. The github.com/dsnet/compress package has an Error interface with an IsCorrupted boolean method. The error API was designed before the invention the more recent errors.Is APIs, but suspect that they are compatible.

@mholt
Copy link
Owner

mholt commented Jun 10, 2024

Ah yeah that's what I mean. If there's already a specific error value/type then that's perfect, we can probably roll with that. And just seek when it is a Seeker.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

3 participants