Crashes can cause data corruption #548

cyphar · 2016-04-06T13:43:52Z

This was noticed while testing a Docker bug where Docker was killed by the kernel due to a resource allocation issue and the resultant Bolt database is in a corrupted state. It's a bit hard to test this, and I'm trying to get a nice test case that can be tested just using bolt. Examples of the errors include:

FATA[0000] Error starting daemon: Error initializing network controller: error obtaining controller instance: failed to get bridge network configurations from store: error while populating kmap: meta0 error: checksum error

The issue is clearly that the file wasn't written properly, presumably the mapping wasn't properly synced or something like that. Since we're using mmap'd pages I'm not sure what happens if bolt is in the middle of modifying the in-memory state of the database and then it crashes -- I assume the kernel will flush the page cache eventually. This is obviously going to be a problem because the state may not be valid.

In addition, we use fdatasync rather than msync. But if you look at the msync(2) man page it makes clear you should only use it:

msync() flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to the filesystem. Without use of this call, there is no guarantee that changes are written back before munmap(2) is called. [emphasis added]

The text was updated successfully, but these errors were encountered:

benbjohnson · 2016-04-06T14:23:50Z

we use fdatasync rather than msync

Bolt uses a read only memory mmap for reading the data (syscall.PROT_READ). All changes go through a file descriptor so that's why fdatasync() is used.

The issue is clearly that the file wasn't written properly

The issue looks like one of the meta pages was partially written and bolt is not rolling back to the previous transaction correctly. On the plus side, the data file isn't corrupt. The recovery code just needs to fall back if a checksum error occurs.

cyphar · 2016-04-06T14:31:04Z

The issue is clearly that the file wasn't written properly

The issue looks like one of the meta pages was partially written and bolt is not rolling back to the previous transaction correctly. On the plus side, the data file isn't corrupt. The recovery code just needs to fall back if a checksum error occurs.

Maybe I don't understand how boltdb works internally, but is "partially written" data (meaning the process died while modifying the memory that was to be flushed to disk) not a problem? No matter which part of the data it is? I guess that makes sense if transactions are immutable, but are you sure there's no place where this could go badly (like rewriting references to other transactions or something?

benbjohnson · 2016-04-06T14:54:01Z

is "partially written" data (meaning the process died while modifying the memory that was to be flushed to disk) not a problem?

During a commit, there are two phases:

Write data pages/freelist & fsync()
Write meta page & fsync().

Partial writes aren't a problem for data corruption. If a partial write happens in phase 1 then there's no meta page to point to the newly written data pages & freelist so those pages are effectively ignored. The data pages and freelist are written to what were unused, free pages in the previous transaction so overwriting those pages doesn't do anything.

Bolt uses two meta pages at the beginning of the data file to protect against partial meta page writes. Each new transaction will flip between the pages. Txn 1 writes to meta 0, txn 2 writes to meta 1, txn 3 writes to meta 0, txn 4 writes to meta 1, etc. Since the meta pages are checksummed, if there is a partial write to one of them then Bolt should detect that and fallback to the previous meta page. The fallback is the part that's not working right now.

are you sure there's no place where this could go badly (like rewriting references to other transactions or something?

Bolt is a pretty simple database. There aren't any txn references to rewrite or write ahead logs or compactions or really anything too fancy going on. The Tx.Commit() is the only function which actually writes out to the data file and it has a well defined behavior (e.g. the two phases).

xiang90 · 2016-04-06T21:28:06Z

@cyphar What is the commit hash of bolt you are using? Are you able to reproduce the corruption?

cyphar · 2016-04-18T12:28:32Z

@xiang90 This happens with all versions of Docker, the latest of which has vendored v1.2.0. I haven't had a chance to reproduce the problem, but as @benbjohnson said this is a known unimplemented feature of boltdb.

However, this is really frustrating because this error means that people who use Docker can't actually start their daemon without nuking their entire setup if the database is in this state.

kron4eg · 2016-04-19T09:03:18Z

The fallback is the part that's not working right now

That is unfortunate... @benbjohnson maybe if you could provide some clue on how it should behave someone will make an PR.

As I understand here: https://github.com/boltdb/bolt/blob/master/db.go#L202-L204 it should try to validate another page, and here https://github.com/boltdb/bolt/blob/master/db.go#L266-L271 it should somehow gracefully degrade and reset meta (but how? I'm not familiar with source).

benbjohnson · 2016-04-19T15:04:17Z

@kron4eg It's relatively easy fix. I've been trying to find time in the last week to fix it but I simply haven't been able to get to it. Thanks for digging into it.

You're on the right track. Here's some more info:

[db.go#L202-204](https://github.com/boltdb/bolt/blob/master/db.go#L202-L204) - Bolt sets the page size on the database based on the page size of the OS that it's created on. That's what we're trying to read from the first page. It's trying to validate the first page but if it's incorrect then we should probably just assume the os.Getpagesize()` instead of returning an error.
db.goL266-L271 - This should be changed to only return an error if both meta0 and meta1 return an error from validate().
db.go#L779-L785 - This function should be changed to return the meta page with the highest txid where meta.validate() returns nil. If both meta pages are invalid then it should just panic(). That situation shouldn't ever occur since we're validating on the mmap() and we're doing fsync() on the writes.

This corruption should be exceedingly rare since disks will generally write a full sector (512 bytes) even in the event of power loss. The meta pages are page aligned and the meta data fits within a sector.

cyphar · 2016-04-19T21:59:02Z

@benbjohnson Using the dotpoints you provided, I made a PR (#555). PTAL. The main concern I had was with your first dot point and whether that would cause issues if you had an invalid pagesize and you were opening the database on a different operating system than the one you created it on.

cyphar changed the title ~~Crashes can cause memory corruption~~ Crashes can cause data corruption Apr 6, 2016

cyphar mentioned this issue Apr 19, 2016

db: fix recovery from unsynced metadata #555

Merged

benbjohnson closed this as completed in #555 Apr 24, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crashes can cause data corruption #548

Crashes can cause data corruption #548

cyphar commented Apr 6, 2016

benbjohnson commented Apr 6, 2016

cyphar commented Apr 6, 2016

benbjohnson commented Apr 6, 2016

xiang90 commented Apr 6, 2016

cyphar commented Apr 18, 2016

kron4eg commented Apr 19, 2016

benbjohnson commented Apr 19, 2016

cyphar commented Apr 19, 2016

Crashes can cause data corruption #548

Crashes can cause data corruption #548

Comments

cyphar commented Apr 6, 2016

benbjohnson commented Apr 6, 2016

cyphar commented Apr 6, 2016

benbjohnson commented Apr 6, 2016

xiang90 commented Apr 6, 2016

cyphar commented Apr 18, 2016

kron4eg commented Apr 19, 2016

benbjohnson commented Apr 19, 2016

cyphar commented Apr 19, 2016