This repository has been archived by the owner on Mar 9, 2019. It is now read-only.

Crashes can cause data corruption #548

Closed
cyphar opened this issue Apr 6, 2016 · 8 comments

Comments

@cyphar
Contributor

cyphar commented Apr 6, 2016

This was noticed while testing a Docker bug where Docker was killed by the kernel due to a resource allocation issue and the resulting Bolt database was left in a corrupted state. It's a bit hard to test, and I'm trying to get a minimal test case that can be reproduced using bolt alone. Examples of the errors include:

FATA[0000] Error starting daemon: Error initializing network controller: error obtaining controller instance: failed to get bridge network configurations from store: error while populating kmap: meta0 error: checksum error

The issue is clearly that the file wasn't written properly; presumably the mapping wasn't properly synced, or something like that. Since we're using mmap'd pages, I'm not sure what happens if bolt is in the middle of modifying the in-memory state of the database when it crashes -- I assume the kernel will eventually flush the page cache. This is obviously going to be a problem because the flushed state may not be valid.

In addition, we use fdatasync rather than msync. But the msync(2) man page makes it clear that msync is needed for changes made through a mapping:

msync() flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to the filesystem. Without use of this call, there is no guarantee that changes are written back before munmap(2) is called. [emphasis added]

@cyphar cyphar changed the title Crashes can cause memory corruption Crashes can cause data corruption Apr 6, 2016
@benbjohnson
Member

we use fdatasync rather than msync

Bolt uses a read-only mmap for reading the data (syscall.PROT_READ). All changes go through a file descriptor, which is why fdatasync() is used.
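
For illustration, here's a minimal Go sketch of that pattern, assuming Linux (syscall.Fdatasync is Linux-specific) and a non-empty database file; the file name and write offset are made up, and this is not Bolt's actual code:

```go
package main

import (
	"os"
	"syscall"
)

func main() {
	// "data.db" and the offset below are illustrative; assumes the file
	// already exists and is non-empty (mmap of a zero-length file fails).
	f, err := os.OpenFile("data.db", os.O_RDWR, 0666)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// Read-only mapping: the mapped pages are never written through,
	// so msync(2) never comes into play.
	buf, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(buf)

	// All modifications go through the file descriptor...
	if _, err := f.WriteAt([]byte("new page contents"), 4096); err != nil {
		panic(err)
	}
	// ...and are made durable with fdatasync(2).
	if err := syscall.Fdatasync(int(f.Fd())); err != nil {
		panic(err)
	}
}
```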

The issue is clearly that the file wasn't written properly

The issue looks like one of the meta pages was partially written and bolt is not rolling back to the previous transaction correctly. On the plus side, the data file isn't corrupt. The recovery code just needs to fall back if a checksum error occurs.

@cyphar
Contributor Author

cyphar commented Apr 6, 2016

The issue is clearly that the file wasn't written properly

The issue looks like one of the meta pages was partially written and bolt is not rolling back to the previous transaction correctly. On the plus side, the data file isn't corrupt. The recovery code just needs to fall back if a checksum error occurs.

Maybe I don't understand how boltdb works internally, but is "partially written" data (meaning the process died while modifying the memory that was to be flushed to disk) not a problem, no matter which part of the data it is? I guess that makes sense if transactions are immutable, but are you sure there's no place where this could go badly (like rewriting references to other transactions or something)?

@benbjohnson
Member

is "partially written" data (meaning the process died while modifying the memory that was to be flushed to disk) not a problem?

During a commit, there are two phases:

  1. Write data pages/freelist & fsync().
  2. Write meta page & fsync().

Partial writes can't corrupt the data. If a partial write happens in phase 1, there's no meta page pointing to the newly written data pages & freelist, so those pages are effectively ignored. The data pages and freelist are written to pages that were unused/free in the previous transaction, so overwriting them doesn't affect anything live.
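
As a rough sketch of that ordering (hypothetical types and names; not Bolt's actual code):

```go
package sketch

import (
	"os"
	"syscall"
)

// page is a hypothetical stand-in for a dirty page buffer.
type page struct {
	offset int64
	buf    []byte
}

// commit illustrates the two-phase ordering described above.
func commit(f *os.File, dirty []page, metaPage page) error {
	// Phase 1: data pages & freelist land on pages that were free in the
	// previous transaction, so a torn write clobbers nothing live.
	for _, p := range dirty {
		if _, err := f.WriteAt(p.buf, p.offset); err != nil {
			return err
		}
	}
	if err := syscall.Fdatasync(int(f.Fd())); err != nil {
		return err
	}

	// Phase 2: only now is the meta page written. A crash anywhere before
	// this point leaves the previous meta page (and therefore the previous
	// transaction) fully intact.
	if _, err := f.WriteAt(metaPage.buf, metaPage.offset); err != nil {
		return err
	}
	return syscall.Fdatasync(int(f.Fd()))
}
```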

Bolt uses two meta pages at the beginning of the data file to protect against partial meta page writes. Each new transaction flips between the two: txn 1 writes to meta 0, txn 2 writes to meta 1, txn 3 writes to meta 0, txn 4 writes to meta 1, etc. Since the meta pages are checksummed, if there is a partial write to one of them then Bolt should detect that and fall back to the previous meta page. The fallback is the part that's not working right now.
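
In code, the alternation and the checksum check might look roughly like this (illustrative only; the real meta page carries more fields, e.g. magic, version, page size, and a freelist pointer, and all names here are assumptions):

```go
package sketch

import (
	"encoding/binary"
	"hash/fnv"
)

const pageSize = 4096

// meta is an illustrative stand-in for Bolt's meta page.
type meta struct {
	txid     uint64
	root     uint64 // page id of the root bucket
	checksum uint64 // checksum over the fields above
}

// sum64 computes a checksum over the meta fields.
func (m *meta) sum64() uint64 {
	h := fnv.New64a()
	binary.Write(h, binary.BigEndian, m.txid)
	binary.Write(h, binary.BigEndian, m.root)
	return h.Sum64()
}

// validate reports whether the meta page survived its write intact.
func (m *meta) validate() bool { return m.checksum == m.sum64() }

// metaOffset is the file offset a committing transaction writes its
// meta page to: even txids use slot 0, odd txids use slot 1.
func metaOffset(txid uint64) int64 { return int64(txid%2) * pageSize }
```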

are you sure there's no place where this could go badly (like rewriting references to other transactions or something)?

Bolt is a pretty simple database. There aren't any txn references to rewrite, or write-ahead logs, or compactions, or really anything too fancy going on. Tx.Commit() is the only function which actually writes out to the data file, and it has well-defined behavior (i.e. the two phases above).

@xiang90
Contributor

xiang90 commented Apr 6, 2016

@cyphar What is the commit hash of bolt you are using? Are you able to reproduce the corruption?

@cyphar
Contributor Author

cyphar commented Apr 18, 2016

@xiang90 This happens with all versions of Docker, the latest of which has vendored v1.2.0. I haven't had a chance to reproduce the problem, but as @benbjohnson said, this fallback is known to be unimplemented in boltdb.

However, this is really frustrating because this error means that people who use Docker can't actually start their daemon without nuking their entire setup if the database is in this state.

@kron4eg

kron4eg commented Apr 19, 2016

The fallback is the part that's not working right now

That is unfortunate... @benbjohnson maybe if you could provide some clues on how it should behave, someone will make a PR.

As I understand it: here https://github.com/boltdb/bolt/blob/master/db.go#L202-L204 it should try to validate the other page, and here https://github.com/boltdb/bolt/blob/master/db.go#L266-L271 it should somehow gracefully degrade and reset the meta (but how? I'm not familiar with the source).

@benbjohnson
Member

@kron4eg It's a relatively easy fix. I've been trying to find time in the last week to fix it, but I simply haven't been able to get to it. Thanks for digging into it.

You're on the right track. Here's some more info:

  • [db.go#L202-L204](https://github.com/boltdb/bolt/blob/master/db.go#L202-L204) - Bolt sets the page size of the database based on the page size of the OS it was created on. That's what we're trying to read from the first page. It tries to validate the first page, but if that's invalid then we should probably just assume os.Getpagesize() instead of returning an error.
  • [db.go#L266-L271](https://github.com/boltdb/bolt/blob/master/db.go#L266-L271) - This should be changed to only return an error if both meta0 and meta1 return an error from validate().
  • [db.go#L779-L785](https://github.com/boltdb/bolt/blob/master/db.go#L779-L785) - This function should be changed to return the meta page with the highest txid where meta.validate() returns nil (see the sketch below). If both meta pages are invalid then it should just panic(). That situation shouldn't ever occur since we're validating on the mmap() and we're doing fsync() on the writes.
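
For the last point, here's a sketch of the fallback, reusing the illustrative meta type from the earlier sketch (pickMeta is an assumed name, not Bolt's actual function):

```go
// pickMeta returns the valid meta page with the highest txid, falling
// back to the other page if one fails validation.
func pickMeta(meta0, meta1 *meta) *meta {
	ok0, ok1 := meta0.validate(), meta1.validate()
	switch {
	case ok0 && ok1:
		if meta0.txid > meta1.txid {
			return meta0
		}
		return meta1
	case ok0:
		return meta0
	case ok1:
		return meta1
	default:
		// Both meta pages invalid: should be unreachable given the
		// validation at mmap() time and the fsync'd writes.
		panic("bolt: both meta pages invalid")
	}
}
```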

This corruption should be exceedingly rare since disks will generally write a full sector (512 bytes) atomically even in the event of power loss. The meta pages are page-aligned and the meta data fits within a single sector.

@cyphar
Contributor Author

cyphar commented Apr 19, 2016

@benbjohnson Using the dot points you provided, I made a PR (#555). PTAL. My main concern was with your first dot point: whether assuming os.Getpagesize() would cause issues if the stored page size is invalid and you're opening the database on a different operating system than the one it was created on.
