ext3/ext4 is not fully POSIX, to be safe there, need to fsync after file size changes #284

tv42 · 2015-01-16T20:19:48Z

@cespare pointed out this conversation to me: http://www.openldap.org/lists/openldap-devel/201411/msg00000.html (I've seen the talk before, but didn't pay enough attention).

Reading http://linux.die.net/man/2/fdatasync says (my emphasis) "fdatasync() [...] does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. [...] On the other hand, a change to the file size (st_size, as made by say ftruncate(2)), would require a metadata flush."

Quoting the paper (my emphasis): "On ext3 with the default “ordered” journaling mode, the file data is forced directly out to the main file system prior to its metadata being committed to the journal. This is why we observe the journaling of the length update (op#399, 400, and 402) after the file data updates(op#342–398)." "[...] means fdatasync on ext3 does not wait for the completion of journaling (similar behavior has been observed on ext4)."

Currently, bolt seems to rely on individual page writes increasing the st_size of the file; there's no file size change where it actually notices it needs more space:

bolt/db.go

Line 554 in 15a58b0

// Resize mmap() if we're at the end.

So, my takeaway from this is, ext3, and likely ext4, have a bug / "design tradeoff". If bolt wants to be pragmatic it should probably accommodate for that bug.

I think it would be enough to do something like this:

    if minsz >= db.datasz {
        if err := db.file.Truncate(minsz); err != nil {
            return nil, fmt.Errorf("file resize error: %s", err)
        }
        if err := db.file.Sync(); err != nil {
            return nil, fmt.Errorf("file sync error: %s", err)
        }
        if err := db.mmap(minsz); err != nil {
            return nil, fmt.Errorf("mmap allocate error: %s", err)
        }
    }

Doing the resize once, and not page-by-page when writing out dirty pages, might even be more efficient.

benbjohnson · 2015-01-18T15:31:36Z

Awesome, thanks for the fix! I moved the code over to bolt_unix.go#mmap(). That way it occurs after the max mmap size check and, since it's in the Unix-specific file, I don't have to re-test on Windows. :)

#286

benbjohnson · 2015-05-01T20:24:32Z

@tv42 Is the Truncate() necessary in this change? Shouldn't the fsync() be enough to flush the metadata? I'm seeing really long wait times on Truncate() sometimes on XFS.

tv42 · 2015-05-01T22:23:57Z

@benbjohnson The truncate is the part that resizes the file; if you skip that, the file remains small until you write to it, and since data is only synced with fdatasync, that doesn't need to write the new size -> after a crash, the part that was after the old size of the file may be lost. The delay you're seeing is probably xfs actually allocating space for the file, so it can later guarantee writes to it will be "easy".

If you remove either the truncate or the sync, it's as if this fix never went in.

benbjohnson · 2015-05-04T16:56:48Z

@tv42 Thanks, my question was dumb now that I read it again. :)

I'm going to add a DB.NoTruncate flag to skip truncation. It'll be false by default but turning it on will avoid the truncation step. I'm still not able to figure out why xfs is hanging for me. It's taking minutes with no events out of strace or any io stats tools. The flag will also allow bolt to not allocate space up front when it doesn't need to.

benbjohnson · 2015-05-04T16:59:04Z

/cc @mkobetic @bouk

This commit adds the DB.NoTruncate flag to optionally revert mmap() calls to how they were implemented before the ext3/ext4 fix. When NoTruncate is true, remapping the data file will not force the file system to resize it immediately. This works for non-ext3/4 file systems. The default value of NoTruncate is false so it is still safe for ext3/ext4 file systems by default. See also: boltdb#284

This commit adds the DB.NoGrowSync flag to optionally revert mmap() calls to how they were implemented before the ext3/ext4 fix. When NoGrowSync is true, remapping the data file will not force the file system to resize it immediately. This works for non-ext3/4 file systems. The default value of NoGrowSync is false so it is still safe for ext3/ext4 file systems by default. See also: boltdb#284

benbjohnson mentioned this issue Jan 18, 2015

Add truncate() and sync() on resize. #286

Merged

benbjohnson closed this as completed in #286 Jan 18, 2015

benbjohnson mentioned this issue Feb 18, 2015

database file size not updating? #308

Closed

benbjohnson mentioned this issue May 5, 2015

Add DB.NoTruncate flag. #363

Merged

funny-falcon mentioned this issue Aug 13, 2015

no need to call Truncate and Sync if size is not changed #407

Closed

This was referenced Jan 17, 2023

why tx.db.grow() is not needed in tx.root.spill() etcd-io/bbolt#285

Closed

fix db.grow is unusable when NoFreelistSync is on etcd-io/bbolt#387

Merged

fuweid mentioned this issue Nov 19, 2023

Use file.Sync() to flush data on linux platform etcd-io/bbolt#615

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ext3/ext4 is not fully POSIX, to be safe there, need to fsync after file size changes #284

ext3/ext4 is not fully POSIX, to be safe there, need to fsync after file size changes #284

tv42 commented Jan 16, 2015

benbjohnson commented Jan 18, 2015

benbjohnson commented May 1, 2015

tv42 commented May 1, 2015

benbjohnson commented May 4, 2015

benbjohnson commented May 4, 2015

ext3/ext4 is not fully POSIX, to be safe there, need to fsync after file size changes #284

ext3/ext4 is not fully POSIX, to be safe there, need to fsync after file size changes #284

Comments

tv42 commented Jan 16, 2015

benbjohnson commented Jan 18, 2015

benbjohnson commented May 1, 2015

tv42 commented May 1, 2015

benbjohnson commented May 4, 2015

benbjohnson commented May 4, 2015