Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add protection when mmap somehow fails #362

Merged
merged 2 commits into from
Jan 12, 2023
Merged

Add protection when mmap somehow fails #362

merged 2 commits into from
Jan 12, 2023

Conversation

ahrtr
Copy link
Member

@ahrtr ahrtr commented Dec 21, 2022

Provent bbolt from panicking in case #262

In (*DB) mmap, it munmap the file firstly, then mmap it again. If the munmap somehow fails, then the db.data is reset to nil. In this case, there is no need to execute rollback.

Signed-off-by: Benjamin Wang wachao@vmware.com

@ptabor
Copy link
Contributor

ptabor commented Dec 21, 2022

Thank you @ahrtr. This seems alligned with bbolt_unix.go so LGTM.
Seems we should be running our tests on windows as well (#364).

I might missing something but in general the db.data=nil path looks scary to me:

  • db.meta1, db.meta2 pages keep pointing on some page allocated within the db.data variable...
    seems golang will not free the memory (https://groups.google.com/g/golang-nuts/c/yNis7bQG_rY/m/yaJFoSx1hgIJ) but we have state that is inconsistent with some assumptions.
    We should have a single method that closes 'data' and all cleans or state depending on 'data' (like meta).

  • What happens if we post such rollback, will try to beging another transaction (e.g. db.Update(...)).
    From quick glance at the code I don't see any protection... and I expect panic to be just delayed till the next transaction attempt.

@boreq
Copy link

boreq commented Dec 22, 2022

I am on "unix" and since mmap in bolt_unix.go doesn't clear db.data on error something like this would be needed to make this fix work for me:

diff --git a/db.go b/db.go
index a798c39..93a09aa 100644
--- a/db.go
+++ b/db.go
@@ -373,6 +373,7 @@ func (db *DB) mmap(minsz int) error {
 
 	// Memory-map the data file as a byte slice.
 	if err := mmap(db, size); err != nil {
+		db.data = nil
 		return err
 	}
 

I think it is a better approach as it works for all platforms. That being said there are more edge cases where this can break things. I don't have the time to write a test for this right now I am afraid though.

goroutine 9 [running]:
runtime.throw({0x105c574ab?, 0x1057140d8?})
	runtime/panic.go:1047 +0x40 fp=0x1301c22f0 sp=0x1301c22c0 pc=0x10573d660
runtime.sigpanic()
	runtime/signal_unix.go:842 +0x1a0 fp=0x1301c2320 sp=0x1301c22f0 pc=0x1057543e0
go.etcd.io/bbolt.(*DB).meta(0x1082e8108?)
	go.etcd.io/bbolt@v1.3.6/db.go:949 +0x28 fp=0x1301c2380 sp=0x1301c2330 pc=0x10598e248
go.etcd.io/bbolt.(*Tx).init(0x1301b81c0, 0x1300ccfc0)
	go.etcd.io/bbolt@v1.3.6/tx.go:50 +0xa0 fp=0x1301c23f0 sp=0x1301c2380 pc=0x1059956f0
go.etcd.io/bbolt.(*DB).beginTx(0x1300ccfc0)
	go.etcd.io/bbolt@v1.3.6/db.go:615 +0x1cc fp=0x1301c2480 sp=0x1301c23f0 pc=0x10598c82c
go.etcd.io/bbolt.(*DB).Begin(0x1301c2568?, 0x6c?)
	go.etcd.io/bbolt@v1.3.6/db.go:592 +0x30 fp=0x1301c24a0 sp=0x1301c2480 pc=0x10598c630
go.etcd.io/bbolt.(*DB).View(0x1301c2548?, 0x1301c2578)
	go.etcd.io/bbolt@v1.3.6/db.go:757 +0x30 fp=0x1301c2510 sp=0x1301c24a0 pc=0x10598d3f0
...irrelevant stuff

To sum this up I don't think this fix is as simple as it seems.

@ahrtr ahrtr marked this pull request as draft December 23, 2022 23:34
@ahrtr ahrtr marked this pull request as ready for review January 10, 2023 07:16
@ahrtr ahrtr changed the title skip rollback if db or db.data is nil Add protection when mmap somehow fails Jan 10, 2023
@ahrtr
Copy link
Member Author

ahrtr commented Jan 10, 2023

Updated. PTAL @ptabor @boreq

@ahrtr ahrtr requested a review from ptabor January 10, 2023 07:22
@ahrtr ahrtr added this to the 1.3.7 milestone Jan 11, 2023
@ahrtr
Copy link
Member Author

ahrtr commented Jan 11, 2023

db.go Outdated
@@ -482,6 +482,14 @@ func (db *DB) mmap(minsz int) error {

// munmap unmaps the data file from memory.
func (db *DB) munmap() error {
defer func() {
db.dataref = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider moving it to explicit db.invalidateUnsafe() method
(and comment it should be execute under the exclusive mmapLock and metaLock)

Copy link
Member Author

@ahrtr ahrtr Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated (see below). I prefer not to have the suffix "Unsafe", and not to add the comment something like it should be execute under the exclusive mmapLock and metaLock as well. BoltDB has well-designed lock mechanism (see below), we shouldn't worry about the lock on each single internal method/function.

For write transaction, it requires the rwlock at db.rwlock.Lock(), and require the mmaplock when it needs to allocate more space accordingly needs to remap the db (of course release the lock when it finishes mapping), see db.mmaplock.Lock().

For readonly transaction, it only requires the mmaplock at db.mmaplock.RLock(), to prevent the db file from being remapped by anther write transaction.

The new added db.invalidateUnsafe() is only called in db.munmap(), so it's safe.

func (db *DB) invalidate() {
	db.dataref = nil
	db.data = nil
	db.datasz = 0

	db.meta0 = nil
	db.meta1 = nil
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is etcd convention (that I though is even golang convention, but I haven't found any signs) to call
methods 'UnsafeFoo' when the caller of such method needs to take care of proper locking before calling such method.

$ git grep Unsafe | grep func | nl

shows 93 cases.

I do think that the call of 'invalidate' is safe in that context. I just want to avoid situation that someone will call 'invalidate' from context where the locks are not taken... because the person haven't read carefully that the body of the method interacts with the db.data and db.meta and these fields must be mutated under the exclusive lock.

I'm not insisting on the "Unsafe" convention... but I don't think comment costs as here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, etcd has the convention (UnsafeFoo vs Foo), but there is NO such convention in bbolt for now. If we change it to invalidateUnsafe, then it isn't consistent with other internal methods.

So I'd like to keep it as it's for now.

Of course, please feel free to raise a separate ticket if you do think we should follow the same convention as etcd.

db_test.go Outdated Show resolved Hide resolved

db.meta0 = nil
db.meta1 = nil
}()
if err := munmap(db); err != nil {
return fmt.Errorf("unmap error: " + err.Error())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a topic for a separate PR... but I think we should start logging Errors early from the failed syscalls...

There is high probability they will lead to panic later (either on bbolt side, but most likely on customer'a application side) and there is significant probability that customer will not log the error themselves...
And this might lead to hard to diagnose problems of "etcd not working" while there was OOM or no disk space.

Failed syscalls is something that shoud not happen on well configured machines, so it should not make the app logs spammy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will think about this separately.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
Signed-off-by: Benjamin Wang <wachao@vmware.com>
Copy link
Contributor

@ptabor ptabor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this.
Reported #382 to cover it with better (or any) integration testing.

@ahrtr
Copy link
Member Author

ahrtr commented Jan 12, 2023

Thanks @ptabor for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants