core/rawdb: fix cornercase shutdown behaviour in freezer #26485
  select {
  case <-f.quit:
  default:
      close(f.quit)
  }
  f.wg.Wait()
- return err
+ return f.Freezer.Close()
Good catch!
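The ordering in the snippet above can be illustrated with a minimal, self-contained sketch. The `backend` and `worker` types here are hypothetical stand-ins (not the real `Freezer` and `chainFreezer`), and the sketch only assumes that the background loop calls `Sync` on the backend periodically. The point is that `Close` signals the loop, waits for it to exit, and only then closes the backend, so the loop never observes a closed backend.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errClosed is an assumed sentinel error for operations on a closed backend.
var errClosed = errors.New("closed")

// backend is a hypothetical stand-in for the freezer backend.
type backend struct {
	mu     sync.Mutex
	closed bool
}

func (b *backend) Sync() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.closed {
		return errClosed
	}
	return nil
}

func (b *backend) Close() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.closed = true
	return nil
}

// worker is a hypothetical stand-in for the chain freezer and its loop.
type worker struct {
	backend  *backend
	quit     chan struct{}
	wg       sync.WaitGroup
	syncErrs []error // appended by loop only; read after wg.Wait returns
}

func newWorker() *worker {
	w := &worker{backend: &backend{}, quit: make(chan struct{})}
	w.wg.Add(1)
	go w.loop()
	return w
}

// loop syncs the backend until it is told to quit.
func (w *worker) loop() {
	defer w.wg.Done()
	for {
		select {
		case <-w.quit:
			return
		default:
		}
		if err := w.backend.Sync(); err != nil {
			w.syncErrs = append(w.syncErrs, err)
			return
		}
		time.Sleep(time.Millisecond)
	}
}

// Close signals the loop first, waits for it, and closes the backend last.
func (w *worker) Close() error {
	select {
	case <-w.quit: // already signalled
	default:
		close(w.quit)
	}
	w.wg.Wait()
	return w.backend.Close()
}

func main() {
	w := newWorker()
	time.Sleep(5 * time.Millisecond)
	if err := w.Close(); err != nil {
		fmt.Println("close failed:", err)
		return
	}
	fmt.Println("sync errors seen by loop:", len(w.syncErrs))
}
```

With the reverse order (backend closed before the loop is signalled), the loop could call `Sync` on a closed backend in the window between the two steps.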
core/rawdb/freezer_table.go
Outdated
@@ -869,7 +869,9 @@ func (t *freezerTable) advanceHead() error {
  func (t *freezerTable) Sync() error {
      t.lock.Lock()
      defer t.lock.Unlock()
      if t.index == nil || t.head == nil {
Please also check if `t.meta` is nil.
I think the fix is correct and we should include it, although I'm not sure it's the root cause.
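A minimal sketch of the guard discussed in this thread, including the `meta` check requested above. The `table` type, the `newTempTable` helper, and the `errClosed` sentinel are hypothetical simplifications, and the sketch assumes that closing the table sets its file handles to nil. Whether the guard should return an error or silently succeed is exactly the open question in this PR; this version returns an error.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// errClosed is an assumed sentinel error for operations on a closed table.
var errClosed = errors.New("closed")

// table is a hypothetical, simplified stand-in for a freezer table.
type table struct {
	lock  sync.Mutex
	index *os.File
	head  *os.File
	meta  *os.File
}

// newTempTable creates a table backed by three files in a fresh temp dir.
func newTempTable() (*table, func(), error) {
	dir, err := os.MkdirTemp("", "table-sketch")
	if err != nil {
		return nil, nil, err
	}
	cleanup := func() { os.RemoveAll(dir) }
	t := &table{}
	files := map[string]**os.File{"index": &t.index, "head": &t.head, "meta": &t.meta}
	for name, dst := range files {
		f, err := os.Create(filepath.Join(dir, name))
		if err != nil {
			t.Close()
			cleanup()
			return nil, nil, err
		}
		*dst = f
	}
	return t, cleanup, nil
}

// Sync flushes all files, but refuses to touch handles that Close cleared.
func (t *table) Sync() error {
	t.lock.Lock()
	defer t.lock.Unlock()
	if t.index == nil || t.head == nil || t.meta == nil {
		return errClosed
	}
	for _, f := range []*os.File{t.index, t.head, t.meta} {
		if err := f.Sync(); err != nil {
			return err
		}
	}
	return nil
}

// Close closes every open handle and sets it to nil; it returns the first error.
func (t *table) Close() error {
	t.lock.Lock()
	defer t.lock.Unlock()
	var first error
	for _, f := range []**os.File{&t.index, &t.head, &t.meta} {
		if *f == nil {
			continue
		}
		if err := (*f).Close(); err != nil && first == nil {
			first = err
		}
		*f = nil
	}
	return first
}

func main() {
	t, cleanup, err := newTempTable()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()
	fmt.Println("sync while open:", t.Sync())
	t.Close()
	fmt.Println("sync after close:", t.Sync())
}
```

Because the guard returns a sentinel, a caller or a test can compare against it and tell a shutdown race apart from a genuine I/O failure.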
core/rawdb/freezer_test.go
Outdated
if err := f.Sync(); err != nil {
    t.Fatal(err)
}
So, I'll just change this, and expect an error, and that the error (or error-string) is "closed", then?
-     log.Info("Failed to retrieve ancient root", "err", err)
-     return err
- }
+ ancient := stack.ResolveAncient("chaindata", ctx.String(utils.AncientFlag.Name))
Any particular reason for this change? Is it to resolve the ancient datadir directly, without involving the chain DB?
Yes, for the investigation here: #26483 (comment), I needed to do a `geth db inspect`, but I didn't actually have a leveldb -- I only had a couple of index files.
This whole `utils.MakeChainDatabase` assumes things to be pretty well ordered, but all we use it for eventually is to help resolve the ancientdir, via `db.AncientDir`.
So this change has the same effect as the original code, but is more robust in case the data is not fully consistent / present.
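To illustrate why resolving the directory needs no open database, here is a minimal sketch of pure path resolution. The `resolveAncient` function and its rules (empty value falls back to `<instanceDir>/<name>/ancient`, a relative value is resolved against the instance directory, an absolute value is used as-is) are assumptions made for illustration; they are not taken from this PR and may differ from what `stack.ResolveAncient` actually does.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveAncient is a hypothetical helper: it derives the ancient directory
// from paths alone, so it works even when the key-value store is missing.
func resolveAncient(instanceDir, name, ancient string) string {
	switch {
	case ancient == "":
		// Assumed default: an "ancient" folder inside the named database dir.
		return filepath.Join(instanceDir, name, "ancient")
	case !filepath.IsAbs(ancient):
		// Assumed rule: relative paths are resolved against the instance dir.
		return filepath.Join(instanceDir, ancient)
	default:
		return ancient
	}
}

func main() {
	fmt.Println(resolveAncient("/data/geth", "chaindata", ""))
	fmt.Println(resolveAncient("/data/geth", "chaindata", "cold/ancient"))
	fmt.Println(resolveAncient("/data/geth", "chaindata", "/mnt/ancient"))
}
```

Nothing here touches the database files, so a directory holding only a few index files is enough for the lookup to succeed.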
This PR does a few things. It fixes a shutdown-order flaw in the chainfreezer. Previously, the chain-freezer would shut down the freezer backend first, and then signal for the loop to exit. This can lead to a scenario where the freezer tries to fsync closed files, which is an error condition that could lead to exit via log.Crit. It also makes the printout more detailed when truncating 'dangling' items, by showing the exact number instead of approximate MB. This PR also adds calls to fsync files before closing them, and also makes the `db inspect` command slightly more robust.
This PR does a few things. First of all, it makes the printout more detailed, e.g. by showing the exact number instead of approximate `MB`.
Secondly, it adds a testcase showing a cornercase that can occur during shutdown, if `freezer` is `Close()`d, and only afterwards the chain freezer calls `Sync`. Currently, this would lead to an exit with `CRIT`.
The chain_freezer does writes here: https://github.com/ethereum/go-ethereum/blob/master/core/rawdb/chain_freezer.go#L162 followed by `Sync` here: https://github.com/ethereum/go-ethereum/blob/master/core/rawdb/chain_freezer.go#L170
Between these operations, the underlying `f` may be closed by another routine.
I saw also that the chain freezer does exactly this 'dangerous' sort of shutdown sequence: first it shuts down the underlying database, and only after does it signal to the active goroutine to exit. This PR fixes this. BUT: I suspect that the same sequence may happen regardless, depending on the shutdown sequence. I haven't investigated that in full.
However, this PR adds a failing test. I am not sure what the best fix is. Should we remove the test? Should we expect `Sync after Close` not to yield an error?