SIGSEGV from eth/downloader/downloader.go #22714
Can you provide a bit more log?
Do you mean from before the crash? Because this is all there is after. OK, this is it from the last run:
Interesting:
So we somehow have the genesis in there, but the genesis has
This is what one gets starting a clean geth from scratch:
So something, somewhere, deleted the
So what should I do? Ditch all 130GB and start again, or is there something else I can do?
@pepa65 It seems your node hit a bug in go-ethereum core, and we find it valuable for debugging. If you just want to get on with life now, you should resync the node. Unfortunately, the problem cannot be fixed immediately.
I've restarted it one more time without
Do you have enough information to look into things??
Now running with:
It would be interesting to see the startup portion of the logs, to check whether it somehow got the TD back for genesis or not. This line
After hours of running, a crash still occurred with
These are all the lines from an earlier run:
FYI, without
Interesting:
That is the genesis. So previously, it was
If you execute
For me, it spits out
Which is
And
You will need to shut it down first.
It had just crashed anyway.
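For reference, the total difficulty being queried here is stored in leveldb under a key derived from the block number and the header hash. Below is a minimal sketch of reading the genesis TD directly with goleveldb; the key layout ('h' + 8-byte big-endian block number + hash + 't', with an RLP-encoded big.Int value) mirrors core/rawdb/schema.go, but the standalone program and the chaindata path are illustrative only, not an existing geth tool, and the node has to be stopped before opening the database like this.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"math/big"

	"github.com/ethereum/go-ethereum/rlp"
	"github.com/syndtr/goleveldb/leveldb"
)

func main() {
	// Illustrative path, derived from --datadir=/data/Geth; adjust as needed.
	db, err := leveldb.OpenFile("/data/Geth/geth/chaindata", nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Mainnet genesis hash.
	hash, _ := hex.DecodeString("d4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3")

	// TD key per core/rawdb/schema.go: 'h' + block number (8 bytes big-endian,
	// all zero for the genesis block) + header hash + 't'.
	key := append([]byte("h"), make([]byte, 8)...)
	key = append(key, hash...)
	key = append(key, 't')

	val, err := db.Get(key, nil)
	if err != nil {
		panic(err) // leveldb.ErrNotFound here would mean the TD record is gone
	}
	td := new(big.Int)
	if err := rlp.DecodeBytes(val, td); err != nil {
		panic(err)
	}
	fmt.Println("genesis TD:", td) // mainnet genesis TD is 17179869184
}
```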
The mystery deepens. So, the correct TD is indeed in leveldb. Could you please go to |
(this time it doesn't matter if the node is running or not) |
Interesting. Here's how it should look:
If I display it with different cols, the diffs are as follows:
Your
None of that looks even remotely like it should (rlp-encoded
So, actually, your data is not aligned on 6 bytes per entry; if I rearrange the column width a bit, it's easy to see:
It's actually
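The 6-byte alignment mentioned above comes from the freezer's index-file format: per core/rawdb/freezer_table.go, each index entry is a 2-byte big-endian data-file number followed by a 4-byte big-endian offset. A rough sketch of a decoder for such an index file (the program is illustrative, not an existing geth tool, and the file you point it at depends on your ancient directory):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: idxdump <freezer index file>")
		os.Exit(1)
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if rem := len(data) % 6; rem != 0 {
		fmt.Printf("warning: %d trailing bytes, index is not 6-byte aligned\n", rem)
	}
	for i := 0; i+6 <= len(data); i += 6 {
		// Each entry: 2-byte file number, then the 4-byte end offset of the
		// item's data within that file.
		filenum := binary.BigEndian.Uint16(data[i : i+2])
		offset := binary.BigEndian.Uint32(data[i+2 : i+6])
		fmt.Printf("entry %6d: file %4d, offset %10d\n", i/6, filenum, offset)
	}
}
```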
This is also how it should be -- but a lot later:
So, as far as I can tell, you're missing
This is a really interesting case of data corruption. We will investigate what could possibly be the cause for something like that to happen. Some follow-up questions:
I have a suspicion that what's causing this is a race between
At least, that's the only path I could find where we actually open files in truncation mode, so it's the most likely culprit IMO. A rudimentary fix would be something like below. I'll try to trigger this behaviour via some testcase or custom binary (maybe a fuzzer), so we have something to test against.

```diff
diff --git a/core/rawdb/freezer_table.go b/core/rawdb/freezer_table.go
index b614c10d37..13b2c79903 100644
--- a/core/rawdb/freezer_table.go
+++ b/core/rawdb/freezer_table.go
@@ -465,34 +465,30 @@ func (t *freezerTable) releaseFilesAfter(num uint32, remove bool) {
 // Note, this method will *not* flush any data to disk so be sure to explicitly
 // fsync before irreversibly deleting data from the database.
 func (t *freezerTable) Append(item uint64, blob []byte) error {
+	// Encode the blob before the lock portion
+	if !t.noCompression {
+		blob = snappy.Encode(nil, blob)
+	}
 	// Read lock prevents competition with truncate
-	t.lock.RLock()
+	t.lock.Lock()
+	defer t.lock.Unlock()
 	// Ensure the table is still accessible
 	if t.index == nil || t.head == nil {
-		t.lock.RUnlock()
 		return errClosed
 	}
 	// Ensure only the next item can be written, nothing else
 	if atomic.LoadUint64(&t.items) != item {
-		t.lock.RUnlock()
 		return fmt.Errorf("appending unexpected item: want %d, have %d", t.items, item)
 	}
-	// Encode the blob and write it into the data file
-	if !t.noCompression {
-		blob = snappy.Encode(nil, blob)
-	}
 	bLen := uint32(len(blob))
 	if t.headBytes+bLen < bLen ||
 		t.headBytes+bLen > t.maxFileSize {
 		// we need a new file, writing would overflow
-		t.lock.RUnlock()
-		t.lock.Lock()
 		nextID := atomic.LoadUint32(&t.headId) + 1
 		// We open the next file in truncated mode -- if this file already
 		// exists, we need to start over from scratch on it
 		newHead, err := t.openFile(nextID, openFreezerFileTruncated)
 		if err != nil {
-			t.lock.Unlock()
 			return err
 		}
 		// Close old file, and reopen in RDONLY mode
@@ -503,11 +499,7 @@ func (t *freezerTable) Append(item uint64, blob []byte) error {
 		t.head = newHead
 		atomic.StoreUint32(&t.headBytes, 0)
 		atomic.StoreUint32(&t.headId, nextID)
-		t.lock.Unlock()
-		t.lock.RLock()
 	}
-
-	defer t.lock.RUnlock()
 	if _, err := t.head.Write(blob); err != nil {
 		return err
 	}
```
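One side note on the sketch above: since it takes the full write lock for the whole append, the snappy encoding is hoisted out in front of the locked section ("Encode the blob before the lock portion"), presumably to keep the now-exclusive critical section as short as possible.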
A simpler (and more performant) fix would be to just re-check
after we obtain the write lock. But I think we need to have a stable repro in order to verify any fix.
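For illustration, the "re-check once the write lock is held" idea looks roughly like the sketch below; the type and method names are hypothetical stand-ins, not the actual freezerTable code.

```go
package main

import "sync"

// table is a hypothetical stand-in for the freezer table.
type table struct {
	lock      sync.RWMutex
	headBytes uint32
	maxSize   uint32
}

// rotateHead stands in for opening the next data file in truncated mode.
func (t *table) rotateHead() { t.headBytes = 0 }

func (t *table) append(blobLen uint32) {
	t.lock.RLock()
	if t.headBytes+blobLen > t.maxSize {
		// Upgrade to the write lock before rotating the head file.
		t.lock.RUnlock()
		t.lock.Lock()
		// Re-check under the write lock: another writer may already have
		// rotated while no lock was held, and truncating the next file a
		// second time would wipe whatever was written in the meantime.
		if t.headBytes+blobLen > t.maxSize {
			t.rotateHead()
		}
		t.lock.Unlock()
		t.lock.RLock()
	}
	// ... write the blob to the head file ...
	t.headBytes += blobLen // the real code updates this field atomically
	t.lock.RUnlock()
}

func main() {
	t := &table{maxSize: 16}
	t.append(10)
	t.append(10) // 10+10 > 16, so this one rotates first
}
```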
Hope you'll be able to improve the software. But how can it be that this only manifests on my machine/setup??
I think it's just something that occurs very, very rarely, and you've hit it. Once it happens, though, the root problem causes spurious failures from time to time, not every time, just once in a while. I would like to explore your ancient data a bit further, but we might need some new tooling for the analysis. As @fjl said, if you want to just continue and get a node synced, you can wipe the datadir and do a fresh sync, but if you can wait a few more days (or back up the ancient folder), we can do some more analysis on it, and that would help us.
I'm happy to help; this is not in production ;-). I'll keep this running unless it needs to be interrupted for explorations. My ancient directory is 166GB right now, so it's impractical to transfer.
Thanks for your help. I was able to create a repro testcase of the behaviour, and I don't think we need your data any longer, since the root cause is identified.
So bad data somehow got written? OK, I will wait for a new binary and start over. |
@pepa65 The answer to your question is: yes, bad data was written because of a bug in the code. We will create a new release in the next couple of days.
I have the same problem as you. Has your problem been solved?
@miaoqinyang I started from scratch and have been running ever since, like most people. If you build from master yourself, you will get the fix. But since you already have bad data, I think you need to start over, because you will probably hit this again and again with your corrupt data. If you use a release, it is very rare to hit this again (but theoretically possible).
Geth version: 1.10.2-stable, go1.16
OS & Version: Linux Mint 20 Ulyana
Commit hash: 97d11b0
Date: 20210408
It had been running for a number of days, but now it has crashed 3 times in the same day.
Run with:
--datadir=/data/Geth --cache=512 --lightkdf --ws
Output: