
[core] Race condition on diffLayer #22540

Merged 1 commit into ethereum:master on Apr 6, 2021

Conversation

fxfactorial (Contributor)

I encountered this race condition. It happens at difflayer.go:223, where dl.origin = origin is assigned, while Storage reads dl.origin at return dl.origin.Storage(accountHash, storageHash) elsewhere in difflayer.go.

@holiman (Contributor) commented Mar 21, 2021

I encountered this race condition

Could you provide some more info on how you encountered it? Do you have a stack trace? Was it during a test?

Your change does two things: turning an RLock into a Lock, and changing the scope. I don't see the need for either of them, since 1) both storage and Storage internally obtain the RLock, and 2) the internals are not modified, so a read lock should suffice.

So any more info about how you encountered this would likely clear this up for me.
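
For context, a minimal sketch in plain Go (not go-ethereum code, hypothetical names) of the locking semantics under discussion: a writer holding Lock is properly ordered against readers holding RLock, so a read lock is sufficient as long as every access to the shared field actually happens while the lock is held.

	// Minimal, self-contained sketch: both the writer and the reader go
	// through the RWMutex, so there is no data race here.
	package main

	import "sync"

	type layer struct {
		lock  sync.RWMutex
		field *int // stands in for a shared pointer such as dl.origin
	}

	// set mirrors a writer like rebloom: it takes the exclusive lock.
	func (l *layer) set(v *int) {
		l.lock.Lock()
		defer l.lock.Unlock()
		l.field = v
	}

	// get mirrors a reader: RLock suffices because the write side uses Lock
	// and the read itself happens while the lock is held.
	func (l *layer) get() *int {
		l.lock.RLock()
		defer l.lock.RUnlock()
		return l.field
	}

	func main() {
		l := &layer{}
		var wg sync.WaitGroup
		wg.Add(2)
		go func() { defer wg.Done(); v := 1; l.set(&v) }()
		go func() { defer wg.Done(); _ = l.get() }()
		wg.Wait()
	}

The race reported in this PR comes from a read of dl.origin that happens outside the lock, not from RLock itself being too weak; the eventual fix keeps that read inside the read-locked section.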

@fxfactorial (Contributor, Author)

@holiman Sorry - it was late and I was too brief.

I found it the usual way, with -race turned on: one thread hit a call to rebloom (https://github.com/ethereum/go-ethereum/blob/70a8d2cbacae6378e0da73097035c27a8114672f/core/state/snapshot/difflayer.go#L223), which sets the .origin field, while another thread hit the call to .Storage, which reads the .origin field.

So

  1. The lock doesn't protect the use of the .origin field.
  2. A read lock isn't enough, since rebloom will set it.

I'll try to look through tmux history for the -race stack traces
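
In lieu of the original stack traces, here is a small self-contained program (hypothetical, trimmed-down types, not the real difflayer.go) that reproduces the same pattern: one goroutine reassigns the origin pointer under the write lock while another reads it without taking the lock, which go run -race should flag.

	// Sketch of the racy pattern: the write is locked, the read is not.
	package main

	import (
		"sync"
		"time"
	)

	type diskLayer struct{}

	type diffLayer struct {
		lock   sync.RWMutex
		origin *diskLayer
	}

	// rebloom reassigns origin under the write lock, like difflayer.go:223.
	func (dl *diffLayer) rebloom(origin *diskLayer) {
		dl.lock.Lock()
		defer dl.lock.Unlock()
		dl.origin = origin
	}

	// Storage reads dl.origin without holding the lock, mirroring the faulty path.
	func (dl *diffLayer) Storage() *diskLayer {
		return dl.origin // unsynchronized read
	}

	func main() {
		dl := &diffLayer{origin: &diskLayer{}}
		go dl.rebloom(&diskLayer{})       // concurrent writer
		_ = dl.Storage()                  // concurrent reader
		time.Sleep(10 * time.Millisecond) // let the writer run; -race should report a race on dl.origin
	}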

@karalabe (Member)

The PR is definitely problematic because it serializes reads in the snapshots, and even keeps the lock held during disk access.

If the underlying issue is the .origin field, we can work around that by extracting it while still holding the read lock:

	// Check the bloom filter first whether there's even a point in reaching into
	// all the maps in all the layers below
	dl.lock.RLock()
	hit := dl.diffed.Contains(storageBloomHasher{accountHash, storageHash})
	if !hit {
		hit = dl.diffed.Contains(destructBloomHasher(accountHash))
	}
	var origin *diskLayer
	if !hit {
		origin = dl.origin // extract origin while holding the lock
	}
	dl.lock.RUnlock()

	// If the bloom filter misses, don't even bother with traversing the memory
	// diff layers, reach straight into the bottom persistent disk layer
	if origin != nil {
		snapshotBloomStorageMissMeter.Mark(1)
		return origin.Storage(accountHash, storageHash)
	}
	// The bloom filter hit, start poking in the internal maps
	return dl.storage(accountHash, storageHash, 0)

Would this solve the issue @fxfactorial?

@karalabe (Member) commented Mar 22, 2021

Though I guess we'd need to look through the code now, because the account and other accessors use the same pattern as the faulty storage above.

We definitely need the same fix in AccountRLP too.
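
For reference, a sketch of how the same extraction could look in AccountRLP, assuming its body mirrors the Storage accessor above (the accountBloomHasher, destructBloomHasher and snapshotBloomAccountMissMeter names are carried over from the surrounding snapshot code and may differ; this is not the merged diff):

	// AccountRLP with the origin pointer extracted while the read lock is held.
	func (dl *diffLayer) AccountRLP(hash common.Hash) ([]byte, error) {
		// Check the bloom filter first whether there's even a point in reaching into
		// all the maps in all the layers below
		dl.lock.RLock()
		hit := dl.diffed.Contains(accountBloomHasher(hash))
		if !hit {
			hit = dl.diffed.Contains(destructBloomHasher(hash))
		}
		var origin *diskLayer
		if !hit {
			origin = dl.origin // extract origin while holding the lock
		}
		dl.lock.RUnlock()

		// If the bloom filter misses, don't even bother with traversing the memory
		// diff layers, reach straight into the bottom persistent disk layer
		if origin != nil {
			snapshotBloomAccountMissMeter.Mark(1)
			return origin.AccountRLP(hash)
		}
		// The bloom filter hit, start poking in the internal maps
		return dl.accountRLP(hash, 0)
	}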

@karalabe (Member)

I think that would suffice. There are 1-2 more accesses into .origin via parent.origin paths, but those are snapshot mutation operations, and I think we only ever mutate in a serialized fashion.

@fxfactorial (Contributor, Author)

> Would this solve the issue @fxfactorial?

Yes - I think so. I can clean up the other spots as well if you like (let me know where to look; I see you mentioned some places).

@holiman (Contributor) commented Mar 29, 2021

@fxfactorial Do you want to fix this? It would be nice to get it merged.
I think you can leave the other spots out, unless you find something that looks suspicious to you.

@fxfactorial (Contributor, Author)

@holiman Force pushed - covered the AccountRLP method as well.

@holiman (Contributor) left a comment

LGTM, but please remove the iterate.sh, probably an accidental addition :)

@holiman (Contributor) left a comment

LGTM, thanks!

@fjl (Contributor) commented Mar 30, 2021

@karalabe Please merge if you think this fix is OK.

@fxfactorial (Contributor, Author)

@karalabe ping - anything else needed for the merge?

@karalabe (Member) left a comment

SGTM

@karalabe karalabe added this to the 1.10.2 milestone Apr 6, 2021
@karalabe karalabe merged commit c79fc20 into ethereum:master Apr 6, 2021
@fxfactorial fxfactorial deleted the snapshot-race branch April 6, 2021 11:45
atif-konasl pushed a commit to frozeman/pandora-execution-engine that referenced this pull request Oct 15, 2021
tony-ricciardi pushed a commit to tony-ricciardi/go-ethereum that referenced this pull request Jan 20, 2022
Cherry-pick bug fixes from upstream for snapshots, which will enable higher transaction throughput. This also enables snapshots by default (one of the commits pulled from upstream).

Upstream commits included:

68754f3 cmd/utils: grant snapshot cache to trie if disabled (ethereum#21416)
3ee91b9 core/state/snapshot: reduce disk layer depth during generation
a15d71a core/state/snapshot: stop generator if it hits missing trie nodes (ethereum#21649)
43c278c core/state: disable snapshot iteration if it's not fully constructed (ethereum#21682)
b63e3c3 core: improve snapshot journal recovery (ethereum#21594)
e640267 core/state/snapshot: fix journal recovery from generating old journal (ethereum#21775)
7b7b327 core/state/snapshot: update generator marker in sync with flushes
167ff56 core/state/snapshot: gethring -> gathering typo (ethereum#22104)
d2e1b17 snapshot, trie: fixed typos, mostly in snapshot pkg (ethereum#22133)
c4deebb core/state/snapshot: add generation logs to storage too
5e9f5ca core/state/snapshot: write snapshot generator in batch (ethereum#22163)
18145ad core/state: maintain one more diff layer (ethereum#21730)
04a7226 snapshot: merge loops for better performance (ethereum#22160)
994cdc6 cmd/utils: enable snapshots by default
9ec3329 core/state/snapshot: ensure Cap retains a min number of layers
52e5c38 core/state: copy the snap when copying the state (ethereum#22340)
a31f6d5 core/state/snapshot: fix panic on missing parent
61ff3e8 core/state/snapshot, ethdb: track deletions more accurately (ethereum#22582)
c79fc20 core/state/snapshot: fix data race in diff layer (ethereum#22540)

Other changes
Commit f9b5530 (not from upstream) fixes an incorrect default DatabaseCache value due to an earlier bad merge.

Tested
  - Automated tests
  - Testing on a private testnet

Backwards compatibility
Enabling snapshots by default is a breaking change in terms of CLI flags, but it will not cause incompatibility between this node and other nodes.

Co-authored-by: Péter Szilágyi <peterke@gmail.com>
Co-authored-by: gary rong <garyrong0905@gmail.com>
Co-authored-by: Melvin Junhee Woo <melvin.woo@groundx.xyz>
Co-authored-by: Martin Holst Swende <martin@swende.se>
Co-authored-by: Edgar Aroutiounian <edgar.factorial@gmail.com>