perf: Address performance of EthGetTransactionCount #10700
Conversation
I can confirm that this fixes the performance issue (or, at least, we don't get 10s pauses).
force-pushed from d63ddec to 61cbb2c
chain/messagepool/messagepool.go
Outdated
@@ -371,7 +371,7 @@ func (ms *msgSet) toSlice() []*types.SignedMessage {
 func New(ctx context.Context, api Provider, ds dtypes.MetadataDS, us stmgr.UpgradeSchedule, netName dtypes.NetworkName, j journal.Journal) (*MessagePool, error) {
 	cache, _ := lru.New2Q[cid.Cid, crypto.Signature](build.BlsSignatureCacheSize)
 	verifcache, _ := lru.New2Q[string, struct{}](build.VerifSigCacheSize)
-	noncecache, _ := lru.New[nonceCacheKey, uint64](256)
+	noncecache, _ := lru.New[nonceCacheKey, uint64](32768) // 32k * ~200 bytes = 6MB
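For context, a minimal sketch of the read-through pattern such a cache enables, assuming nonceCacheKey pairs a tipset key with an address (the ~200-byte estimate in the comment suggests something like this); the computeNonce helper and the exact key layout are assumptions for illustration, not the actual Lotus code:

// Sketch only; lives conceptually in the messagepool package.
type nonceCacheKey struct {
	tsk  types.TipSetKey
	addr address.Address
}

func (mp *MessagePool) cachedNonce(addr address.Address, ts *types.TipSet) (uint64, error) {
	key := nonceCacheKey{tsk: ts.Key(), addr: addr}
	if n, ok := mp.nonceCache.Get(key); ok {
		return n, nil // hit: skip the expensive state lookup entirely
	}
	n, err := computeNonce(addr, ts) // hypothetical expensive path
	if err != nil {
		return 0, err
	}
	mp.nonceCache.Add(key, n)
	return n, nil
}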
Does it make sense to impose this memory tax on every node, when only hosted RPC nodes would truly benefit as they are public deployments? Besides that, I'm not convinced this increase is useful since the cache will be practically stale every 30 seconds? (The most popular usage pattern is getting the nonce of the pending tipset)
@@ -58,6 +60,23 @@ func (mpp *mpoolProvider) IsLite() bool {
 	return mpp.lite != nil
 }
+
+func (mpp *mpoolProvider) getActorLite(addr address.Address, ts *types.TipSet) (*types.Actor, error) {
It's a bit easier for me to follow if refactoring like this (this only extracts the function without any functional changes, right?) is done in a separate commit.
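For readers following along, a rough sketch of what the extracted helper might look like, assuming the lite provider exposes actor and nonce lookups keyed by tipset (GetActor/GetNonce are assumed method names here, not verified against the Lotus interfaces):

func (mpp *mpoolProvider) getActorLite(addr address.Address, ts *types.TipSet) (*types.Actor, error) {
	if !mpp.IsLite() {
		return nil, errors.New("getActorLite called on a non-lite provider")
	}
	// A lite node does not compute state locally, so resolve the actor
	// and its nonce through the lite gateway instead.
	act, err := mpp.lite.GetActor(context.TODO(), addr, ts.Key())
	if err != nil {
		return nil, xerrors.Errorf("getting actor over lite: %w", err)
	}
	nonce, err := mpp.lite.GetNonce(context.TODO(), addr, ts.Key())
	if err != nil {
		return nil, xerrors.Errorf("getting nonce over lite: %w", err)
	}
	act.Nonce = nonce
	return act, nil
}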
One more hotspot down! 💪
Bumped from 256 to 32k entries, which should be about 6MB of cached entries given an average nonceCacheKey size of ~200 bytes.
force-pushed from 61cbb2c to 553da39
I agree with https://github.com/filecoin-project/lotus/pull/10700/files#r1173643973, but it's not a high priority.
We can definitely optimize this more as discussed (e.g., iterating over messages once per tipset and caching the results) but this fixes the main performance issue.
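As a hedged sketch of that follow-up idea (names are illustrative, not the actual Lotus code): scan the tipset's messages once, cache the per-sender counts keyed by tipset, and answer subsequent nonce queries from the map instead of re-iterating.

// countSendersOnce builds a per-address message count for one tipset; a
// real implementation would cache this map keyed by ts.Key().
func countSendersOnce(msgs []*types.SignedMessage) map[address.Address]uint64 {
	counts := make(map[address.Address]uint64, len(msgs))
	for _, m := range msgs {
		counts[m.Message.From]++
	}
	return counts
}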
@raulk Updated test plan with results from concurrent stress test, overall looks good!
Fixes: #10538
Context
We have observed that EthGetTransactionCount is one of the hotspots on Glif production nodes, and we are seeing regular 10-20 second latencies when calling this RPC method. We have also been able to replicate this issue on local nodes on our devboxes on mainnet when calling this method repeatedly.
The fix
I tracked the high latency spikes and they correlated with runs of ExecuteTipSet while following the chain.
To address this, we should not rely on tipset computation to get the nonce, and instead look at the parent tipset and then count the messages sent from the addr, as suggested by steb.
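A minimal sketch of that approach (helper names like parentState and tipsetMessages are placeholders, not the actual Lotus API): read the actor's nonce from the already-computed parent state, then advance it once per message the address included in the tipset itself.

func nonceFromParent(ctx context.Context, addr address.Address, ts *types.TipSet) (uint64, error) {
	st, err := parentState(ctx, ts) // parent state is already computed while following the chain
	if err != nil {
		return 0, err
	}
	act, err := st.GetActor(addr)
	if err != nil {
		return 0, err
	}
	nonce := act.Nonce
	msgs, err := tipsetMessages(ctx, ts) // placeholder: all messages included in ts
	if err != nil {
		return 0, err
	}
	for _, m := range msgs {
		if m.From == addr {
			nonce++ // each included message from addr bumps the expected nonce
		}
	}
	return nonce, nil
}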
Test plan
Sequential stress test
I started and synced my lotus node on mainnet and started calling the EthGetTransactionCount RPC method in a loop. The latency spikes seemed to be all gone, which I then confirmed by looking at http://localhost:1234/debug/metrics for the getStateNonce metric (getnonce_ms).
Of almost 35k calls, all were within 8ms (before this fix, there were often outliers in the 5+ second range).
Concurrent stress test
Same setup as with the sequential stress test, except that now I called EthGetTransactionCount using ab in order to test calling this method concurrently. This was the setup I used:
During this benchmark there was a single request taking 100ms, but otherwise it looks good: