Attestation Queue Full due to insufficient resources #4953
Comments
Got a similar problem. The RAM usage of my lighthouse keeps increasing and then falling back to normal sharply. Is this related to issue #4918? During this period, the following error information is recorded in the log. (I've only cited a few logs. Info-level logs are no different. I'm happy to provide complete logs if necessary.)
@SimonSMH1015 Your issue seems like it's likely related to #4918. The original issue could be too, although I also noticed that @tobidae-cb's node was affected by something like #4773.
Yep. As Michael says, @SimonSMH1015, your issue is likely related to #4918. @tobidae-cb, the errors you are seeing are indicative of insufficient compute. The queues that are filling up there are not related to the network queues. Essentially your computer cannot process (usually signature verification) fast enough: you are getting more messages from the network than your computer can process. This usually happens on under-resourced nodes, and I find it strange that a computer with 40 vCPUs would struggle here. Perhaps there's significant compute used by the `--slots-per-restore-point` setting beyond the db? Or perhaps IO is lacking and we are blocking on db writes, causing these queues to build up and then spike the CPUs? It might be IO limited.
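For illustration, here's a minimal Rust sketch (not Lighthouse's actual code) of the failure mode being described: a bounded queue whose consumer can't keep up drops new work instead of blocking. The capacity of 16,384 is an arbitrary placeholder.

```rust
use std::sync::mpsc;

fn main() {
    // Bounded channel standing in for an attestation work queue.
    let (tx, rx) = mpsc::sync_channel::<u64>(16_384);
    let mut dropped = 0u64;

    // A burst arrives faster than anything is consumed: try_send fails
    // immediately once the queue is full, so the excess is "dropped",
    // which is roughly what a "queue full" error reports.
    for att in 0..100_000u64 {
        if tx.try_send(att).is_err() {
            dropped += 1;
        }
    }

    drop(tx);
    let queued = rx.iter().count();
    println!("queued {queued}, dropped {dropped}");
}
```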
Based on the DB stats I saw from @tobidae-cb on Discord, I think their node is affected by this bug which causes the database to grow ridiculously: #4773. Basically the database expects an invariant about blocks to hold, and if it fails, it stops being able to prune. The fix I'm thinking of applying is to just make the database log a warning and continue, as I can't see any root cause if the DB & filesystem are both upholding their guarantees. Related to that, @tobidae-cb can you confirm what type of storage you're using? i.e. filesystem, whether there is RAID, whether you're using any virtualisation/network drives.
Hi @michaelsproul, sorry for the delay, here are the system datapoints you requested. Compute Config
Related to DB and latency: are there any optimizations that can be made to the archival configuration to speed up those responses?
Are there no other differences between the archive node and the full node?

Lighthouse has a queueing system for HTTP API requests, which means that non-essential requests (like […]) can be queued behind higher-priority work. You can opt out of that particular executor behaviour with […].

Our long-term plan is to make all the historic requests faster (see this alpha release for a preview: https://github.com/sigp/lighthouse/releases/tag/v4.5.444-exp). That […]
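As an aside, here is a rough sketch of the kind of prioritisation being described, with the queue layout and work names invented for illustration (this is not Lighthouse's actual executor):

```rust
use std::collections::VecDeque;

#[derive(Debug)]
enum Work {
    Consensus(&'static str),
    HttpApi(&'static str),
}

fn main() {
    let mut high: VecDeque<Work> = VecDeque::new();
    let mut low: VecDeque<Work> = VecDeque::new();

    // Interleaved arrivals: consensus work and a historic API request.
    high.push_back(Work::Consensus("verify attestation"));
    low.push_back(Work::HttpApi("historic state request"));
    high.push_back(Work::Consensus("import block"));

    // Workers drain the high-priority queue before touching the low one,
    // so non-essential requests are delayed while the node is busy.
    while let Some(work) = high.pop_front().or_else(|| low.pop_front()) {
        println!("processing {work:?}");
    }
}
```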
@tobidae-cb Out of curiosity, how many validator keys do you have configured for that node? I'm experiencing this issue too, at a rate of about 30 per day, so I created a PR that adds a Prometheus metric for the "overloaded" part.
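For reference, a sketch of what such a metric could look like using the `prometheus` and `once_cell` crates; the metric name and registration style here are illustrative guesses, not necessarily what the PR does:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter, IntCounter};

// Hypothetical counter, bumped whenever work is dropped because a
// beacon processor queue is full ("overloaded").
static OVERLOADED_TOTAL: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!(
        "beacon_processor_overloaded_total",
        "Work events dropped because a beacon processor queue was full"
    )
    .expect("metric registers")
});

fn on_work_dropped() {
    OVERLOADED_TOTAL.inc();
}

fn main() {
    on_work_dropped();
    println!("overloaded events so far: {}", OVERLOADED_TOTAL.get());
}
```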
We have a dashboard for tracking these queues. It might be handy for getting a better understanding of why the queues are filling up.
I believe I have some evidence showing it's not the machine that's too slow, but simply that the burst of incoming attestations is so big it overflows the […]. To narrow things down, I'm only talking about […] here.

First, the queue gets filled up in very sharp spikes that quickly go down, indicating items are processed very quickly and do not sit in the queue for long (which would be the case if processing itself were slow). To check I'm not misinterpreting the metrics: it's a gauge and it's updated on any reprocessing queue operation, so very often. What we're charting is […].

Second, that particular node is capable of processing at least 125k unaggregated attestations per minute, and during the spikes (let's take the one around midnight) it processes much less than that.

Third, there are spare CPU resources around the spikes.

Fourth, the issue author's machine is an absolute monster and it's hard to even saturate such hardware with simple code, let alone something as complex as a beacon node.

It'd be interesting to see how many blocks were received around the time of a queue overflow, but it looks like the […]. Would it make sense to make […]?

EDIT: This node is running with […].
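A back-of-the-envelope sketch of the burst argument above; only the 125k-per-minute figure comes from the comment, while the queue capacity and burst rate are made-up placeholders:

```rust
fn main() {
    // Sustained processing rate derived from the observed 125k attestations/minute.
    let processing_per_sec: f64 = 125_000.0 / 60.0; // ~2,083/s

    // Hypothetical values: a bounded queue and a short, sharp burst of arrivals.
    let capacity: f64 = 16_384.0;
    let burst_per_sec: f64 = 10_000.0;

    // If arrivals outpace processing, the queue grows at the difference and
    // overflows even though average utilisation over the minute is low.
    let net_growth = burst_per_sec - processing_per_sec;
    let seconds_to_overflow = capacity / net_growth;
    println!("queue overflows after ~{seconds_to_overflow:.1}s of such a burst");
}
```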
It could also be that the attestation misses are caused by late blocks, which also cause delay queue saturation, i.e. the two things have a common cause. I'm afk today but can investigate more tomorrow.
@AgeManning has pointed out that […]
@adaszko The size of the queue is actually already increased in 5.2.x, by this PR: […]. Even so, we do see some queues filling up on our Holesky nodes that are subscribed to all subnets, and there is more work to be done optimising the beacon processor so that it can more efficiently utilise available hardware. Something I got @remyroy to try recently was bumping the number of workers beyond the CPU count, so that the OS scheduler can play more of a role in the efficient scheduling of work. The (hidden) flag to adjust the number of workers is […].
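A minimal sketch of that oversubscription idea: spawn more worker threads than detected CPUs and let the OS scheduler interleave them. The factor of 2 is arbitrary and this is not how Lighthouse's beacon processor is actually structured:

```rust
use std::thread;

// Arbitrary oversubscription factor for illustration.
const OVERSUBSCRIPTION: usize = 2;

fn main() {
    let cpus = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let workers = cpus * OVERSUBSCRIPTION;
    println!("spawning {workers} workers on {cpus} CPUs");

    let handles: Vec<_> = (0..workers)
        .map(|i| {
            thread::spawn(move || {
                // Placeholder for queue-draining work (e.g. signature verification).
                let _ = i;
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```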
Interesting. I was looking at the stable branch (5.2.0).
We're already committed to removing […]
We sort of saw this behavior again, but this time on a full node running on an i3en.12xlarge AWS machine. Interestingly enough, during the same period, nodes on an i4i.8xlarge machine didn't see the same issue (same config). One thing to note is that we're running 5.2.0. We'll update to 5.2.1 shortly.
Description
Our lighthouse container running on Kubernetes keeps having intermittent bursts of the Attestation Queue Full error. I initially tried bumping the compute to 40 vCPUs and 200Gi of RAM on an i3en.24xlarge instance, but that's already starting to fail.
The issue clears up after a while, but as it keeps coming up it's affecting the smooth operation of our nodes. I'll provide additional details like logs below.
The lighthouse configuration is meant to act as an archival consensus deployment, so we have `--slots-per-restore-point=32` set as a beacon node flag.
Version
441fc1691b69f9edc4bbdc6665f3efab16265c9b
Present Behaviour
At a random time, our node will be syncing, then stop displaying the `New block received` message. After roughly 2 minutes, the `Aggregate attestation queue full` message starts displaying.
About 3 minutes before this happens, there is a spike in the amount of CPU being used, from 3 vCPUs to 55 before settling at 40 vCPUs. The RAM usage also spikes from 7GiB to 50GiB (it went up to 84GiB but settled at 50).
While writing this message the issue self-resolved; I'm guessing once the queue was cleared.
LOGS
Configuration
Compute Config
Expected Behaviour
Lighthouse should not be having this issue, given that we're currently not making any API calls to the beacon endpoints; only geth is interacting with the lighthouse deployment.
Steps to resolve