Network worker extremely busy / high event loop lag / high I/O lag #5604
@dapplion I have come up with two strategies to poll the worker to generate profiles.

The first is perf record (https://man7.org/linux/man-pages/man1/perf-record.1.html). This is what I found on one of the boxes running Lodestar:

admin 1814305 212 10.4 209601852 6904800 ? Ssl Jun01 10330:54 node ./packages/cli/bin/lodestar beacon --rcConfig /data/rcconfig.yml
admin 1814731 2.8 0.5 54947096 342896 ? Ssl Jun01 139:35 node ./packages/cli/bin/lodestar validator --rcConfig /data/rcconfig.yml

admin@beta:~/beacon$ ls /proc/1814305/task
1814305 1814318 1814319 1814320 1814321 1814322 1814323 1814325 1814326 1814327 1814328 1814335 1814338 1814339 1814340 1814341 1814342 1814343 1814344 1814345 1814346 1814347 1814348 1814349 1814352
admin@beta:~/beacon$ ls /proc/1814731/task
1814731 1814743 1814744 1814745 1814746 1814747 1814748 1814749 1814750 1814751 1814752 1814771

The second is a native library that plugs into the Node inspector protocol and can generate profiles for the worker thread: https://github.com/hyj1991/v8-profiler-next. It will likely not show native system calls (like network), just like the profiler currently, but I will attempt to plug it in and see what happens.

I suppose there is a third option, but it is really just an extension of the first option to target the correct thread ids.

tl;dr: just need to figure out the right tids to poll, but getting closer...
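For reference, a minimal sketch of the second strategy, assuming v8-profiler-next's documented setGenerateType/startProfiling/stopProfiling API (the exact surface may differ by version); it would run inside the network worker thread itself, and the output file name and duration here are illustrative:

import * as fs from "node:fs";
// v8-profiler-next ships without type definitions, so it is loaded via require here.
// eslint-disable-next-line @typescript-eslint/no-var-requires
const v8Profiler = require("v8-profiler-next");

// Capture a CPU profile of the current thread for `durationMs` and write it to disk.
export function profileWorker(durationMs = 30_000, title = "network-worker"): void {
  // Generate the newer tree-format .cpuprofile so Chrome DevTools can open it directly.
  v8Profiler.setGenerateType(1);
  v8Profiler.startProfiling(title, true);

  setTimeout(() => {
    const profile = v8Profiler.stopProfiling(title);
    profile.export((err: Error | null, result: string) => {
      if (!err) fs.writeFileSync(`${title}.cpuprofile`, result);
      profile.delete();
    });
  }, durationMs);
}

As noted above, this captures only JS frames via the inspector protocol, so native networking work would still be invisible.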
Forwarded from Discord on behalf of @dapplion (the process will be included in the Lodestar docs at a future date). The data was collected via […]

The flamegraph was created via cargo install inferno:

cat perf.out | inferno-collapse-perf --pid --tid --all | inferno-flamegraph > out.svg

and made more readable by removing […]; the resulting flamegraphs were attached.
Attaching network thread profiles from #5628: 0610_network_thread_lg1k.zip
Found this in #5604 (comment): when we run libp2p in a separate worker thread in v1.8.0 of Lodestar, it has a lot of small […]. I think it's due to more […].
Yeah, it seems like the network thread has long uninterrupted strands of microtask-queue work.
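As an aside, this kind of lag is visible directly from Node's built-in event loop delay histogram; a generic sketch (not Lodestar's actual metrics code) that could run inside the worker thread:

import { monitorEventLoopDelay } from "node:perf_hooks";

// Samples the event loop delay of the current thread; long uninterrupted microtask
// strands show up as high p99/max values because timers cannot fire in between.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  const toMs = (ns: number): string => (ns / 1e6).toFixed(1);
  console.log(
    `loop delay p50=${toMs(histogram.percentile(50))}ms ` +
      `p99=${toMs(histogram.percentile(99))}ms max=${toMs(histogram.max)}ms`
  );
  histogram.reset();
}, 10_000);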
Making some headway on this after speaking with @bnoordhuis. He gave us some good breadcrumbs to follow, and after pulling some more perf data it's pointing to page faults as a likely cause.

~/beacon$ sudo perf stat -e "alignment-faults,bpf-output,cgroup-switches,context-switches,cpu-clock,cpu-migrations,emulation-faults,major-faults,minor-faults,page-faults,task-clock" -p $(pgrep -f '/usr/src/lodestar/packages/cli/bin/lodestar beacon') sleep 60
Without the worker loop:
Performance counter stats for process id '1090':
0 alignment-faults # 0.000 /sec
0 bpf-output # 0.000 /sec
66,412 cgroup-switches # 447.230 /sec
69,557 context-switches # 468.409 /sec
148,477.87 msec cpu-clock # 2.475 CPUs utilized
2,124 cpu-migrations # 14.303 /sec
0 emulation-faults # 0.000 /sec
91 major-faults # 0.613 /sec
87,603 minor-faults # 589.933 /sec
87,694 page-faults # 590.546 /sec
148,515.02 msec task-clock # 2.475 CPUs utilized
60.001487911 seconds time elapsed
With the worker loop:
Performance counter stats for process id '31161':
0 alignment-faults # 0.000 /sec
0 bpf-output # 0.000 /sec
83,721 cgroup-switches # 443.785 /sec
94,519 context-switches # 501.023 /sec
188,626.29 msec cpu-clock # 3.144 CPUs utilized
3,892 cpu-migrations # 20.631 /sec
0 emulation-faults # 0.000 /sec
52 major-faults # 0.276 /sec
265,900 minor-faults # 1.409 K/sec
265,952 page-faults # 1.410 K/sec
188,677.76 msec task-clock # 3.145 CPUs utilized
60.001568935 seconds time elapsed

The root cause of the page faults is likely GC, but a resolution still needs to be determined. As a first pass to try and resolve the issue I upped the new space size, as there was a considerable amount of scavenge GC happening. The new space was set at 32 MB and 64 MB. The quantity of network thread scavenge leveled off at roughly 18 MB, so setting it slightly higher than the default 16 MB will have some benefit, but it is not the root cause. Event loop lag was measurably down on one instance but basically unchanged on another, so it's mixed. Network traffic was much lower on the better performing node, though. There are a few more corners to look in, and I will circle back and message Ben to let him know what I found with regard to IPC channel queue and messaging lag.
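For context, the new space of the network worker can be raised without touching the main thread by passing resource limits when the worker is created; a minimal sketch, where the file name and the 64 MB value are illustrative rather than Lodestar's actual bootstrap code:

import { Worker } from "node:worker_threads";

// Spawn the worker with a larger young generation (new space), where scavenge GC runs.
// Sized above the ~18 MB plateau observed on the network thread (default is ~16 MB).
const worker = new Worker("./networkWorker.js", {
  resourceLimits: {
    maxYoungGenerationSizeMb: 64,
  },
});

worker.on("online", () => console.log("network worker started"));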
Drive-by observation: the number of cpu-migrations in the worker thread is almost twice as high. That's bad for performance (flushes CPU caches, page tables, etc.) but is probably explained by the kernel rescheduling the thread when it incurs a page fault. Did upping the new space reduce the number of page faults and migrations?
Thanks for checking in and the SUPER quick response last week!! I still need to add the metric for IPC communication, and we had a bit of a fake-out with loop-lag metrics because the node that was deployed to first was seeing half the amount of network traffic as the others (just a function of the number of peers it was communicating with). I moved the branch to another cluster and it has been stabilizing for a few days now. I pulled updated numbers:

Run 1

Performance counter stats for process id '838397':
0 alignment-faults # 0.000 /sec
0 bpf-output # 0.000 /sec
71,029 cgroup-switches # 623.165 /sec
72,103 context-switches # 632.588 /sec
113,959.82 msec cpu-clock # 1.899 CPUs utilized
1,956 cpu-migrations # 17.161 /sec
0 emulation-faults # 0.000 /sec
0 major-faults # 0.000 /sec
88,493 minor-faults # 776.383 /sec
88,493 page-faults # 776.383 /sec
114,002.29 msec task-clock # 1.900 CPUs utilized
60.001479677 seconds time elapsed

Run 2

Performance counter stats for process id '838397':
0 alignment-faults # 0.000 /sec
0 bpf-output # 0.000 /sec
67,617 cgroup-switches # 576.157 /sec
71,075 context-switches # 605.623 /sec
117,337.69 msec cpu-clock # 1.956 CPUs utilized
2,192 cpu-migrations # 18.678 /sec
0 emulation-faults # 0.000 /sec
0 major-faults # 0.000 /sec
71,858 minor-faults # 612.295 /sec
71,858 page-faults # 612.295 /sec
117,379.38 msec task-clock # 1.956 CPUs utilized
60.001200425 seconds time elapsed

The event loop lag is still pretty high, though the network response times have come down a bit. I have an unrelated branch that I am nearly done with, and then I was going to switch back over and add the worker/main thread communication metrics so we can see what is cooking there (a rough sketch of one way to measure this follows the listing below). As a note, I was able to confirm that we have cgroups v2 available:

admin@feat2:~$ cd /sys/fs/cgroup/
admin@feat2:/sys/fs/cgroup$ ls
cgroup.controllers cgroup.subtree_control cpu.stat io.cost.qos memory.pressure sys-kernel-debug.mount
cgroup.max.depth cgroup.threads dev-hugepages.mount io.pressure memory.stat sys-kernel-tracing.mount
cgroup.max.descendants cpu.pressure dev-mqueue.mount io.prio.class misc.capacity system.slice
cgroup.procs cpuset.cpus.effective init.scope io.stat sys-fs-fuse-connections.mount user.slice
cgroup.stat cpuset.mems.effective io.cost.model memory.numa_stat sys-kernel-config.mount
admin@feat2:/sys/fs/cgroup$ cat cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
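Regarding the worker/main-thread communication metrics mentioned above, one rough way to measure message latency is a timestamped echo between the threads; a sketch under the assumption of a simple ping/pong protocol and a CommonJS build (this is not Lodestar's actual messaging layer):

import { Worker, isMainThread, parentPort } from "node:worker_threads";

type PingMsg = { type: "ping" | "pong"; sentAt: string };

if (isMainThread) {
  // Re-run this same file as the worker (relies on CommonJS __filename).
  const worker = new Worker(__filename);

  // Send a timestamped ping once a second and log the round-trip time on the pong.
  setInterval(() => {
    worker.postMessage({ type: "ping", sentAt: process.hrtime.bigint().toString() });
  }, 1_000);

  worker.on("message", (msg: PingMsg) => {
    if (msg.type === "pong") {
      // Includes serialization plus queueing delay on both event loops.
      const rttMs = Number(process.hrtime.bigint() - BigInt(msg.sentAt)) / 1e6;
      console.log(`main<->worker round trip: ${rttMs.toFixed(2)} ms`);
    }
  });
} else {
  parentPort?.on("message", (msg: PingMsg) => {
    if (msg.type === "ping") parentPort?.postMessage({ type: "pong", sentAt: msg.sentAt });
  });
}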
CPU is reduced from >300% to ~200% on a mainnet node in #5747 (comment). I see better event loop lag there vs unstable.
After looking at benchmarks of worker threads vs child processes: the drawbacks of child processes mentioned in https://github.com/orgs/nodejs/discussions/44264#discussioncomment-3428883 don't really apply in our case, since we spin up the network worker just once and we are not taking advantage of shared memory. But the drawbacks of worker_threads might cause the issues we are seeing.
Another good argument for using a child process is mentioned in piscinajs/piscina#81 (comment): it would provide better DoS protection for the main thread, which is one of the main goals mentioned in #5447.
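For comparison, a minimal sketch of the child-process alternative using Node's built-in fork() IPC channel; the file name and message shapes are illustrative placeholders, not an agreed design:

import { fork } from "node:child_process";

// Run the network stack in its own OS process; a stall or crash there cannot
// block the main thread's event loop, which is the DoS-protection argument above.
const networkProc = fork("./networkProcess.js", [], {
  // "advanced" serialization uses the structured clone algorithm,
  // closest to what worker_threads postMessage provides today.
  serialization: "advanced",
});

networkProc.on("message", (msg) => {
  // Events flowing back from the network process (gossip messages, peer events, ...).
  console.log("from network process:", msg);
});

networkProc.on("exit", (code) => {
  console.log(`network process exited with code ${code ?? "unknown"}`);
});

// Example request to the network process.
networkProc.send({ type: "subscribeGossipTopic", topic: "beacon_block" });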
Problem description
General issue to recap observations of enabling the network worker on high-load nodes.
On June 25th the network worker was enabled on Goerli on the unstable group nodes. The lg1k node, with 1000 validators connected, shows bad network performance, with long response times and other metrics indicating very slow network handling.
During the week of June 25th we deployed the network worker on the feat1 group nodes, turning the network worker on and off (moving the network stack to the main thread). The metrics below capture both configurations.
Below, active handles in pink = network worker on.
Note there's an unrelated memory leak going on that's being addressed in another issue.
feat1-mainnet node running on mainnet with --subscribe-all-subnets, no validators attached, version v1.8.0/888c599, timeframe: 5/24 till 5/30
The biggest difference when using the network worker is the huge increase in outbound network traffic: outbound gossip traffic jumped from 0.3 MB/s to 2 MB/s (~6x). Before, this node dropped 50% of incoming attestations; now it processes everything relatively timely.
feat1-mainnet node running on mainnet with --subscribe-all-subnets, no validators attached, version v1.8.0/888c599, timeframe: 5/27 till 6/02
Confirmed with system metrics.
Questions
Solution description
--
Additional context
No response