
Network worker extremely busy / high event loop lag / high I/O lag #5604

Closed
dapplion opened this issue Jun 2, 2023 · 11 comments · Fixed by #5800, #5896 or #5898
Labels
meta-feature-request Issues to track feature requests.
Comments

@dapplion
Contributor

dapplion commented Jun 2, 2023

Problem description

General issue to recap observations of enabling the network worker on high load nodes.

On May 25th the network worker was enabled on Goerli on the unstable group nodes. The lg1k node, with 1000 validators connected, shows bad network performance, with long response times and other metrics indicating very slow network handling.

On the week of May 25th we deployed the network worker on the feat1 group nodes, turning the network worker on and off (moving the network stack back to the main thread). The metrics below capture:

  • after enabling the worker the main thread got less busy (green line decreasing)
  • network worker is extremely busy, causing bad network behaviour

Below, active handles in pink = network worker on.
Note: there's an unrelated memory leak going on that's being addressed in another issue.
feat1-mainnet node running on mainnet with --subscribe-all-subnets, no validators attached, version v1.8.0/888c599, timeframe: 5/24 till 5/30

[metrics screenshots]

The biggest difference when using the network worker is the huge increase in outbound network traffic. Outbound gossip traffic jumped from 0.3 MB/s to 2 MB/s (~6x). Before, this node dropped 50% of incoming attestations; now it processes everything relatively timely.

feat1-mainnet node running on mainnet with --subscribe-all-subnets, no validators attached, version v1.8.0/888c599, timeframe: 5/27 till 6/02

[metrics screenshot]

confirmed with system metrics

[system metrics screenshot]

Questions

  • What is the network worker doing that chokes so much?
  • Why is the overall load of the node up so much?
  • Is there any overhead of having most OS I/O interactions in a worker?

Solution description

--

Additional context

No response

@dapplion dapplion added the meta-feature-request Issues to track feature requests. label Jun 2, 2023
@matthewkeil
Member

matthewkeil commented Jun 5, 2023

@dapplion I have come up with two strategies to poll the worker to generate profiles.

The first is the perf record command with the --tid=*** flag to target the worker thread (and its libuv children). There is also a --per-thread flag to create a separate mmap per thread, but I am not sure how that will play with the V8 output; it's not working as expected yet, so I need to figure out why. I am experimenting with it now and will report back as I know more.

https://man7.org/linux/man-pages/man1/perf-record.1.html

This is what I found on one of the boxes running v1.9.0-rc.0, so it definitely looks like it's going to be one of the second batch on beacon.

admin   1814305  212 10.4 209601852 6904800 ?   Ssl  Jun01 10330:54 node ./packages/cli/bin/lodestar beacon --rcConfig /data/rcconfig.yml
admin   1814731  2.8  0.5 54947096 342896 ?     Ssl  Jun01 139:35 node ./packages/cli/bin/lodestar validator --rcConfig /data/rcconfig.yml
admin@beta:~/beacon$ ls /proc/1814305/task
1814305  1814318  1814319  1814320  1814321  1814322  1814323  1814325  1814326  1814327  1814328  1814335  1814338  1814339  1814340  1814341  1814342  1814343  1814344  1814345  1814346  1814347  1814348  1814349  1814352
admin@beta:~/beacon$ ls /proc/1814731/task
1814731  1814743  1814744  1814745  1814746  1814747  1814748  1814749  1814750  1814751  1814752  1814771
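
For illustration, a hedged sketch of pointing perf at one of the beacon task ids listed above (the tid, sample rate, and duration here are placeholders, not a prescribed workflow):

# sample call stacks of a single thread at 99 Hz for 30 seconds
sudo perf record -F 99 -g --tid=1814318 -- sleep 30
# inspect the recorded samples
sudo perf report --stdio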

The second is a native library that plugs into the node inspector protocol and can generate profiles for the worker thread. This will likely not show native system calls (like the current profiler), e.g. network calls, but I will attempt to plug it in and see what happens.

https://github.com/hyj1991/v8-profiler-next

I suppose there is a third option, but it's really just an extension of the first option: get the correct tid from the runtime.

worker.threadId is available through the standard node api, and it looks like it returns the uv_thread_t, which is a typedef for pthread_t, so it should theoretically correlate with one of the entries in the /proc/<PID>/task dir. If the above does not work I will push a branch that outputs it as a last resort, but we should be able to get the tid right from the system without having to ask node...
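
As a rough sketch of what such a branch could log (file name and log format are illustrative only), the thread id is available from the main thread via the standard worker_threads API:

// sketch: print the network worker's threadId so it can be compared against
// the entries in /proc/<PID>/task (the mapping is assumed, per the caveat above)
import {Worker} from "node:worker_threads";

const networkWorker = new Worker("./networkWorker.js"); // hypothetical worker entry point
console.log(`pid=${process.pid} networkWorker.threadId=${networkWorker.threadId}`);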

tl;dr: just need to figure out the right tids to poll, but getting closer...

@matthewkeil
Member

🎉 HOORAY!! Got a worker thread flamegraph.

Something isn't correct, but it's definitely starting to "work". A couple more tries and I think I will have it working as expected.

[flamegraph screenshot, 2023-06-06]

@matthewkeil
Member

Forwarded from discord on behalf of @dapplion:

The process will be included in the lodestar docs at a future date.

It was collected via:

sudo perf record -F 999 -p $(pgrep -f '/usr/src/lodestar/packages/cli/bin/lodestar beacon') -g -- sleep 60

and processed with:

sudo perf script -F +pid,+tid -f > perf.out

The flamegraph was created via inferno with the following commands:

cargo install inferno
cat perf.out | inferno-collapse-perf --pid --tid --all | inferno-flamegraph > out.svg

and made more readable by removing font-family:monospace from the svg source.

The results were:

Thread ID 5905 takes 48.2% of total CPU time; there's a bunch of overhead with:
- libc.so: 3% (@g11tech, is this a linux lib for linking?)
- epoll: 0.8% (poll i/o)
- garbage collection: 1.2%
- OS tcp handling: 1%

JS stacks account for 36.7% of total CPU, of which:
- mplex: 4.2%
- gossipsub: the rest

Looking in depth at the stacks:
- noise: 1.5%
- as-sha-256: 1.9%
- MessagePort.postMessage: 1.4%

[flamegraph images: threaded_flamegraph, out.svg]

@twoeths
Contributor

twoeths commented Jun 12, 2023

Attaching network thread profiles from #5628

0610_network_thread_lg1k.zip
0610_network_thread_mainnet.zip

@twoeths
Contributor

twoeths commented Jun 12, 2023

Found this in #5604 (comment) when we run libp2p in a separate worker thread

[profile screenshot, 2023-06-12]

In lodestar v1.8.0 there are a lot of small runMicroTasks calls (80_041), while the network thread version has far fewer, bigger runMicroTasks calls (8_978).

I think it's due to more sleep(0) calls in v1.8.0; we need to add more calls like that in the network thread version.
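
As a minimal sketch of that idea (the helper and batch size are illustrative, not Lodestar's actual code), long microtask strands can be broken up by yielding to the event loop with a zero-delay timer:

// sleep(0) schedules a macrotask, so awaiting it lets timers and I/O callbacks run
const sleep = (ms: number): Promise<void> => new Promise((resolve) => setTimeout(resolve, ms));

async function processGossipQueue(jobs: Array<() => Promise<void>>): Promise<void> {
  for (let i = 0; i < jobs.length; i++) {
    await jobs[i]();
    // hypothetical batch size: yield every 64 jobs so the loop is not starved
    if (i % 64 === 63) await sleep(0);
  }
}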

@wemeetagain
Member

Yeah, seems like the network thread has long uninterrupted strands of microtask queue work.

@matthewkeil
Member

Making some headway on this after speaking with @bnoordhuis. He gave us some good breadcrumbs to follow, and after pulling some more perf data it's pointing to page faults as a likely cause.

~/beacon$ sudo perf stat -e "alignment-faults,bpf-output,cgroup-switches,context-switches,cpu-clock,cpu-migrations,emulation-faults,major-faults,minor-faults,page-faults,task-clock" -p $(pgrep -f '/usr/src/lodestar/packages/cli/bin/lodestar beacon') sleep 60

Without worker loop:
 Performance counter stats for process id '1090':

                 0      alignment-faults          #    0.000 /sec                   
                 0      bpf-output                #    0.000 /sec                   
            66,412      cgroup-switches           #  447.230 /sec                   
            69,557      context-switches          #  468.409 /sec                   
        148,477.87 msec cpu-clock                 #    2.475 CPUs utilized          
             2,124      cpu-migrations            #   14.303 /sec                   
                 0      emulation-faults          #    0.000 /sec                   
                91      major-faults              #    0.613 /sec                   
            87,603      minor-faults              #  589.933 /sec                   
            87,694      page-faults               #  590.546 /sec                   
        148,515.02 msec task-clock                #    2.475 CPUs utilized          

      60.001487911 seconds time elapsed

With the worker loop:
 Performance counter stats for process id '31161':

                 0      alignment-faults          #    0.000 /sec                   
                 0      bpf-output                #    0.000 /sec                   
            83,721      cgroup-switches           #  443.785 /sec                   
            94,519      context-switches          #  501.023 /sec                   
        188,626.29 msec cpu-clock                 #    3.144 CPUs utilized          
             3,892      cpu-migrations            #   20.631 /sec                   
                 0      emulation-faults          #    0.000 /sec                   
                52      major-faults              #    0.276 /sec                   
           265,900      minor-faults              #    1.409 K/sec                  
           265,952      page-faults               #    1.410 K/sec                  
        188,677.76 msec task-clock                #    3.145 CPUs utilized          

      60.001568935 seconds time elapsed

The root cause of the page faults is likely GC, but a resolution still needs to be determined. As a first pass at resolving the issue I upped the new space size, as there was a considerable amount of scavenge GC happening. The new space was set to 32mb and 64mb.

The quantity of network thread scavenge leveled off at roughly 18mb, so setting it slightly higher than the default 16mb will have some benefit, but it is not the root cause.

[metrics screenshot, 2023-07-11]
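
For reference, a minimal sketch (not necessarily how Lodestar wires this up) of raising the worker's young generation / new space via the worker_threads resourceLimits option:

import {Worker} from "node:worker_threads";

// assumption: the network worker is created roughly like this; bumping the young
// generation above the ~16mb default gives the frequent scavenge GC more headroom
const networkWorker = new Worker("./networkWorker.js", {
  resourceLimits: {maxYoungGenerationSizeMb: 32},
});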

Event loop lag was measurably down on one instance but basically unchanged on another, so the results are mixed. Network traffic was much lower on the better-performing node, though. There are a few more corners to look in, and I will circle back and message Ben to let him know what I found with regard to IPC channel queue and messaging lag.

[metrics screenshots: memory-defaults, new-space-32mb]

@bnoordhuis

Drive-by observation: the number of cpu-migrations in the worker thread is almost twice as high.

That's bad for performance (flushes cpu caches, page tables, etc.) but is probably explained by the kernel rescheduling the thread when it incurs a page fault.

Did upping the new space reduce the number of page faults and migrations?

@matthewkeil
Member

Thanks for checking in and the SUPER quick response last week!! I still need to add the metric for IPC communication, and we had a bit of a fake-out with loop-lag metrics because the node that was deployed to first was seeing half the amount of network traffic as the others (just a function of the number of peers it was communicating with).

I moved the branch to another cluster and it has been stabilizing for a few days now. I pulled updated stats and it does seem like the minor and page faults are way down, near those of the non-worker case. 🎉 The metrics are showing that we scavenge roughly 17mb on the worker, so bumping the new space was probably prudent. I also did a test bumping to 64mb, but there was no difference from 32, so we have probably raised it enough to cover the workload.

Run 1

Performance counter stats for process id '838397':

                 0      alignment-faults          #    0.000 /sec                   
                 0      bpf-output                #    0.000 /sec                   
            71,029      cgroup-switches           #  623.165 /sec                   
            72,103      context-switches          #  632.588 /sec                   
        113,959.82 msec cpu-clock                 #    1.899 CPUs utilized          
             1,956      cpu-migrations            #   17.161 /sec                   
                 0      emulation-faults          #    0.000 /sec                   
                 0      major-faults              #    0.000 /sec                   
            88,493      minor-faults              #  776.383 /sec                   
            88,493      page-faults               #  776.383 /sec                   
        114,002.29 msec task-clock                #    1.900 CPUs utilized          

      60.001479677 seconds time elapsed

Run 2

Performance counter stats for process id '838397':

                 0      alignment-faults          #    0.000 /sec                   
                 0      bpf-output                #    0.000 /sec                   
            67,617      cgroup-switches           #  576.157 /sec                   
            71,075      context-switches          #  605.623 /sec                   
        117,337.69 msec cpu-clock                 #    1.956 CPUs utilized          
             2,192      cpu-migrations            #   18.678 /sec                   
                 0      emulation-faults          #    0.000 /sec                   
                 0      major-faults              #    0.000 /sec                   
            71,858      minor-faults              #  612.295 /sec                   
            71,858      page-faults               #  612.295 /sec                   
        117,379.38 msec task-clock                #    1.956 CPUs utilized          

      60.001200425 seconds time elapsed

The event loop lag is still pretty high, but the network response times have come down a bit.

I have an unrelated branch that I am nearly done with, and then I will switch back over and add the worker/main-thread communication metrics so we can see what is cooking there.

As a note, I was able to confirm that we have cgroupv2 on our machines:

admin@feat2~$ cd /sys/fs/cgroup/
admin@feat2:/sys/fs/cgroup$ ls
cgroup.controllers      cgroup.subtree_control  cpu.stat             io.cost.qos       memory.pressure                sys-kernel-debug.mount
cgroup.max.depth        cgroup.threads          dev-hugepages.mount  io.pressure       memory.stat                    sys-kernel-tracing.mount
cgroup.max.descendants  cpu.pressure            dev-mqueue.mount     io.prio.class     misc.capacity                  system.slice
cgroup.procs            cpuset.cpus.effective   init.scope           io.stat           sys-fs-fuse-connections.mount  user.slice
cgroup.stat             cpuset.mems.effective   io.cost.model        memory.numa_stat  sys-kernel-config.mount
admin@feat2:/sys/fs/cgroup$ cat cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc

@philknows philknows modified the milestones: v1.10.0, v1.11.0 Jul 13, 2023
@twoeths
Contributor

twoeths commented Jul 20, 2023

Drive-by observation: the number of cpu-migrations in the worker thread is almost twice as high.

CPU is reduced from >300% to ~200% on a mainnet node in #5747 (comment). I see better event loop lag there:

[event loop lag screenshot, 2023-07-20]

vs unstable

[event loop lag screenshot: unstable, 2023-07-20]

@nflaig
Member

nflaig commented Aug 5, 2023

After looking at benchmarks of worker_threads vs child_process.fork here https://github.com/orgs/nodejs/discussions/44264, I really wonder if it is good to use a worker_thread for the network.

The drawbacks of child processes mentioned in https://github.com/orgs/nodejs/discussions/44264#discussioncomment-3428883 don't really apply in our case, since we spin up the network worker just once and we are not taking advantage of shared memory.

But the drawbacks of worker_threads might cause the issues we are seeing:

the former (worker_thread) will use a CPU slice in the process schedule while the latter (child_process) practically gets a better CPU share, being a full process.

Another good argument for using a child process is mentioned in piscinajs/piscina#81 (comment): it would provide better DOS protection for the main thread, which is one of the main goals mentioned in #5447.
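
A minimal sketch of that alternative (file name, message shape, and handlers are purely illustrative): spawning the network stack as a full child process and talking to it over the IPC channel.

import {fork} from "node:child_process";

// the child gets its own process, so the OS scheduler gives it a full CPU share
const networkProcess = fork("./networkProcess.js");

networkProcess.on("message", (msg) => {
  // messages are serialized over IPC rather than shared/transferred as with worker_threads
  console.log("from network process:", msg);
});

// hypothetical request shape
networkProcess.send({method: "publishGossip", topic: "beacon_block", data: "0x..."});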
