lightning_hsmd at 100% CPU on cln 24.08 mainnet #7655

Closed
daywalker90 opened this issue Sep 10, 2024 · 9 comments · Fixed by #7661

@daywalker90
Contributor

I've seen this multiple times now. lightning_hsmd sits at 100% CPU and I have to restart CLN to get it back to normal. Strangely, this is not happening with my testnet nodes running on the same machine with the same binaries.

@Impa10r

Impa10r commented Sep 10, 2024

Concur. But I also see both lightning_hsmd and lightning_gossipd keep running at 100% after I shut down lightningd:

$ sudo systemctl stop lightningd
$ sudo systemctl status lightningd
○ lightningd.service - Core Lightning daemon
     Loaded: loaded (/etc/systemd/system/lightningd.service; disabled; vendor preset: enabled)
     Active: inactive (dead) since Tue 2024-09-10 15:14:36 CEST; 5min ago
    Process: 17183 ExecStart=/usr/bin/lightningd (code=killed, signal=TERM)
   Main PID: 17183 (code=killed, signal=TERM)
      Tasks: 2 (limit: 9438)
     Memory: 101.4M
        CPU: 19min 34.980s
     CGroup: /system.slice/lightningd.service
             ├─17252 /usr/bin/../libexec/c-lightning/lightning_hsmd
             └─17278 /usr/bin/../libexec/c-lightning/lightning_gossipd

sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17224 (sql) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17226 (bookkeeper) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17230 (peerswap) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17252 (lightning_hsmd) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17278 (lightning_gossi) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 73560 (circular) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: Stopped Core Lightning daemon.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Consumed 8min 33.185s CPU time.
sep 10 15:14:36 vlad-VirtualBox lightningd[17215]: Lost connection to the RPC socket.
sep 10 15:14:36 vlad-VirtualBox lightningd[17220]: Lost connection to the RPC socket.

$ top
top - 15:18:08 up  3:06,  2 users,  load average: 3,62, 2,47, 1,64
Tasks: 295 total,   5 running, 288 sleeping,   0 stopped,   2 zombie
%Cpu(s):  0,4 us,  0,5 sy,  0,0 ni, 99,1 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
MiB Mem :   7941,8 total,   3416,4 free,    827,8 used,   3697,6 buff/cache
MiB Swap:   2048,0 total,   1215,6 free,    832,4 used.   6816,5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  17252 vlad      20   0    5620   3328   3328 R 100,0   0,0   3:41.55 lightning_hsmd
  17278 vlad      20   0   20148   8072   3592 R 100,0   0,1   3:33.24 lightning_gossi
      1 root      20   0  167692  12244   8148 S   0,0   0,2   0:13.00 systemd
      2 root      20   0       0      0      0 S   0,0   0,0   0:00.05 kthreadd
      3 root      20   0       0      0      0 S   0,0   0,0   0:00.00 pool_workqueue_release
      4 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 kworker/R-rcu_g
      5 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 kworker/R-rcu_p
      6 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 kworker/R-slub_
      7 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 kworker/R-netns

They prevent my Ubuntu 22 from shutting down!

@rustyrussell
Contributor

Hmmmm, the best way to debug this is:

sudo perf record --call-graph=dwarf -p `pidof lightning_hsmd`

Then sudo perf report will let you explore what is taking all the time.

Obviously this only works when hsmd is consuming lots of CPU.

Without knowing what it's doing, this is hard to debug! Presumably something is hammering it with requests: if we can figure out what the request is, that might give us a clue!

@junity42

Same problem here: one CPU at 100% load from lightning_hsmd, and sometimes gossipd too. I have to manually kill the stuck processes; they won't stop when I simply restart the c-lightning process with systemctl.
I've run the perf commands; here is the result, I hope it helps:
[Screenshot attached: perf report, 2024-09-11]

@daywalker90
Contributor Author

[Image attached]

@rustyrussell
Contributor

Thanks! Ok, one more request, what does sudo strace -p $(pidof lightning_hsmd) -o /tmp/trace give? It will run forever, so just let it run for a few seconds then Control-C.

From this report, and another one, it's slamming the read/poll loop. I might have to add some debug for this. It doesn't happen for me :(

@junity42

My instance is running on an RPi 4; the 100% CPU bug happens after approximately half a day of normal running, every day it seems.
I ran the strace; results are attached.
trace.txt

@daywalker90
Contributor Author

daywalker90 commented Sep 12, 2024

Another one with both commands:

[Image attached]

trace.zip

@grubles
Contributor

grubles commented Sep 12, 2024

Noticed my lightning_hsmd was running at 100% today.

Samples: 204K of event 'cycles:P', Event count (approx.): 192624081792                                 
  Children      Self  Command         Shared Object      Symbol                                        
+   99.80%     0.00%  lightning_hsmd  libc.so.6          [.] __libc_start_main@@GLIBC_2.34             
+   99.80%     0.00%  lightning_hsmd  libc.so.6          [.] __libc_start_call_main                    
+   99.80%     0.00%  lightning_hsmd  lightning_hsmd     [.] main                                      
+   99.80%     1.03%  lightning_hsmd  lightning_hsmd     [.] io_loop                                   
+   95.34%     1.38%  lightning_hsmd  [kernel.kallsyms]  [k] system_call_vectored_common               
+   93.92%     1.60%  lightning_hsmd  [kernel.kallsyms]  [k] system_call_exception                     
+   89.70%     0.29%  lightning_hsmd  lightning_hsmd     [.] daemon_poll                               
+   88.26%     0.50%  lightning_hsmd  libc.so.6          [.] __poll                                    
+   85.50%     0.12%  lightning_hsmd  [kernel.kallsyms]  [k] sys_poll                                  
+   85.36%     6.22%  lightning_hsmd  [kernel.kallsyms]  [k] do_sys_poll                               
+   42.76%    13.25%  lightning_hsmd  [kernel.kallsyms]  [k] sock_poll                                 
+   32.89%     0.00%  lightning_hsmd  [unknown]          [k] 0000000000000000                          
+   31.10%    14.36%  lightning_hsmd  [kernel.kallsyms]  [k] unix_poll                                 
+   24.14%     0.36%  lightning_hsmd  [kernel.kallsyms]  [k] poll_freewait                             
+   16.69%     6.36%  lightning_hsmd  [kernel.kallsyms]  [k] __pollwait                                
+   15.77%    15.76%  lightning_hsmd  [kernel.kallsyms]  [k] fput                                      
+   10.78%    10.78%  lightning_hsmd  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave                    
+    8.86%     0.12%  lightning_hsmd  lightning_hsmd     [.] io_ready                                  
+    8.86%     0.18%  lightning_hsmd  lightning_hsmd     [.] do_plan                                   
+    8.33%     0.25%  lightning_hsmd  lightning_hsmd     [.] do_read_wire                              
+    8.13%     0.21%  lightning_hsmd  lightning_hsmd     [.] do_read_wire_header                       
+    7.87%     0.46%  lightning_hsmd  libc.so.6          [.] read                                      
+    7.33%     0.39%  lightning_hsmd  [kernel.kallsyms]  [k] remove_wait_queue                         
+    6.51%     0.63%  lightning_hsmd  [kernel.kallsyms]  [k] add_wait_queue                            
+    5.33%     5.33%  lightning_hsmd  [kernel.kallsyms]  [k] __fget_light                              
+    5.20%     0.22%  lightning_hsmd  [kernel.kallsyms]  [k] ksys_read                                 
+    5.16%     0.07%  lightning_hsmd  [kernel.kallsyms]  [k] get_free_pages_noprof                     
+    4.92%     0.05%  lightning_hsmd  [kernel.kallsyms]  [k] alloc_pages_mpol_noprof                   
+    4.73%     0.22%  lightning_hsmd  [kernel.kallsyms]  [k] vfs_read                                  
+    4.39%     0.15%  lightning_hsmd  [kernel.kallsyms]  [k] __alloc_pages_noprof                      
+    4.13%     0.64%  lightning_hsmd  [kernel.kallsyms]  [k] get_page_from_freelist                    
+    3.15%     0.09%  lightning_hsmd  [kernel.kallsyms]  [k] post_alloc_hook                           
+    3.07%     3.06%  lightning_hsmd  [kernel.kallsyms]  [k] kernel_init_pages                         
+    2.83%     0.54%  lightning_hsmd  [kernel.kallsyms]  [k] arch_local_irq_restore                    
+    2.64%     0.22%  lightning_hsmd  [kernel.kallsyms]  [k] sock_read_iter                            

rustyrussell added a commit to rustyrussell/lightning that referenced this issue Sep 13, 2024
Our low-level ccan/io IO routines return three values:
-1: error.
0: call me again, I'm not finished.
1: I'm done, go onto the next thing.

In the last release, we tweaked the semantics of "-1": we now opportunistically
call a routine which returns 0 once more, in case there's more data.  We use errno to
distinguish between "EAGAIN" which means there wasn't any data, and real errors.

However, if the underlying read() returns 0 (which it does when the peer has closed
the other end) the value of errno is UNDEFINED.  If it happens to be EAGAIN, we will
call it again, rather than closing.  This causes us to spin: in particular people reported
hsmd consuming 100% of CPU.

The ccan/io read code handled this by setting errno to 0 in this case, but our own
wire low-level routines *did not*.

Fixes: ElementsProject#7655
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
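
To make the failure mode concrete, here is a minimal sketch of the read-helper pattern the commit describes. The struct and function names are illustrative, not the actual CLN/ccan code: the helper follows the -1/0/1 return convention, and because a 0-byte read() (EOF) leaves errno untouched, it must clear errno before returning -1 or a caller that retries on EAGAIN will loop forever.

/* Minimal sketch of the pattern described above -- illustrative names,
 * not the actual CLN/ccan code.  Convention: -1 = error, 0 = call me
 * again, 1 = done.  The caller treats (-1, errno == EAGAIN) as "no data
 * yet, retry", so errno must be unambiguous on every -1 return. */
#include <errno.h>
#include <unistd.h>

struct read_state {
        char *buf;
        size_t remaining;       /* bytes still wanted */
};

static int do_read_part(int fd, struct read_state *s)
{
        ssize_t ret = read(fd, s->buf, s->remaining);

        if (ret < 0)
                return -1;      /* real error, or EAGAIN meaning "no data yet" */

        if (ret == 0) {
                /* EOF: the peer closed its end.  read() does not touch errno
                 * here, so a stale EAGAIN may still be sitting in it.  Without
                 * this reset, a caller that retries on EAGAIN calls us again
                 * immediately, forever -- the 100% CPU spin seen in hsmd. */
                errno = 0;
                return -1;
        }

        s->buf += ret;
        s->remaining -= ret;
        return s->remaining == 0 ? 1 : 0;       /* 1 = done, 0 = more needed */
}

With errno zeroed at EOF, the caller can distinguish "peer hung up" from "try again later" and close the connection instead of spinning.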
@rustyrussell
Contributor

OK, this did it! I found the case where we get a 0 read and ignore EOF. Thank you, everyone, for your patience: I've added this to the v24.08.1 milestone, which will obviously need to be released soon.

rustyrussell added a commit to rustyrussell/lightning that referenced this issue Sep 13, 2024
Our low-level ccan/io IO routines return three values:
-1: error.
0: call me again, I'm not finished.
1: I'm done, go onto the next thing.

In the last release, we tweaked the semantics of "-1": we now opportunistically
call a routine which returns 0 once more, in case there's more data.  We use errno to
distinguish between "EAGAIN" which means there wasn't any data, and real errors.

However, if the underlying read() returns 0 (which it does when the peer has closed
the other end) the value of errno is UNDEFINED.  If it happens to be EAGAIN, we will
call it again, rather than closing.  This causes us to spin: in particular people reported
hsmd consuming 100% of CPU.

The ccan/io read code handled this by setting errno to 0 in this case, but our own
wire low-level routines *did not*.

Fixes: ElementsProject#7655
Changelog-Fixed: Fixed intermittent bug where hsmd (particularly, but also lightningd) could use 100% CPU.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
ShahanaFarooqui pushed a commit to ShahanaFarooqui/lightning that referenced this issue Sep 17, 2024