lightning_hsmd at 100% CPU on cln 24.08 mainnet #7655
Concur. But I also see both lightning_hsmd and lightning_gossi keep running at 100% after I shut down lightningd:
They prevent my Ubuntu 22 from shutting down!
Hmmmm, the best way to debug this is:
Then Obviously this only works when hsmd is consuming lots of CPU. Without knowing what it's doing, this is hard to debug! Presumably something is hammering it with requests: if we can figure out what the request is, that might give us a clue!
Thanks! Ok, one more request, what does From this report, and another one, it's slamming the read/poll loop. I might have to add some debug for this. It doesn't happen for me :(
My instance is running on a rpi4, the 100% cpu bug happens after approximately half a day of ok running, every day it seems.
Another one with both commands:
Noticed my
Our low-level ccan/io IO routines return three values:
-1: error.
0: call me again, I'm not finished.
1: I'm done, go onto the next thing.

In the last release, we tweaked the semantics of "-1": we now opportunistically call a routine which returns 0 once more, in case there's more data. We use errno to distinguish between "EAGAIN", which means there wasn't any data, and real errors.

However, if the underlying read() returns 0 (which it does when the peer has closed the other end) the value of errno is UNDEFINED. If it happens to be EAGAIN, we will call it again, rather than closing. This causes us to spin: in particular people reported hsmd consuming 100% of CPU.

The ccan/io read code handled this by setting errno to 0 in this case, but our own wire low-level routines *did not*.

Fixes: ElementsProject#7655
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
OK, this did it! I found the case where we get a 0 read and ignore EOF. Thank you, everyone, for your patience: I've added this to the v24.08.1 milestone, which will obviously need to be released soon.
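For anyone following along, here is a minimal sketch of the failure mode described in the commit message. It is not the actual ccan/io or wire code: `struct partial_read`, `read_step`, `should_retry` and `eof_is_treated_as_close` are hypothetical names made up for illustration, and the return convention is only assumed to match the -1/0/1 scheme above. The point is that read() returning 0 does not touch errno, so a stale EAGAIN has to be cleared explicitly before the caller inspects it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical partial-read state; not the real ccan/io or wire structures. */
struct partial_read {
	char *buf;
	size_t len; /* total bytes wanted */
	size_t off; /* bytes received so far */
};

/* Returns 1 when the buffer is full, 0 when more data is still needed,
 * and -1 on failure. On -1, errno == EAGAIN means "no data right now";
 * any other errno value (including 0 for EOF) means "close". */
int read_step(int fd, struct partial_read *pr)
{
	ssize_t r;

	assert(pr->off < pr->len);
	r = read(fd, pr->buf + pr->off, pr->len - pr->off);
	if (r == 0) {
		/* EOF: read() leaves errno untouched, so a stale EAGAIN
		 * could still be sitting there. Clear it explicitly. */
		errno = 0;
		return -1;
	}
	if (r < 0)
		return -1; /* errno was set by read() */
	pr->off += (size_t)r;
	return pr->off == pr->len ? 1 : 0;
}

/* Caller-side decision: retry only on a genuine "no data yet". */
int should_retry(int ret)
{
	return ret == -1 && errno == EAGAIN;
}

/* Demo: the peer closes its end while errno holds a stale EAGAIN.
 * Returns 1 if the EOF is treated as a close, 0 if it would be retried. */
int eof_is_treated_as_close(void)
{
	int fds[2], ret, ok;
	char buf[4];
	struct partial_read pr = { buf, sizeof(buf), 0 };

	if (pipe(fds) != 0)
		return 0;
	close(fds[1]);  /* peer hangs up */
	errno = EAGAIN; /* stale value left over from an earlier call */
	ret = read_step(fds[0], &pr);
	ok = (ret == -1 && !should_retry(ret));
	close(fds[0]);
	return ok;
}
```

If the `errno = 0;` line is removed from `read_step`, the demo's stale EAGAIN survives the EOF, `should_retry` says yes, and a caller looping on it would spin, which is consistent with the 100% CPU reports in this thread.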
Our low-level ccan/io IO routines return three values:
-1: error.
0: call me again, I'm not finished.
1: I'm done, go onto the next thing.

In the last release, we tweaked the semantics of "-1": we now opportunistically call a routine which returns 0 once more, in case there's more data. We use errno to distinguish between "EAGAIN", which means there wasn't any data, and real errors.

However, if the underlying read() returns 0 (which it does when the peer has closed the other end) the value of errno is UNDEFINED. If it happens to be EAGAIN, we will call it again, rather than closing. This causes us to spin: in particular people reported hsmd consuming 100% of CPU.

The ccan/io read code handled this by setting errno to 0 in this case, but our own wire low-level routines *did not*.

Fixes: ElementsProject#7655
Changelog-Fixed: Fixed intermittent bug where hsmd (particularly, but also lightningd) could use 100% CPU.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I've seen this multiple times now. lightning_hsmd sits at 100% CPU and I have to restart cln to get it back to normal. Strangely, this is not happening with my testnet nodes running on the same machine with the same binaries.