Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

subds: fix case where we keep retrying on EOF. #7661

Merged

Conversation

rustyrussell
Copy link
Contributor

@rustyrussell rustyrussell commented Sep 13, 2024

Our low-level ccan/io IO routines return three values: -1: error.
0: call me again, I'm not finished.
1: I'm done, go onto the next thing.

In the last release, we tweaked the sematics of "-1": we now opportunistically call a routine which returns 0 once more, in case there's more data. We use errno to distinguish between "EAGAIN" which means there wasn't any data, and real errors.

However, if the underlying read() returns 0 (which it does when the peer has closed the other end) the value of errno is UNDEFINED. If it happens to be EAGAIN, we will call it again, rather than closing. This causes us to spin: in particular people reported hsmd consuming 100% of CPU.

The ccan/io read code handled this by setting errno to 0 in this case, but our own wire low-level routines did not.

Fixes: #7655

@rustyrussell rustyrussell added this to the v24.08.1 milestone Sep 13, 2024
Our low-level ccan/io IO routines return three values:
-1: error.
0: call me again, I'm not finished.
1: I'm done, go onto the next thing.

In the last release, we tweaked the sematics of "-1": we now opportunistically
call a routine which returns 0 once more, in case there's more data.  We use errno to
distinguish between "EAGAIN" which means there wasn't any data, and real errors.

However, if the underlying read() returns 0 (which it does when the peer has closed
the other end) the value of errno is UNDEFINED.  If it happens to be EAGAIN, we will
call it again, rather than closing.  This causes us to spin: in particular people reported
hsmd consuming 100% of CPU.

The ccan/io read code handled this by setting errno to 0 in this case, but our own
wire low-level routines *did not*.

Fixes: ElementsProject#7655
Changelog-Fixed: Fixed intermittant bug where hsmd (particularly, but also lightningd) could use 100% CPU.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@ShahanaFarooqui
Copy link
Collaborator

ACK 93b05a4

@ShahanaFarooqui ShahanaFarooqui merged commit 5bd3d51 into ElementsProject:master Sep 13, 2024
38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lightning_hsmd at 100% CPU on cln 24.08 mainnet
2 participants