-
Notifications
You must be signed in to change notification settings - Fork 930
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
transports/tcp: Remove sleep_on_error #2849
Conversation
The `sleep_on_error` mechanism in `libp2p-tcp` would delay the next poll on the listener stream when an error happens. This mechanism was introduced in libp2p#402 based on the [`tk_listen`](https://docs.rs/tk-listen/latest/tk_listen/) crate also referenced in the tokio documentation. When running out of file descriptors, the listening socket would return an `EMFILE`. Instead of polling the socket again, thus likely receiving another `EMFILE`, one would instead wait for 100ms. Modern operating systems should run with high file descriptor limits and thus this error should not happen in the wild. In addition, delaying the next poll only covers up the issue, but does not solve it. With the above in mind, this pull request removes the daly. (The mention of `tk_listen` has since been removed from tokio.)
As far as I can tell, go-libp2p does not pause when witnessing an error on Maybe @marten-seemann @MarcoPolo or @julian88110 can confirm? |
No we don't wait. |
We've seen it a few times, for instance on macos or whatever
You assume this is the only program running on the server, imo it's never ok to burn a thread for no reason
What would solving it look like? You either raise the fd limit, or lower the maximum number of peers
In nimbus, we show a warning at startup if the fd limit is too close / below the maximum peer amount (if we can, it's not straightforward to get the actual limit). Otherwise, we sleep after a failed accept to avoid the busy loop |
Thanks @marten-seemann and @Menduist.
Good point.
Those are the only solutions I can think of as well. My thought was that a busy loop would surface the issue (e.g. low file descriptor count), while the delay would cover the issue up when running in production. Unless one pays attention to the additional logged errors, which I doubt is happening in most setups.
For the record, corresponding logic in Substrate: https://github.com/paritytech/substrate/blob/00cc5f104176fac6f5a624bced22a2192c7c0470/client/cli/src/config.rs#L652-L660 Will give this more thought, but leaning towards keeping and better documenting the delay with this additional input. |
That's a good point |
It might be a good idea to record this as a metric, and make this delay configurable. An operator may want to alert when this happens since it may mean they should increase their FD limits or something is exhausting their FDs (leak?). |
We can return an error via |
I am in favor of @thomaseizinger suggestion above. That said I will not get to this any time soon. In case someone wants to take this over, let us know. |
The error is already reported as of today. Unless another sub-task is woken, this shouldn't result in a busy loop? |
I had a play around with this and pushed a little test harness with a docker container to test the behaviour for when we run out of file descriptors. In the current implementation, the error is already returned but it is a busy loop. Can we perhaps change the implementation that, once we return this particular error once, we return |
🚀 neat! Thank you.
Given the |
I was referring to the implementation in this PR! I am not sure about the stall. Would have to experiment. If you are running out of FD you can't accept new connections anyway? |
Even though you might not be able to accept new connections in such case, I think the issue here is that the sys call will still attempt to do so. Thus there is a busy loop. |
This pull request has merge conflicts. Could you please resolve them @mxinden? 🙏 |
Closing here since stale and not needed. |
Description
The
sleep_on_error
mechanism inlibp2p-tcp
would delay the next poll on thelistener stream when an error happens. This mechanism was introduced in
#402 based on the
tk_listen
crate also referencedin the tokio documentation.
When running out of file descriptors, the listening socket would return an
EMFILE
. Instead of polling the socket again, thus likely receiving anotherEMFILE
, potentially resulting in a busy loop, one would instead wait for 100ms.Modern operating systems should run with high file descriptor limits and thus
this error should not happen in the wild. In addition, delaying the next poll
only covers up the issue, but does not solve it.
Lastly rust-libp2p prioritizes incoming connections the lowest. Thus, while
this could result in a busy loop, it only does in case there is no other work left.
With the above in mind, this pull request removes the daly.
(The mention of
tk_listen
has since been removed from tokio.)Links to any relevant issues
sleep_on_error
transports/tcp: simplify IfWatcher integration #2813 (comment)Open Questions
Change checklist