Skip to content
This repository has been archived by the owner on Jun 3, 2020. It is now read-only.

Network timeouts #310

Closed
tarcieri opened this issue Jul 25, 2019 · 4 comments
Closed

Network timeouts #310

tarcieri opened this issue Jul 25, 2019 · 4 comments
Labels
enhancement New feature or request

Comments

@tarcieri
Copy link
Contributor

The Secret Connection implementation in tendermint-rs presently has no notion of timeouts for things like connecting/"dialing" validators or reading from/writing to sockets.

Without timeouts, there are many cases in which the KMS could potentially deadlock indefinitely inside of blocking I/O operations, and at least one user has experienced this.

Here are a few recommendations:

  • futures/tokio: this is the up-and-coming ecosystem solution to this general problem, and the one I'd recommend. It's a full asynchronous event model which solves, among other things, timeouts. When async/await support ships in Rust 1.38 (scheduled to be released in August), migrating from blocking I/O should be fairly straightforward.
  • libc crate + poll(2) system call: if we wanted to stick with blocking I/O, the poll(2) system call, as invoked through the libc crate, can be used to determine I/O readiness prior to performing a blocking I/O call, and also takes a timeout as an argument. This would probably be the lowest impact way to implement timeouts as it wouldn't involve switching away from blocking I/O.
  • Watchdog: the existing threads run in a loop, which can periodically send a heartbeat (via e.g. a channel) to a watchdog thread. The watchdog can kill and restart threads which appear to be deadlocked. I think this might be a good idea in general, but it seems like the nuclear option and it would probably be good only as a last resort, with one of the more graceful approaches above used as the first line of defense. I'm a bit unclear on what the Rust semantics are when e.g. a thread dies while holding a mutex - it may cause a PoisonError which would corrupt the state of the whole program and cause it to crash.
@tarcieri tarcieri added the enhancement New feature or request label Jul 25, 2019
@mdyring
Copy link
Contributor

mdyring commented Jul 26, 2019

We've seen same behavior in production so would appreciate work on this. :-)

@tarcieri
Copy link
Contributor Author

tarcieri commented Aug 1, 2019

Moving this issue to informalsystems/tendermint-rs#2 since the crate containing the relevant code has been relocated there

@tarcieri tarcieri closed this as completed Aug 1, 2019
@liamsi
Copy link
Contributor

liamsi commented Dec 11, 2019

reopening here as the relevant code was moved back to the kms

@liamsi liamsi reopened this Dec 11, 2019
@tarcieri
Copy link
Contributor Author

Fixed in v0.7.0 (see #364)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants