-
Notifications
You must be signed in to change notification settings - Fork 358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Connection worker #1019
Connection worker #1019
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @cezarad ! Thanks!!!
Co-authored-by: Romain Ruetschi <romain@informal.systems>
// Resume handshake on next iteration. | ||
resume_handshake = true; | ||
} else { | ||
resume_handshake = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can see, in the smooth case that no unexpected errors happen, then once resume_handshake
is set to false, then it is not going to be set to true again ever. So does this means that the branch for WorkerCmd::NewBlock
is going to always be skipped? Otherwise, are we using some subtle error variants to trigger the Err
branch to set resume_handshake
to true again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So does this means that the branch for WorkerCmd::NewBlock is going to always be skipped?
It's going to be executed if step_event()
or step_state()
have not advanced the handshake after a number of tries.
a_chain.clone(), | ||
b_chain.clone(), | ||
event.clone(), | ||
)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing a ?
here short circuits the entire function and returns from the loop prematurely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct.
b_chain.clone(), | ||
self.connection.clone(), | ||
height, | ||
)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing a ?
here short circuits the entire function and returns from the loop prematurely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct.
// Set on start or when event based handshake fails. | ||
let mut resume_handshake = true; | ||
|
||
loop { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens after the full connection handshake has been established? Is the loop supposed to continue forever?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always find complicated control flow inside loops difficult to follow. I understand you are just following the convention before. It would be great if we can use functions to structure the control flows instead. That would make it much easier to understand the control flow. For example something like follows:
pub(crate) fn run(self) -> Result<(), BoxError> {
enum Action {
Continue,
End,
}
fn do_run(worker: &ConnectionWorker) -> Result<Action, BoxError> { ... }
loop {
match do_run(&self) {
Ok(Action::End) => return Ok(()),
Err(e) => {
error!("error: {}. retrying");
// or return Err if unrecoverable
}
_ => {}
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens after the full connection handshake has been established? Is the loop supposed to continue forever?
no, see #1115
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always find complicated control flow inside loops difficult to follow. I understand you are just following the convention before. It would be great if we can use functions to structure the control flows instead.
...
Please open an issue for this and we will address in a separate PR.
loop { | ||
thread::sleep(Duration::from_millis(200)); | ||
|
||
if let Ok(cmd) = self.cmd_rx.try_recv() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If try_recv()
returns TryRecvError::Disconnected
, the sender side of the channel has been dropped and the loop is going to keep failing forever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this is true. All workers have the same pattern. The supervisor should not disconnect unless it has crashed and in this case not much can be done. But we could indeed check this error and exit. I think we discussed with @romac to have a way for the supervisor to gracefully close the worker, not sure this is done. I think the worker has a way to gracefully exit so the supervisor can clean its worker info.
} | ||
|
||
/// Run the event loop for events associated with a [`Connection`]. | ||
pub(crate) fn run(self) -> Result<(), BoxError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some documentation to explain a bit what is the expected execution flow of the function overall? My understanding is as follows:
- Wait for incoming block events for the connection state on the source chain
a_chain
to be updated. - Match the new state machine and update to the next connection state.
- Only cares about the direction from
a_chain
tob_chain
, i.e.(State::Init, State::TryOpen)
is not handled.
- Only cares about the direction from
- If the connection state has reached
(State::Open, State::Open)
, loop forever (should have been return right?) - If any error happens, retry until the connection succeed?
- The retry is in two stages: retry a number of times for each block event update, and retry again in the next block update.
Comments have been captured in #1118 and will be addressed in a follow-up PR
* under construction * under construction * identified connection end * compiles up to supervisor * Update supervisor.rs * compiles * Update counterparty.rs * without fmt and without clippy * fmt and clippy * option for counterparty in conn open try * Update connection.rs * Update connection.rs * Update supervisor.rs * added some telemetry to connection * update supervisor with connection * Update supervisor.rs * on going * conenction updates * Update supervisor.rs * e2e * Update connection.py * Update connection.py * Update connection.py * update e2e * merge * Review comments * Add user2 to CI * Added new files for gaia 4.2.0 to fix CI * Added config path info to e2e script (informalsystems#1019) * Fix output of hermes query connections * Move binding down to declaration site * Update relayer/src/chain/counterparty.rs Co-authored-by: Romain Ruetschi <romain@informal.systems> * Added reporting for the underlying error in counterparty_state() * Better import of IdentifiedConnectionEnd in cosmos.rs * Masked tonic::Code::NotFound result for query_client_connections. * Remove unwraps * Nit: fix comment & log output * Cleanup * Update changelog * Enable handshake completion on CI Co-authored-by: Anca Zamfir <zamfiranca@gmail.com> Co-authored-by: Andy Nogueira <me@andynogueira.dev> Co-authored-by: Romain Ruetschi <romain@informal.systems> Co-authored-by: Adi Seredinschi <adi@informal.systems>
worker for connection
Closes: #821
Description
hermes tx raw conn-try...
and make dst_connection_id and option (used to beConnectionId::default()
when not specified but this is a valid connection id.To setup for the tests:
all
(relay from all events, including connection handshake ones)./scripts/dev-env ~/.hermes/config.toml ibc-0 ibc-1
and thenhermes create client ibc-0 ibc-1
andhermes create client ibc-1 ibc-0
case A:
hermes start
hermes tx raw conn-init ibc-0 ibc-1 07-tendermint-0 07-tendermint-1
Case B:
hermes tx raw conn-init ibc-0 ibc-1 07-tendermint-0 07-tendermint-1
hermes start
Case C:
hermes tx raw conn-init ibc-0 ibc-1 07-tendermint-0 07-tendermint-1
hermes tx raw conn-try ibc-1 ibc-0 07-tendermint-1 07-tendermint-0 -s connection-x
, with connection-x being the ID of the connection returned in step 1 above.start the relayer
hermes start
Case D:
hermes tx raw conn-init ibc-0 ibc-1 07-tendermint-0 07-tendermint-1
hermes tx raw conn-try ibc-1 ibc-0 07-tendermint-1 07-tendermint-0 -s connection-x
,hermes tx raw conn-ack ibc-0 ibc-1 07-tendermint-0 07-tendermint-1 -d connection-x -s connection-y
, with connection-x being the ID of the connection returned in step 1 and connection-y the ID of the connection returned in step 2 abovehermes start
After the last step, wait a couple of seconds for hermes to finish the handshake and verify the connection state on both chains using hermes query connection end ibc-0 connection-x hermes query connection end ibc-0 transfer connection-y. They should both be in open state
TODO:
For contributor use:
docs/
) and code comments.Files changed
in the Github PR explorer.