Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection worker #1019

Merged
merged 50 commits into from
Jun 22, 2021
Merged

Connection worker #1019

merged 50 commits into from
Jun 22, 2021

Conversation

cezarad
Copy link
Contributor

@cezarad cezarad commented May 31, 2021

worker for connection

Closes: #821

Description

  • added a worker for connection
  • added query_connections to the chain handle and runtime
  • allow entering the destination connection id in hermes tx raw conn-try... and make dst_connection_id and option (used to be ConnectionId::default() when not specified but this is a valid connection id.
  • added tests to e2e

To setup for the tests:

  • change strategy in config to all (relay from all events, including connection handshake ones)
  • execute start the chains using ./scripts/dev-env ~/.hermes/config.toml ibc-0 ibc-1 and then
  • create clients on ibc-0 and ibc-1 with hermes create client ibc-0 ibc-1 and hermes create client ibc-1 ibc-0

case A:

  1. start the relayer hermes start
  2. send hermes tx raw conn-init ibc-0 ibc-1 07-tendermint-0 07-tendermint-1

Case B:

  1. send hermes tx raw conn-init ibc-0 ibc-1 07-tendermint-0 07-tendermint-1
  2. start the relayer hermes start

Case C:

  1. send hermes tx raw conn-init ibc-0 ibc-1 07-tendermint-0 07-tendermint-1
  2. send hermes tx raw conn-try ibc-1 ibc-0 07-tendermint-1 07-tendermint-0 -s connection-x , with connection-x being the ID of the connection returned in step 1 above.
    start the relayer hermes start

Case D:

  1. send hermes tx raw conn-init ibc-0 ibc-1 07-tendermint-0 07-tendermint-1
  2. sendhermes tx raw conn-try ibc-1 ibc-0 07-tendermint-1 07-tendermint-0 -s connection-x,
  3. send hermes tx raw conn-ack ibc-0 ibc-1 07-tendermint-0 07-tendermint-1 -d connection-x -s connection-y , with connection-x being the ID of the connection returned in step 1 and connection-y the ID of the connection returned in step 2 above
  4. start the relayer hermes start

After the last step, wait a couple of seconds for hermes to finish the handshake and verify the connection state on both chains using hermes query connection end ibc-0 connection-x hermes query connection end ibc-0 transfer connection-y. They should both be in open state

TODO:


For contributor use:

  • Updated the Unreleased section of CHANGELOG.md with the issue.
  • If applicable: Unit tests written, added test to CI.
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Updated relevant documentation (docs/) and code comments.
  • Re-reviewed Files changed in the Github PR explorer.

@romac romac changed the title Connection821 Connection worker Jun 2, 2021
@cezarad cezarad requested a review from ancazamfir June 7, 2021 09:17
@cezarad cezarad marked this pull request as ready for review June 9, 2021 18:42
Copy link
Collaborator

@ancazamfir ancazamfir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @cezarad ! Thanks!!!

relayer/src/supervisor.rs Outdated Show resolved Hide resolved
@romac romac mentioned this pull request Jun 21, 2021
8 tasks
// Resume handshake on next iteration.
resume_handshake = true;
} else {
resume_handshake = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can see, in the smooth case that no unexpected errors happen, then once resume_handshake is set to false, then it is not going to be set to true again ever. So does this means that the branch for WorkerCmd::NewBlock is going to always be skipped? Otherwise, are we using some subtle error variants to trigger the Err branch to set resume_handshake to true again?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does this means that the branch for WorkerCmd::NewBlock is going to always be skipped?

It's going to be executed if step_event() or step_state() have not advanced the handshake after a number of tries.

a_chain.clone(),
b_chain.clone(),
event.clone(),
)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing a ? here short circuits the entire function and returns from the loop prematurely.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct.

b_chain.clone(),
self.connection.clone(),
height,
)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing a ? here short circuits the entire function and returns from the loop prematurely.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct.

// Set on start or when event based handshake fails.
let mut resume_handshake = true;

loop {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens after the full connection handshake has been established? Is the loop supposed to continue forever?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always find complicated control flow inside loops difficult to follow. I understand you are just following the convention before. It would be great if we can use functions to structure the control flows instead. That would make it much easier to understand the control flow. For example something like follows:

    pub(crate) fn run(self) -> Result<(), BoxError> {
        enum Action {
            Continue,
            End,
        }

        fn do_run(worker: &ConnectionWorker) -> Result<Action, BoxError> { ... }

        loop {
            match do_run(&self) {
                Ok(Action::End) => return Ok(()),
                Err(e) => {
                    error!("error: {}. retrying");
                    // or return Err if unrecoverable
                }
                _ => {}
            }
        }
  }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens after the full connection handshake has been established? Is the loop supposed to continue forever?

no, see #1115

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always find complicated control flow inside loops difficult to follow. I understand you are just following the convention before. It would be great if we can use functions to structure the control flows instead.
...

Please open an issue for this and we will address in a separate PR.

loop {
thread::sleep(Duration::from_millis(200));

if let Ok(cmd) = self.cmd_rx.try_recv() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If try_recv() returns TryRecvError::Disconnected, the sender side of the channel has been dropped and the loop is going to keep failing forever.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this is true. All workers have the same pattern. The supervisor should not disconnect unless it has crashed and in this case not much can be done. But we could indeed check this error and exit. I think we discussed with @romac to have a way for the supervisor to gracefully close the worker, not sure this is done. I think the worker has a way to gracefully exit so the supervisor can clean its worker info.

}

/// Run the event loop for events associated with a [`Connection`].
pub(crate) fn run(self) -> Result<(), BoxError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some documentation to explain a bit what is the expected execution flow of the function overall? My understanding is as follows:

  • Wait for incoming block events for the connection state on the source chain a_chain to be updated.
  • Match the new state machine and update to the next connection state.
    • Only cares about the direction from a_chain to b_chain, i.e. (State::Init, State::TryOpen) is not handled.
  • If the connection state has reached (State::Open, State::Open), loop forever (should have been return right?)
  • If any error happens, retry until the connection succeed?
    • The retry is in two stages: retry a number of times for each block event update, and retry again in the next block update.

@romac romac mentioned this pull request Jun 22, 2021
11 tasks
@romac romac dismissed soareschen’s stale review June 22, 2021 14:18

Comments have been captured in #1118 and will be addressed in a follow-up PR

@romac romac merged commit 9d3ffdb into master Jun 22, 2021
@romac romac deleted the connection821 branch June 22, 2021 14:49
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
* under construction

* under construction

* identified connection end

* compiles up to supervisor

* Update supervisor.rs

* compiles

* Update counterparty.rs

* without fmt and without clippy

* fmt and clippy

* option for counterparty in conn open try

* Update connection.rs

* Update connection.rs

* Update supervisor.rs

* added some telemetry to connection

* update supervisor with connection

* Update supervisor.rs

* on going

* conenction updates

* Update supervisor.rs

* e2e

* Update connection.py

* Update connection.py

* Update connection.py

* update e2e

* merge

* Review comments

* Add user2 to CI

* Added new files for gaia 4.2.0 to fix CI

* Added config path info to e2e script (informalsystems#1019)

* Fix output of hermes query connections

* Move binding down to declaration site

* Update relayer/src/chain/counterparty.rs

Co-authored-by: Romain Ruetschi <romain@informal.systems>

* Added reporting for the underlying error in counterparty_state()

* Better import of IdentifiedConnectionEnd in cosmos.rs

* Masked tonic::Code::NotFound result for query_client_connections.

* Remove unwraps

* Nit: fix comment & log output

* Cleanup

* Update changelog

* Enable handshake completion on CI

Co-authored-by: Anca Zamfir <zamfiranca@gmail.com>
Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Co-authored-by: Romain Ruetschi <romain@informal.systems>
Co-authored-by: Adi Seredinschi <adi@informal.systems>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add connection worker to the reactive relayer
6 participants