Make supervisor more resilient to node going down #903

romac · 2021-05-06T07:25:38Z

Follow-up to #895
See also #871

Description

The supervisor should now be more resilient to a node going down temporarily.
Instead of waiting indefinitely for events on the subscription, the supervisor is
now notified when something goes wrong, while the event monitor attempts to
reconnect for a limited time (a maximum number of retries, with a delay between attempts).
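
As an illustration of the bounded reconnection described above, here is a minimal standalone sketch of retrying an operation a limited number of times with a fixed delay. The function name `retry_with_fixed_delay` and its parameters are hypothetical and are not taken from the Hermes codebase; the actual event monitor code is not shown in this PR description.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical helper: calls `connect` until it succeeds or until
/// `max_retries` retries have been used up, sleeping `delay` between
/// attempts. On failure, the error of the last attempt is returned.
fn retry_with_fixed_delay<T, E>(
    max_retries: u32,
    delay: Duration,
    mut connect: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut retries = 0;
    loop {
        match connect() {
            Ok(value) => return Ok(value),
            // Out of retries: give up and surface the last error.
            Err(e) if retries >= max_retries => return Err(e),
            // Otherwise wait a bit and try again.
            Err(_) => {
                retries += 1;
                thread::sleep(delay);
            }
        }
    }
}
```

With `max_retries = 3`, the operation is attempted at most four times (the initial attempt plus three retries).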

Errors yielded by the client and packet workers are now caught at the top-level run loop of the worker
and printed to the console rather than causing the worker to exit. This is still quite brittle
and will need more work and thought for the next milestone.
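
The idea of catching errors at the top of the run loop can be sketched as follows. This is a simplified standalone example, not the worker code from this PR: `WorkerCmd`, `run_worker`, and the `step` and `log_error` callbacks are all hypothetical names introduced for illustration.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical commands a worker might receive.
enum WorkerCmd {
    Process(u64),
    Shutdown,
}

/// Runs until `Shutdown` is received or the channel is closed.
/// An error returned by `step` is passed to `log_error` and the loop
/// moves on to the next command instead of exiting.
/// Returns the number of failed steps.
fn run_worker(
    cmds: Receiver<WorkerCmd>,
    mut step: impl FnMut(u64) -> Result<(), String>,
    mut log_error: impl FnMut(&str),
) -> usize {
    let mut failures = 0;
    for cmd in cmds {
        match cmd {
            WorkerCmd::Shutdown => break,
            WorkerCmd::Process(height) => {
                if let Err(e) = step(height) {
                    // Report the error and keep the worker alive.
                    failures += 1;
                    log_error(&e);
                }
            }
        }
    }
    failures
}
```

In a real relayer the `log_error` callback would typically forward to a logging macro; it is a plain closure here so that the sketch has no external dependencies.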

Tested with

set up the chains, clients, connections, and channels:

❯ ./scripts/dev-env ibc-0 ibc-1 ibc-2
❯ hermes create channel ibc-0 ibc-1 --port-a transfer --port-b transfer -o unordered
❯ hermes create channel ibc-1 ibc-2 --port-a transfer --port-b transfer -o unordered

in a new console, with the debug log level in the config:

❯ hermes start-multi

send some packets if you want:

❯ hermes tx raw ft-transfer ibc-1 ibc-0 transfer channel-0 9999 1000 -n 5
...

kill one of the nodes, e.g. ibc-1:

❯ ps aux | rg gaiad | rg ibc-1 | awk '{ print $2 }' | xargs -I{} kill -9 {}

watch the output of start-multi; it should show some errors about the WebSocket connection being down, and perhaps some RPC queries failing, but it should keep going and keep retrying to connect to the WebSocket.

start the nodes again (this will automatically kill the remaining ones):

❯ ./scripts/dev-env ibc-0 ibc-1 ibc-2
❯ hermes create channel ibc-0 ibc-1 --port-a transfer --port-b transfer -o unordered
❯ hermes create channel ibc-1 ibc-2 --port-a transfer --port-b transfer -o unordered

after a while it should be able to reconnect, and you can send some packets again:

❯ hermes tx raw ft-transfer ibc-1 ibc-0 transfer channel-0 9999 1000 -n 5

For contributor use:

Updated the Unreleased section of CHANGELOG.md with the issue.
If applicable: Unit tests written, added test to CI.
Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
Updated relevant documentation (docs/) and code comments.
Re-reviewed Files changed in the Github PR explorer.


ancazamfir · 2021-05-06T13:42:53Z

I still see an error when starting with 3 chains in config but only two gaia processes. The output is:

May 06 15:22:30.864  INFO ibc_relayer_cli::commands: Using default configuration from: '.hermes/config.toml'
May 06 15:22:30.879 DEBUG ibc_relayer::event::monitor: subscribing to query: tm.event = 'Tx'
May 06 15:22:30.880 DEBUG ibc_relayer::event::monitor: subscribing to query: tm.event = 'NewBlock'
May 06 15:22:30.881 DEBUG ibc_relayer::event::monitor: subscribed to all queries
May 06 15:22:30.881  INFO ibc_relayer::event::monitor: starting event monitor chain.id=ibc-0
May 06 15:22:30.881 TRACE ibc_relayer::registry: spawned chain runtime for chain identifier ibc-0
May 06 15:22:30.885 DEBUG ibc_relayer::event::monitor: subscribing to query: tm.event = 'Tx'
May 06 15:22:30.886 DEBUG ibc_relayer::event::monitor: subscribing to query: tm.event = 'NewBlock'
May 06 15:22:30.887 DEBUG ibc_relayer::event::monitor: subscribed to all queries
May 06 15:22:30.887 TRACE ibc_relayer::registry: spawned chain runtime for chain identifier ibc-1
May 06 15:22:30.887  INFO ibc_relayer::event::monitor: starting event monitor chain.id=ibc-1
Error: RPC error to endpoint http://127.0.0.1:26457/: error trying to connect: tcp connect error: Connection refused (os error 61) (code: 0)

I believe the error comes from here:
https://github.com/informalsystems/ibc-rs/blob/ecbdb07a8bbed69e9cd9f88778a4b50edf0f5567/relayer/src/supervisor.rs#L332-L334

We should ignore the error... maybe create the subscription as part of spawn_workers() below, since we iterate over the chains there anyway.
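
The suggestion above (skip a chain whose subscription fails instead of aborting startup) could look roughly like the sketch below. It is a standalone illustration: `subscribe_all` and the `subscribe` callback are hypothetical names and do not reflect the actual `spawn_workers()` code or the fix in 493191b.

```rust
use std::collections::HashMap;

/// Hypothetical helper: tries to subscribe to every configured chain.
/// Returns the successful subscriptions keyed by chain id, plus the
/// list of (chain id, error) pairs for chains that could not be reached.
fn subscribe_all<S, E>(
    chain_ids: &[&str],
    mut subscribe: impl FnMut(&str) -> Result<S, E>,
) -> (HashMap<String, S>, Vec<(String, E)>) {
    let mut subscriptions = HashMap::new();
    let mut failed = Vec::new();
    for &id in chain_ids {
        match subscribe(id) {
            Ok(sub) => {
                subscriptions.insert(id.to_string(), sub);
            }
            // Record the failure and continue with the remaining chains.
            Err(e) => failed.push((id.to_string(), e)),
        }
    }
    (subscriptions, failed)
}
```

The caller can then log the failed chains and relay only between the chains that are present in the returned map.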

romac · 2021-05-06T13:57:19Z

@ancazamfir Should be fixed in 493191b.


ancazamfir · 2021-05-06T14:21:36Z

Another nice-to-have: if there are 3 chains in the config and only two chains/gaiad nodes are up, then I should see the same behavior regardless of how this state was reached. Right now:

when hermes starts with 2 nodes up and 3 chains, it tries to init the 3rd chain and then relays only for the two chains. It doesn't retry connecting to the third chain.
when hermes starts with 3 nodes and then one is killed, it relays over the two chains but keeps retrying to reconnect to the third.

ancazamfir · 2021-05-06T14:26:00Z

Also, the 'trying to reconnect to WebSocket end...' and 'error when reconnecting' messages are too frequent, especially because they are logged at warn/error level and happen every 5 sec (exponential backoff would be nice here), or maybe we should give up after a while.
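
For reference, a backoff schedule of the kind requested here can be sketched with the standard library alone. The later commits in this PR mention a Fibonacci increment strategy in an `ibc_relayer::util::retry` module, whose code is not shown on this page; the function `fibonacci_backoff` below and its parameters are hypothetical and only illustrate the general technique.

```rust
use std::time::Duration;

/// Hypothetical helper: yields at most `max_retries` delays that grow
/// along the Fibonacci sequence (1, 1, 2, 3, 5, ... times `base`),
/// each capped at `max_delay`.
fn fibonacci_backoff(
    base: Duration,
    max_delay: Duration,
    max_retries: usize,
) -> impl Iterator<Item = Duration> {
    // (current factor, next factor) in the Fibonacci sequence.
    let mut state = (1u32, 1u32);
    std::iter::from_fn(move || {
        let factor = state.0;
        state = (state.1, state.0.saturating_add(state.1));
        Some(base.saturating_mul(factor).min(max_delay))
    })
    .take(max_retries)
}
```

A reconnect loop would sleep for each yielded delay in turn and give up once the iterator is exhausted, which covers both the backoff and the "give up after a while" parts of the comment.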

adizere

I think we should merge this.

Will open a follow-up PR to address Anca's comment regarding the backoff for the worker (error when reconnecting).

romac added 4 commits May 6, 2021 09:23

Restart event monitor when node goes down and propagate errors through the subscription channel

39305e0

Prevent worker from exiting on error

e2c8085

Improve debug output a little

3161ade

Small refactor

31c992a

romac marked this pull request as ready for review May 6, 2021 07:55

romac requested a review from ancazamfir as a code owner May 6, 2021 07:55

romac requested a review from adizere May 6, 2021 07:55

adizere reviewed May 6, 2021

relayer/src/event/monitor.rs (outdated, resolved)

relayer/src/chain/handle.rs (outdated, resolved)

romac added 3 commits May 6, 2021 14:51

Gracefully handle runtime failing to start rather than unwrapping

6534c88

Update changelog

c1d8b34

Merge branch 'master' into romac/restart-ws-client

ecbdb07

romac added 2 commits May 6, 2021 15:48

Replace println! with error!

d75bc22

Do not crash if chain runtime fails to initialize or subscribe to events

493191b

ancazamfir reviewed May 6, 2021

relayer/src/supervisor.rs (outdated, resolved)

Remove trace! for decreased verbosity

747bdd4

romac added 6 commits May 6, 2021 17:10

Merge branch 'master' into romac/restart-ws-client

48371e3

Retry WebSocket connection with exponential backoff

7b094a1

Simplify types

96597c2

Extract retry strategy helpers to ibc_relayer::util::retry module

31fff98

Use fibonacci increment backoff strategy

de4e2b1

Merge branch 'master' into romac/restart-ws-client

5e32c74

adizere approved these changes May 6, 2021

ancazamfir approved these changes May 6, 2021

romac merged commit 20d8fff into master May 6, 2021

adizere mentioned this pull request May 6, 2021

Retry & backoff mechanism in worker loop #913

Merged

10 tasks

romac mentioned this pull request May 14, 2021

Make UniChanPath worker more resilient to node going down #943

Closed

5 tasks

adizere mentioned this pull request Jun 1, 2021

Enable logging in tests and fix error log flooding from mock chain runtime #1017

Merged

5 tasks

ancazamfir deleted the romac/restart-ws-client branch June 1, 2021 08:40

adizere mentioned this pull request Jun 1, 2021

Regression bug: Hermes is unable to re-establish monitor connection after node goes down #1026

Closed

5 tasks

hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022

Make supervisor more resilient to node going down (informalsystems#903)

22659bc
