Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hermes infinite retry clear packet #2155

Closed
5 tasks
lyqingye opened this issue Apr 28, 2022 · 7 comments · Fixed by #2348
Closed
5 tasks

Hermes infinite retry clear packet #2155

lyqingye opened this issue Apr 28, 2022 · 7 comments · Fixed by #2348
Assignees
Labels
A: bug Admin: something isn't working
Milestone

Comments

@lyqingye
Copy link

Summary of Bug

https://github.com/informalsystems/ibc-rs/blob/master/relayer/src/worker/packet.rs#L120

 spawn_background_task(span, Some(Duration::from_millis(200)), move || {
        if let Ok(cmd) = cmd_rx.try_recv() {
            // infinity retry
            retry_with_index(retry_strategy::worker_stubborn_strategy(), |index| {
                handle_packet_cmd(
                    &mut is_first_run,
                    &link.lock().unwrap(),
                    clear_on_start,
                    clear_interval,
                    &path,
                    cmd.clone(),
                    index,
                )
            })
            .map_err(|e| TaskError::Fatal(RunError::retry(e)))?;
        }

        Ok(Next::Continue)
    })

Hermes will always clear the packet at a fixed height. When Hermes cannot build the packet proof at this height (the chain has been prune), Hermes will fall into an endless loop

Version

hermes 0.14.0

Steps to Reproduce

Acceptance Criteria


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@adizere
Copy link
Member

adizere commented Apr 28, 2022

How does this problem manifest more precisely? It is it the case that Hermes stops operating on "live" packets and is stubbornly retrying "old" packets? Or can you explain the issue in more detail?

Note: The infinite retry approach was added as a requested feature from some operators (#1290). I can see how this can be problematic when the chain data was pruned, so thank you for bringing this up!

@lyqingye
Copy link
Author

lyqingye commented Apr 28, 2022

when Hermes executes clean packet, it will take too long in the following cases:

  1. If there is no packet to relay on the long-term chain, it will take a long time to build the update client next time(verify light client to target height)
  2. an error was encountered during the clean packet process
  3. rpc delay

when clean-height + pruning-keep-recent > latest height, the packet of this height will never be cleared. He cannot build the packet proof because the state has been prune

conclusion:
when hermes cannot build proof, hermes should stop retrying, because retrying at this height will never succeed.

@lyqingye
Copy link
Author

lyqingye commented Apr 28, 2022

I think Hermes should improve the performance of clean packet instead of relying on "real-time relay". You can check my suggestions under this issue

@adizere
Copy link
Member

adizere commented Apr 29, 2022

conclusion:
when hermes cannot build proof, hermes should stop retrying, because retrying at this height will never succeed.

This is a very interesting scenario I'd like to reproduce and understand. Is there a way to reproduce it?

I think Hermes should improve the performance of clean packet instead of relying on "real-time relay". You can check my suggestions under this #2087

Yes, we got the same feedback from other operators, and will proceed this way. The next release will include a fix for 2087, including your suggestions.

@lyqingye
Copy link
Author

lyqingye commented Apr 29, 2022

Thank you for your reply!
Next, I will use the simplest way to reproduce this scene!

  1. In order to reproduce this scenario (clean height + pruning keep recent > latest height), we first adjust the startup parameters of GAIA.
    https://github.com/informalsystems/ibc-rs/blob/master/scripts/one-chain#L149
# before
$BINARY --home $CHAIN_DIR/$CHAIN_ID start --pruning=nothing --grpc.address="0.0.0.0:$GRPC_PORT" --log_level error > $CHAIN_DIR/$CHAIN_ID.log 2>&1 &

# after
$BINARY --home $CHAIN_DIR/$CHAIN_ID start --pruning=everything --grpc.address="0.0.0.0:$GRPC_PORT" --log_level error > $CHAIN_DIR/$CHAIN_ID.log 2>&1 &
  1. setup two chains
./scripts/dev-env ~/.hermes/config.toml ibc-0 ibc-1
  1. create channel
# note: this command may not work because the version I am using is not the latest version. The command has been changed in the latest version
hermes create channel --port-a transfer --port-b transfer ibc-0 --chain-b ibc-1 --new-client-connection
  1. start two process to send ibc transfer tx
watch -n 1 hermes tx raw ft-transfer ibc-1 ibc-0 transfer channel-0 1 --key user2 --receiver cosmos1dgy29chm8nmt4aplxuwfl9dvzen8aunn4nq7z9 -n 200 -t 10000

watch -n 1 hermes tx raw ft-transfer ibc-0 ibc-1 transfer channel-0 1 --key user2 --receiver cosmos1dgy29chm8nmt4aplxuwfl9dvzen8aunn4nq7z9 -n 200 -t 10000
  1. start Hermes and wait for the error to occur
hermes start

You will see an error message like this

2022-04-29T10:44:59.255498Z ERROR ThreadId(74) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: will retry: handling command encountered error: failed to construct packet proofs for chain ibc-0: ABCI query returned an error: AbciQuery { code: Err(18), log: Log("proof is unexpectedly empty; ensure height has not been pruned: invalid request"), info: "", index: 0, key: [], value: [], proof: None, height: block::Height(417), codespace: "sdk" } path=packet::channel-0/transfer:ibc-0->ibc-1 retry_index=286
2022-04-29T10:45:02.569352Z ERROR ThreadId(77) packet_cmd{src_chain=ibc-1 src_port=transfer src_channel=channel-0 dst_chain=ibc-0}: will retry: handling command encountered error: failed to construct packet proofs for chain ibc-1: ABCI query returned an error: AbciQuery { code: Err(18), log: Log("proof is unexpectedly empty; ensure height has not been pruned: invalid request"), info: "", index: 0, key: [], value: [], proof: None, height: block::Height(405), codespace: "sdk" } path=packet::channel-0/transfer:ibc-1->ibc-0 retry_index=286
2022-04-29T10:45:03.107963Z ERROR ThreadId(74) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: will retry: handling command encountered error: failed to construct packet proofs for chain ibc-0: ABCI query returned an error: AbciQuery { code: Err(18), log: Log("proof is unexpectedly empty; ensure height has not been pruned: invalid request"), info: "", index: 0, key: [], value: [], proof: None, height: block::Height(417), codespace: "sdk" } path=packet::channel-0/transfer:ibc-0->ibc-1 retry_index=287
2022-04-29T10:45:06.422157Z ERROR ThreadId(77) packet_cmd{src_chain=ibc-1 src_port=transfer src_channel=channel-0 dst_chain=ibc-0}: will retry: handling command encountered error: failed to construct packet proofs for chain ibc-1: ABCI query returned an error: AbciQuery { code: Err(18), log: Log("proof is unexpectedly empty; ensure height has not been pruned: invalid request"), info: "", index: 0, key: [], value: [], proof: None, height: block::Height(405), codespace: "sdk" } path=packet::channel-0/transfer:ibc-1->ibc-0 retry_index=287
2022-04-29T10:45:06.970827Z ERROR ThreadId(74) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: will retry: handling command encountered error: failed to construct packet proofs for chain ibc-0: ABCI query returned an error: AbciQuery { code: Err(18), log: Log("proof is unexpectedly empty; ensure height has not been pruned: invalid request"), info: "", index: 0, key: [], value: [], proof: None, height: block::Height(417), codespace: "sdk" } path=packet::channel-0/transfer:ibc-0->ibc-1 retry_index=288

Additional suggestions:
when Hermes clears the packet, the height should not be used, but the latest height on the chain should be used, so as to avoid pruning the packet state as much as possible

 if should_clear_packets(
                is_first_run,
                clear_on_start,
                link.a_to_b.channel().ordering,
                clear_interval,
                height,
            ) {
               // using latest height to clean packet
                link.a_to_b.schedule_packet_clearing(None)
            } else {
                Ok(())
            }

@lyqingye
Copy link
Author

lyqingye commented May 3, 2022

Currently, Cosmos pruning-keep-recent is set to 100height by default, which means that we only have 100height-times to generate_opdata_from_events, if timeout occurs, the whole batch will fail.
When the build packet proof fails, I think we can catch this exception and ignore the subsequent events, because the subsequent events will fail (the state has been trimmed)

@adizere
Copy link
Member

adizere commented May 12, 2022

I managed to reproduce the error, thanks for the detailed instructions @lyqingye !

I am trying to estimate a solution and we'll try to address this in the next (June) release.

@adizere adizere moved this to Backlog in IBC-rs: the road to v1 May 12, 2022
@adizere adizere added this to the v1.0.0 milestone May 12, 2022
@adizere adizere added the A: bug Admin: something isn't working label May 12, 2022
@adizere adizere moved this from Backlog to Needs PR in IBC-rs: the road to v1 Jun 21, 2022
@adizere adizere moved this to Backlog in IBC-rs: the road to v1 Jun 28, 2022
romac added a commit that referenced this issue Jun 30, 2022
## Description
`spawn_packet_cmd_worker` runs `handle_packet_cmd` passing a "command" that can be `NewBlock`.
The first thing it does is to try clearing packets. If this fails with ignorable error, the old code will retry immediately with same command and height. If the error persists (as in #2155) then the worker enters an infinite loop and new IBC events (coming via `IbcEvent` commands) are never processed.

I believe the infinite loop started with #2238. Before that the worker would abort (due to `TaskError::Fatal(RunError::retry(e))`)

The propose change is to move to the next command even if previous has failed. The reasons are:
- it avoids the infinite loop
- the failed events will still be processed at next clear interval with a fresh height
- at different levels down in `handle_packet_cmd` there are retries mechanisms for MAX_RETRIES (current value hardcoded at 5). 

## Commits

* Do not retry clearing with same height forever

* Undo one-chain script changes

* Add changelog

* Reword changelog entry

* Remove dbg! statement

* Formatting

Co-authored-by: Romain Ruetschi <romain@informal.systems>
Repository owner moved this from Backlog to Closed in IBC-rs: the road to v1 Jun 30, 2022
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this issue Sep 13, 2022
## Description
`spawn_packet_cmd_worker` runs `handle_packet_cmd` passing a "command" that can be `NewBlock`.
The first thing it does is to try clearing packets. If this fails with ignorable error, the old code will retry immediately with same command and height. If the error persists (as in informalsystems#2155) then the worker enters an infinite loop and new IBC events (coming via `IbcEvent` commands) are never processed.

I believe the infinite loop started with informalsystems#2238. Before that the worker would abort (due to `TaskError::Fatal(RunError::retry(e))`)

The propose change is to move to the next command even if previous has failed. The reasons are:
- it avoids the infinite loop
- the failed events will still be processed at next clear interval with a fresh height
- at different levels down in `handle_packet_cmd` there are retries mechanisms for MAX_RETRIES (current value hardcoded at 5). 

## Commits

* Do not retry clearing with same height forever

* Undo one-chain script changes

* Add changelog

* Reword changelog entry

* Remove dbg! statement

* Formatting

Co-authored-by: Romain Ruetschi <romain@informal.systems>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: bug Admin: something isn't working
Projects
No open projects
Status: Closed
Development

Successfully merging a pull request may close this issue.

3 participants