Hermes infinite retry clear packet #2155
How does this problem manifest more precisely? Is it the case that Hermes stops operating on "live" packets and is stubbornly retrying "old" packets? Or can you explain the issue in more detail? Note: The infinite retry approach was added as a requested feature from some operators (#1290). I can see how this can be problematic when the chain data was pruned, so thank you for bringing this up!
When Hermes clears packets, it takes too long in the following case:
When clean-height + pruning-keep-recent < latest height, the packets at this height will never be cleared. Hermes cannot build the packet proof because the state at that height has been pruned.
Conclusion: I think Hermes should improve the performance of packet clearing instead of relying on "real-time relay". You can check my suggestions under this issue.
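The pruning condition can be illustrated with a small check. This is a minimal sketch, not Hermes code: `is_state_pruned` and its parameter names are hypothetical, and it assumes the node retains state only for the most recent `keep_recent` heights (it ignores any periodic snapshots that other pruning strategies may keep).

```rust
/// Returns true when the state at `clear_height` is assumed to be pruned.
/// Hypothetical helper: assumes the node keeps only the last `keep_recent` heights.
fn is_state_pruned(clear_height: u64, keep_recent: u64, latest_height: u64) -> bool {
    // saturating_add avoids overflow on pathological inputs.
    clear_height.saturating_add(keep_recent) < latest_height
}
```

For example, `is_state_pruned(417, 100, 1000)` is true under these assumptions, so a proof query at height 417 would keep failing no matter how often it is retried.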
This is a very interesting scenario I'd like to reproduce and understand. Is there a way to reproduce it?
Yes, we got the same feedback from other operators, and will proceed this way. The next release will include a fix for 2087, including your suggestions.
Thank you for your reply!
# before
$BINARY --home $CHAIN_DIR/$CHAIN_ID start --pruning=nothing --grpc.address="0.0.0.0:$GRPC_PORT" --log_level error > $CHAIN_DIR/$CHAIN_ID.log 2>&1 &
# after
$BINARY --home $CHAIN_DIR/$CHAIN_ID start --pruning=everything --grpc.address="0.0.0.0:$GRPC_PORT" --log_level error > $CHAIN_DIR/$CHAIN_ID.log 2>&1 &
./scripts/dev-env ~/.hermes/config.toml ibc-0 ibc-1
# note: this command may not work because the version I am using is not the latest version. The command has been changed in the latest version
hermes create channel --port-a transfer --port-b transfer ibc-0 --chain-b ibc-1 --new-client-connection
watch -n 1 hermes tx raw ft-transfer ibc-1 ibc-0 transfer channel-0 1 --key user2 --receiver cosmos1dgy29chm8nmt4aplxuwfl9dvzen8aunn4nq7z9 -n 200 -t 10000
watch -n 1 hermes tx raw ft-transfer ibc-0 ibc-1 transfer channel-0 1 --key user2 --receiver cosmos1dgy29chm8nmt4aplxuwfl9dvzen8aunn4nq7z9 -n 200 -t 10000
hermes start
You will see error messages like this:
2022-04-29T10:44:59.255498Z ERROR ThreadId(74) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: will retry: handling command encountered error: failed to construct packet proofs for chain ibc-0: ABCI query returned an error: AbciQuery { code: Err(18), log: Log("proof is unexpectedly empty; ensure height has not been pruned: invalid request"), info: "", index: 0, key: [], value: [], proof: None, height: block::Height(417), codespace: "sdk" } path=packet::channel-0/transfer:ibc-0->ibc-1 retry_index=286
2022-04-29T10:45:02.569352Z ERROR ThreadId(77) packet_cmd{src_chain=ibc-1 src_port=transfer src_channel=channel-0 dst_chain=ibc-0}: will retry: handling command encountered error: failed to construct packet proofs for chain ibc-1: ABCI query returned an error: AbciQuery { code: Err(18), log: Log("proof is unexpectedly empty; ensure height has not been pruned: invalid request"), info: "", index: 0, key: [], value: [], proof: None, height: block::Height(405), codespace: "sdk" } path=packet::channel-0/transfer:ibc-1->ibc-0 retry_index=286
2022-04-29T10:45:03.107963Z ERROR ThreadId(74) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: will retry: handling command encountered error: failed to construct packet proofs for chain ibc-0: ABCI query returned an error: AbciQuery { code: Err(18), log: Log("proof is unexpectedly empty; ensure height has not been pruned: invalid request"), info: "", index: 0, key: [], value: [], proof: None, height: block::Height(417), codespace: "sdk" } path=packet::channel-0/transfer:ibc-0->ibc-1 retry_index=287
2022-04-29T10:45:06.422157Z ERROR ThreadId(77) packet_cmd{src_chain=ibc-1 src_port=transfer src_channel=channel-0 dst_chain=ibc-0}: will retry: handling command encountered error: failed to construct packet proofs for chain ibc-1: ABCI query returned an error: AbciQuery { code: Err(18), log: Log("proof is unexpectedly empty; ensure height has not been pruned: invalid request"), info: "", index: 0, key: [], value: [], proof: None, height: block::Height(405), codespace: "sdk" } path=packet::channel-0/transfer:ibc-1->ibc-0 retry_index=287
2022-04-29T10:45:06.970827Z ERROR ThreadId(74) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: will retry: handling command encountered error: failed to construct packet proofs for chain ibc-0: ABCI query returned an error: AbciQuery { code: Err(18), log: Log("proof is unexpectedly empty; ensure height has not been pruned: invalid request"), info: "", index: 0, key: [], value: [], proof: None, height: block::Height(417), codespace: "sdk" } path=packet::channel-0/transfer:ibc-0->ibc-1 retry_index=288
Additional suggestions:
if should_clear_packets(
    is_first_run,
    clear_on_start,
    link.a_to_b.channel().ordering,
    clear_interval,
    height,
) {
    // using latest height to clean packet
    link.a_to_b.schedule_packet_clearing(None)
} else {
    Ok(())
}
Currently, Cosmos pruning-keep-recent is set to 100 heights by default, which means we have only about 100 blocks' worth of time to run generate_opdata_from_events. If a timeout occurs within that window, the whole batch fails.
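To make the size of that window concrete, here is a rough back-of-the-envelope sketch. `clearing_time_budget` is a hypothetical helper, and the average block time is an assumed input that varies by chain; neither comes from Hermes.

```rust
use std::time::Duration;

/// Rough wall-clock budget before state at the clearing height is pruned.
/// Hypothetical helper: `keep_recent` is the number of retained heights and
/// `avg_block_time` is an assumed, chain-specific average.
fn clearing_time_budget(keep_recent: u32, avg_block_time: Duration) -> Duration {
    // saturating_mul caps at Duration::MAX instead of panicking on overflow.
    avg_block_time.saturating_mul(keep_recent)
}
```

With 100 retained heights and an assumed 6-second block time, the budget is 600 seconds; with `pruning=everything`, as in the reproduction steps, it is far smaller.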
I managed to reproduce the error, thanks for the detailed instructions @lyqingye! I am trying to estimate a solution and we'll try to address this in the next (June) release.
## Description
`spawn_packet_cmd_worker` runs `handle_packet_cmd`, passing a "command" that can be `NewBlock`. The first thing it does is try to clear packets. If this fails with an ignorable error, the old code retries immediately with the same command and height. If the error persists (as in #2155), the worker enters an infinite loop and new IBC events (coming via `IbcEvent` commands) are never processed.
I believe the infinite loop started with #2238. Before that, the worker would abort (due to `TaskError::Fatal(RunError::retry(e))`).
The proposed change is to move to the next command even if the previous one has failed. The reasons are:
- it avoids the infinite loop
- the failed events will still be processed at the next clear interval with a fresh height
- at different levels down in `handle_packet_cmd` there are retry mechanisms bounded by MAX_RETRIES (current value hardcoded at 5).
## Commits
* Do not retry clearing with same height forever
* Undo one-chain script changes
* Add changelog
* Reword changelog entry
* Remove dbg! statement
* Formatting
Co-authored-by: Romain Ruetschi <romain@informal.systems>
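The control flow described in this PR can be sketched in isolation. The code below is an illustrative model, not the actual Hermes worker: `Command`, `run_worker`, and the closure-based handler are hypothetical, and the bounded retry is folded into the loop for brevity, whereas the PR describes retries happening at lower levels inside `handle_packet_cmd`.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the commands a packet worker receives.
#[derive(Debug, Clone, PartialEq)]
enum Command {
    NewBlock { height: u64 },
    IbcEvent { id: u32 },
}

/// Mirrors the hardcoded value of 5 mentioned in the PR description.
const MAX_RETRIES: usize = 5;

/// Drains the queue, trying each command at most MAX_RETRIES times.
/// A command that keeps failing is recorded and skipped, so later
/// commands are never starved by a persistent error.
fn run_worker<F>(mut queue: VecDeque<Command>, mut handle: F) -> Vec<Command>
where
    F: FnMut(&Command) -> Result<(), String>,
{
    let mut failed = Vec::new();
    while let Some(cmd) = queue.pop_front() {
        // `any` stops at the first successful attempt.
        let succeeded = (0..MAX_RETRIES).any(|_| handle(&cmd).is_ok());
        if !succeeded {
            // Move on; a later clear interval can retry with a fresh height.
            failed.push(cmd);
        }
    }
    failed
}
```

In this model, a `NewBlock` command whose clearing always fails (for example, because the height was pruned) ends up in the returned list, while an `IbcEvent` queued behind it is still handled.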
Summary of Bug
https://github.com/informalsystems/ibc-rs/blob/master/relayer/src/worker/packet.rs#L120
Hermes always clears packets at a fixed height. When Hermes cannot build the packet proof at this height (because the chain state has been pruned), it falls into an endless loop.
Version
hermes 0.14.0
Steps to Reproduce
Acceptance Criteria
For Admin Use