
clear_interval parameter is not working as expected #2445

Closed · achilleas-kal opened this issue Jul 20, 2022 · 13 comments

@achilleas-kal (Contributor)

Summary of Bug

clear_interval is set to 30, but packets are not being cleared automatically; I have to manually clear packets with the hermes clear packets command every once in a while.

Version

hermes 0.13.0+2375a069


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@adizere (Member) commented Jul 21, 2022

Can you share your [mode.packets] section of your config.toml?

@romac (Member) commented Jul 21, 2022

Assuming the config is fine, we will need to see the logs in order to diagnose the issue.

@achilleas-kal (Contributor, Author)

[mode.packets]
enabled = true
clear_interval = 30
clear_on_start = true
tx_confirmation = false

Sure, here's the config @adizere

Maybe tx_confirmation should be set to true for this feature to work properly?

@adizere (Member) commented Jul 25, 2022

Maybe tx_confirmation should be set to true for this feature to work properly?

Hmm, no. Periodic packet clearing should trigger every clear_interval blocks independently of tx_confirmation, but this is a good lead to investigate whether we have a bug.

Some situations where periodic packet clearing does not work are:

  • the packet workers for the channel you want to see cleared did not start at all (Hermes did not get any WS events associated with that channel since it started)
  • the channels are filtered out via policy based packet filtering

Do you think any of the two cases above might be the root cause? Are you seeing any activity for that channel in Hermes logs or metrics?

@achilleas-kal (Contributor, Author)

the packet workers for the channel you want to see cleared did not start at all (Hermes did not get any WS events associated with that channel since it started)

Well I've been making transfers while that packet was stuck so I'd say that's not the case.

the channels are filtered out via policy based packet filtering

The channel is definitely not filtered; if it were, I wouldn't be able to relay packets at all, plus the explicit clear command wouldn't work, right? I've also checked manually, and the channel is allowed.

@adizere (Member) commented Jul 25, 2022

Thanks for the quick feedback and checking on those things!
This means we need to dig into the code and try to recreate this somehow. Any insights on what might help reproduce this behavior would be great.

the explicit clear command wouldn't work, right?

The explicit clear CLI ignores the packet filtering specified in config.toml, so it would work.

@achilleas-kal (Contributor, Author)

@adizere I can try to replicate a stuck packet on my end and dig into the logs; any insight on what I'm looking for would help.

@adizere (Member) commented Jul 27, 2022

thanks!

any insight on what I'm looking for would help.

Here are some things I would do if I had access to your setup:

  1. start Hermes

  2. run $ kill -USR1 $HERMES_PID (a scripted version of this step is sketched after this comment)

    • that should dump the full details of which workers are currently running in Hermes, for example:
    • 2022-07-27T14:46:32.679956Z INFO ThreadId(38) dumping state (triggered by SIGUSR1)
      2022-07-27T14:46:32.693740Z INFO ThreadId(1114)
      2022-07-27T14:46:32.693880Z INFO ThreadId(1114) * Chains: ibc-0, ibc-1
      2022-07-27T14:46:32.693896Z INFO ThreadId(1114) * Client workers:
      2022-07-27T14:46:32.693908Z INFO ThreadId(1114) - client::ibc-0->ibc-1:07-tendermint-1 (id: 3)
      2022-07-27T14:46:32.693920Z INFO ThreadId(1114) | misbehaviour: true, refresh: true
      2022-07-27T14:46:32.693933Z INFO ThreadId(1114) - client::ibc-1->ibc-0:07-tendermint-1 (id: 1)
      2022-07-27T14:46:32.693992Z INFO ThreadId(1114) | misbehaviour: true, refresh: true
      2022-07-27T14:46:32.694011Z INFO ThreadId(1114) * Packet workers:
      2022-07-27T14:46:32.694025Z INFO ThreadId(1114) - packet::channel-3/transfer:ibc-0->ibc-1 (id: 5)
      2022-07-27T14:46:32.694036Z INFO ThreadId(1114) - packet::channel-4/transfer:ibc-1->ibc-0 (id: 6)
      2022-07-27T14:46:32.694048Z INFO ThreadId(1114) * Wallet workers:
      2022-07-27T14:46:32.694059Z INFO ThreadId(1114) - wallet::ibc-0 (id: 2)
      2022-07-27T14:46:32.694070Z INFO ThreadId(1114) - wallet::ibc-1 (id: 4)
      2022-07-27T14:46:32.694083Z INFO ThreadId(1114)

    • this allows us to understand whether Hermes is actually active on the problematic channel
  3. It would be useful to know on which channel packets are stuck and which two networks are at the ends of that channel

    • any logs showing activity on that channel would be of help
  4. The full log of clear packets might be useful

Generally, if you open up all the logs from hermes start and point out the channel or packet sequence numbers to us, we can likely find some interesting things.
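
A minimal Python sketch of step 2 above, for reference. It assumes a single local Hermes process discoverable with pgrep (an assumption about the setup); the SIGUSR1 signal is the same one used in the kill command above to trigger the state dump.

import os
import signal
import subprocess

# Locate the PID of the running Hermes process. Assumes a single local
# instance whose process name is "hermes"; adjust for your own setup.
pid = int(subprocess.check_output(["pgrep", "-n", "hermes"]).strip())

# Ask Hermes to dump its internal worker state to its log output,
# equivalent to: kill -USR1 $HERMES_PID
os.kill(pid, signal.SIGUSR1)
print(f"Sent SIGUSR1 to hermes (pid {pid}); check the Hermes logs for the worker dump.")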

@achilleas-kal (Contributor, Author)

Hey @adizere, apologies for the delay; I just managed to look into this.

  1. Stuck packet: https://gist.github.com/achilleas-kal/3255482d5dd4effad48714ba88d3b85d (between Persistence and Injective, channel-41 and channel-82 respectively).
  2. Full dump logs of the last few minutes: https://gist.github.com/achilleas-kal/fda02c9b36b8b9c6ec88da2bf666a6f5
  3. Interestingly, the packet was cleared after killing the pod (I assume through clear_on_start, which relates to your issue here: Confusion regarding Hermes clear_interval feature #2462), so I didn't manage to get the clear packets log.
  4. I found something interesting in the full logs though:

2022-08-01T07:17:30.564645Z WARN ThreadId(41) refresh{client=07-tendermint-57 src_chain=injective-1 dst_chain=core-1}: task encountered ignorable error: error raised while trying to refresh client 07-tendermint-57: failed querying client state on dst chain: RPC error to endpoint https://tm.persistence.injective.network/: serde parse error: expected value at line 1 column 1

I got the serde parse error again, but this time there is no reference to a data field as we show in #2444, which also makes sense because the RPC in question (from Persistence) has both version and data fields exposed on the ABCI query. Something that might be worth looking at here.

  5. Some logs from the workers (filtered upon hermes start): https://gist.github.com/achilleas-kal/c5646bde8c140b2509e83b989030c37c

Hope this helps. By the way, many of our relayers on Injective have mentioned issues with stuck packets where they had to take manual action with hermes clear packets across a number of chains, so it's not an isolated issue.
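
As a side note on the serde error above: "expected value at line 1 column 1" usually means the response body was not JSON at all (for example an empty body or an HTML error page from a proxy in front of the node). One way to check is to hit the same endpoint and inspect the raw body; below is a minimal Python sketch, where the choice of the status RPC method and the probing approach are assumptions for illustration, not something proposed in this thread.

import json
import urllib.request

# RPC endpoint taken from the error message in the logs above.
URL = "https://tm.persistence.injective.network/"

# Issue a simple JSON-RPC request (the `status` method) and keep the raw
# body, so a non-JSON response (empty body, HTML error page from a load
# balancer, etc.) becomes visible instead of a bare parse error.
payload = json.dumps(
    {"jsonrpc": "2.0", "id": 1, "method": "status", "params": {}}
).encode()
req = urllib.request.Request(
    URL, data=payload, headers={"Content-Type": "application/json"}
)
with urllib.request.urlopen(req, timeout=10) as resp:
    body = resp.read().decode("utf-8", errors="replace")

try:
    json.loads(body)
    print("response is valid JSON")
except json.JSONDecodeError as exc:
    print(f"response is NOT valid JSON ({exc}); first 200 chars:\n{body[:200]}")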

@romac (Member) commented Aug 3, 2022

4. I found something interesting in the full logs though:

2022-08-01T07:17:30.564645Z WARN ThreadId(41) refresh{client=07-tendermint-57 src_chain=injective-1 dst_chain=core-1}: task encountered ignorable error: error raised while trying to refresh client 07-tendermint-57: failed querying client state on dst chain: RPC error to endpoint https://tm.persistence.injective.network/: serde parse error: expected value at line 1 column 1

I got the serde parse error again, but this time there is no reference to a data field as we show in #2444, which also makes sense because the RPC in question (from Persistence) has both version and data fields exposed on the ABCI query. Something that might be worth looking at here.

This is a different query than in #2444; it is an ABCI query for the path clients/07-tendermint-57/clientState, issued by the client refresh worker. If you can figure out from the logs at which height the query was issued, then you could perhaps try running this query manually?

@romac (Member) commented Aug 3, 2022

This seems to work as expected:

❯ curl https://tm.persistence.injective.network -d '{
  "jsonrpc": "2.0",
  "id": "d18e80f5-e2fc-4381-aa2c-996b97116722",
  "method": "abci_query",
  "params": {
    "path": "store/ibc/key",
    "data": "636C69656E74732F30372D74656E6465726D696E742D35372F636C69656E745374617465",
    "height": "0",
    "prove": false
  }
}'

Note 1: the value of params.data above is the hex-encoding of clients/07-tendermint-57/clientState

{
  "jsonrpc": "2.0",
  "id": "d18e80f5-e2fc-4381-aa2c-996b97116722",
  "result": {
    "response": {
      "code": 0,
      "log": "",
      "info": "",
      "index": "0",
      "key": "Y2xpZW50cy8wNy10ZW5kZXJtaW50LTU3L2NsaWVudFN0YXRl",
      "value": "CisvaWJjLmxpZ2h0Y2xpZW50cy50ZW5kZXJtaW50LnYxLkNsaWVudFN0YXRlEoMBCgtpbmplY3RpdmUtMRIECAEQAxoECIDqSSIECIDfbioCCCgyADoHCAEQjLG6BkIZCgkIARgBIAEqAQASDAoCAAEQIRgEIAwwAUIZCgkIARgBIAEqAQASDAoCAAEQIBgBIAEwAUoHdXBncmFkZUoQdXBncmFkZWRJQkNTdGF0ZVABWAE=",
      "proofOps": null,
      "height": "7284159",
      "codespace": ""
    }
  }
}

Same if we specify the height explicitly:

❯ curl https://tm.persistence.injective.network -d '{
  "jsonrpc": "2.0",
  "id": "d18e80f5-e2fc-4381-aa2c-996b97116722",
  "method": "abci_query",
  "params": {
    "path": "store/ibc/key",
    "data": "636C69656E74732F30372D74656E6465726D696E742D35372F636C69656E745374617465",
    "height": "7250081",
    "prove": false
  }
}'
{
  "jsonrpc": "2.0",
  "id": "d18e80f5-e2fc-4381-aa2c-996b97116722",
  "result": {
    "response": {
      "code": 0,
      "log": "",
      "info": "",
      "index": "0",
      "key": "Y2xpZW50cy8wNy10ZW5kZXJtaW50LTU3L2NsaWVudFN0YXRl",
      "value": "CisvaWJjLmxpZ2h0Y2xpZW50cy50ZW5kZXJtaW50LnYxLkNsaWVudFN0YXRlEoMBCgtpbmplY3RpdmUtMRIECAEQAxoECIDqSSIECIDfbioCCCgyADoHCAEQjLG6BkIZCgkIARgBIAEqAQASDAoCAAEQIRgEIAwwAUIZCgkIARgBIAEqAQASDAoCAAEQIBgBIAEwAUoHdXBncmFkZUoQdXBncmFkZWRJQkNTdGF0ZVABWAE=",
      "proofOps": null,
      "height": "7250081",
      "codespace": ""
    }
  }
}

Note 2: here I am using the height before the first NewBlock events found in the logs above: 7250082 - 1 = 7250081
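
The same manual check can also be scripted. Below is a minimal Python sketch of the abci_query call above, showing how the hex-encoded params.data is derived from the store path; the script itself is only an illustration of the curl calls shown here, not part of Hermes.

import json
import urllib.request

RPC = "https://tm.persistence.injective.network"

# Hex-encode the IBC store path, as explained in Note 1 above.
path = "clients/07-tendermint-57/clientState"
data_hex = path.encode("ascii").hex().upper()

payload = json.dumps({
    "jsonrpc": "2.0",
    "id": "1",
    "method": "abci_query",
    "params": {
        "path": "store/ibc/key",
        "data": data_hex,
        "height": "0",   # "0" = latest; use e.g. "7250081" to pin a height
        "prove": False,
    },
}).encode()

req = urllib.request.Request(RPC, data=payload, headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req, timeout=10) as resp:
    response = json.loads(resp.read())["result"]["response"]

# `value` is the base64-encoded, protobuf-encoded client state.
print("height:", response["height"])
print("value (base64):", response["value"][:60] + "...")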

@romac (Member) commented Aug 3, 2022

I also tried setting up the injective-1 chain (I just specified the RPC address as ``), and hermes query client state gives me the expected result:

❯ RUST_LOG=tendermint_rpc=debug hermes query client state --chain injective-1 --client 07-tendermint-57
2022-08-03T15:40:06.639469Z DEBUG ThreadId(01) Incoming response: {
  "jsonrpc": "2.0",
  "id": "7c28b0ef-f6ef-49f2-935e-a323ade953af",
  "result": {
    "node_info": {
      "protocol_version": {
        "p2p": "8",
        "block": "11",
        "app": "0"
      },
      "id": "e0e1dd9d329d60c43b642f77dbce7199068fb2fa",
      "listen_addr": "tcp://65.0.93.49:26656",
      "network": "core-1",
      "version": "v0.34.14",
      "channels": "40202122233038606100",
      "moniker": "node",
      "other": {
        "tx_index": "on",
        "rpc_address": "tcp://0.0.0.0:26657"
      }
    },
    "sync_info": {
      "latest_block_hash": "E056FE4E05BC7D1CD5A9FB4F186ABFEA64A326E7B9CC106FDD285474431CF3FD",
      "latest_app_hash": "636C3884307740CF094C3042C0F157B82A5F8B9BDDFB3A004BF709F308A6F103",
      "latest_block_height": "7284306",
      "latest_block_time": "2022-08-03T15:39:58.404070601Z",
      "earliest_block_hash": "D9673C1D0DF6ACADCA29501A7EBC0D3FE6FBC7116FF2774D583C6625F07371F5",
      "earliest_app_hash": "E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855",
      "earliest_block_height": "1",
      "earliest_block_time": "2021-03-30T12:08:46.515027422Z",
      "catching_up": false
    },
    "validator_info": {
      "address": "059BD993FB36CC110108520614C3569AB913F2B2",
      "pub_key": {
        "type": "tendermint/PubKeyEd25519",
        "value": "lcjeCEqOE4ysLdoT+BWxLLZAOCtwgW10LMaVqgzQCA0="
      },
      "voting_power": "0"
    }
  }
}
2022-08-03T15:40:07.259676Z DEBUG ThreadId(15) Incoming response: {
  "jsonrpc": "2.0",
  "id": "e546034f-d93b-4e14-a818-936405e51f18",
  "result": {
    "response": {
      "code": 0,
      "log": "",
      "info": "",
      "index": "0",
      "key": "Y2xpZW50cy8wNy10ZW5kZXJtaW50LTU3L2NsaWVudFN0YXRl",
      "value": "CisvaWJjLmxpZ2h0Y2xpZW50cy50ZW5kZXJtaW50LnYxLkNsaWVudFN0YXRlEoMBCgtpbmplY3RpdmUtMRIECAEQAxoECIDqSSIECIDfbioCCCgyADoHCAEQjLG6BkIZCgkIARgBIAEqAQASDAoCAAEQIRgEIAwwAUIZCgkIARgBIAEqAQASDAoCAAEQIBgBIAEwAUoHdXBncmFkZUoQdXBncmFkZWRJQkNTdGF0ZVABWAE=",
      "proofOps": null,
      "height": "7284306",
      "codespace": ""
    }
  }
}
SUCCESS Tendermint(
    ClientState {
        chain_id: ChainId {
            id: "injective-1",
            version: 1,
        },
        trust_level: TrustThreshold {
            numerator: 1,
            denominator: 3,
        },
        trusting_period: 1209600s,
        unbonding_period: 1814400s,
        max_clock_drift: 40s,
        latest_height: Height {
            revision: 1,
            height: 13539468,
        },
        proof_specs: ProofSpecs(
            [
                ProofSpec(
                    ProofSpec {
                        leaf_spec: Some(
                            LeafOp {
                                hash: Sha256,
                                prehash_key: NoHash,
                                prehash_value: Sha256,
                                length: VarProto,
                                prefix: [
                                    0,
                                ],
                            },
                        ),
                        inner_spec: Some(
                            InnerSpec {
                                child_order: [
                                    0,
                                    1,
                                ],
                                child_size: 33,
                                min_prefix_length: 4,
                                max_prefix_length: 12,
                                empty_child: [],
                                hash: Sha256,
                            },
                        ),
                        max_depth: 0,
                        min_depth: 0,
                    },
                ),
                ProofSpec(
                    ProofSpec {
                        leaf_spec: Some(
                            LeafOp {
                                hash: Sha256,
                                prehash_key: NoHash,
                                prehash_value: Sha256,
                                length: VarProto,
                                prefix: [
                                    0,
                                ],
                            },
                        ),
                        inner_spec: Some(
                            InnerSpec {
                                child_order: [
                                    0,
                                    1,
                                ],
                                child_size: 32,
                                min_prefix_length: 1,
                                max_prefix_length: 1,
                                empty_child: [],
                                hash: Sha256,
                            },
                        ),
                        max_depth: 0,
                        min_depth: 0,
                    },
                ),
            ],
        ),
        upgrade_path: [
            "upgrade",
            "upgradedIBCState",
        ],
        allow_update: AllowUpdate {
            after_expiry: true,
            after_misbehaviour: true,
        },
        frozen_height: None,
    },
)

Note: I set RUST_LOG=tendermint_rpc=debug in the command above to inspect the raw JSON response of the command.

@adizere (Member) commented Sep 30, 2022

Closing. Confirmed with Achilleas that after upgrading to v1 the problem is no longer observed.

adizere closed this as completed Sep 30, 2022
adizere added this to the v1.1 milestone Sep 30, 2022