
Hermes caching layer MVP #1908

Closed
4 of 6 tasks
adizere opened this issue Feb 22, 2022 · 4 comments · Fixed by #1932
Labels
I: logic (Internal: related to the relaying logic), O: usability (Objective: cause to improve the user experience (UX) and ease using the product)
Milestone
v0.12.1

Comments

@adizere
Member

adizere commented Feb 22, 2022

Crate

ibc-relayer

Summary

Add general support for a caching layer in Hermes.

Problem Definition

The motivation for the caching layer is to reduce pressure on full nodes.

Proposal & Acceptance Criteria

  • Hermes should cache the static data it queries from a chain and uses most frequently (a minimal sketch follows this list)
    • using the profiling feature and the telemetry data may help track down the most frequent queries
    • note that there is already a minimal cache implemented, which stores connection, channel, and client identifier data used to filter by client trust threshold
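
A minimal sketch of the TTL-based approach this proposes, assuming a hand-rolled map keyed by query identifier; the `TtlCache` type and `get_or_try_insert_with` method are illustrative names, not the actual ibc-relayer implementation (which landed via #1932 and may be structured differently):

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::time::{Duration, Instant};

/// A minimal TTL cache: entries are served from memory until they expire,
/// after which the caller falls back to querying the full node again.
struct TtlCache<K, V> {
    ttl: Duration,
    entries: HashMap<K, (Instant, V)>,
}

impl<K: Eq + Hash, V: Clone> TtlCache<K, V> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Return the cached value if it has not yet expired.
    fn get(&self, key: &K) -> Option<V> {
        self.entries.get(key).and_then(|(inserted, value)| {
            if inserted.elapsed() < self.ttl {
                Some(value.clone())
            } else {
                None
            }
        })
    }

    /// Insert or refresh an entry, resetting its TTL.
    fn insert(&mut self, key: K, value: V) {
        self.entries.insert(key, (Instant::now(), value));
    }

    /// Serve from the cache, or run the chain query and cache its result.
    fn get_or_try_insert_with<E>(
        &mut self,
        key: K,
        query: impl FnOnce() -> Result<V, E>,
    ) -> Result<V, E> {
        if let Some(hit) = self.get(&key) {
            return Ok(hit);
        }
        let value = query()?;
        self.insert(key, value.clone());
        Ok(value)
    }
}
```

A chain endpoint could then wrap its queries with something like `cache.get_or_try_insert_with(channel_id, || query_channel_from_node(...))` (hypothetical call), so that only cache misses hit the full node.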

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@adizere added the I: logic, O: usability, and P-high labels on Feb 22, 2022
@adizere added this to the v0.12.1 milestone on Feb 22, 2022
@ancazamfir
Collaborator

Great work in a short time! Just one observation; maybe you have discussed this already. I think we also need to maintain/update the cache via the IBC events.
One problem now for example is that I can create a connection in Init-Try state, start the relayer with channel mode enabled, and then separately finish the connection handshake. In this case the relayer will not relay on channel events over that connection because it thinks the connection is not open. This will not be fixed by the connection TTL, as I don't think we have the equivalent of clear packets at channel level (i.e. we don't rescan for unfinished channel handshakes).
The other case is a stale channel that Hermes thinks is open but that has been closed.

I see that the TTL for the client state and latest height is relatively small, so the stale cache is probably less noticeable. But we should consider fixing this as well.

@adizere
Member Author

adizere commented Mar 3, 2022

I think we also need to maintain/update the cache via the IBC events.

We were debating this yesterday, but the architectural implications of making the cache accessible from the websocket (supervisor code) were not clear. It may actually be easy to pass the cache (after it is created, somewhere here) to the supervisor, but we haven't tried. The idea we toyed with was to automatically update the ClientState.latest_height from the websocket upon detecting ClientUpdate events, but there was no time to work it through.

One problem now for example

Is this a problem that was introduced with PR #1932? I think we only cache channels and connections once they are in Open state. I don't fully understand the scenario yet.

The other case is a stale channel that Hermes thinks is open but that has been closed.

We debated the implications of this problem and reduced the TTL from 10 min to 1 min for channel ends. We decided that the cost of Hermes retrying packet submission on that (closed) channel for up to a minute before it finally detects the channel closure is a decent tradeoff for the moment.

@ancazamfir
Collaborator

We were debating this yesterday, but the architectural implications of making the cache accessible from the websocket (supervisor code) were not clear. It may actually be easy to pass the cache (after it is created, somewhere here) to the supervisor, but we haven't tried. The idea we toyed with was to automatically update the ClientState.latest_height from the websocket upon detecting ClientUpdate events, but there was no time to work it through.

Yes, I think we can get the chain latest_height from NewBlock and ClientState.latest_height from ClientUpdate.
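
A rough sketch of that event-to-cache mapping, with simplified stand-in types rather than the relayer's actual `IbcEvent` variants or cache structures, just to make the proposed data flow concrete:

```rust
use std::collections::HashMap;

// Simplified stand-ins; the relayer's real event and cache types differ.
// This only sketches how the supervisor's websocket events could keep
// cached heights in sync without extra queries to the full node.
type ClientId = String;
type Height = u64;

enum Event {
    NewBlock { height: Height },
    ClientUpdate { client_id: ClientId, consensus_height: Height },
    Other,
}

#[derive(Default)]
struct HeightCache {
    /// Latest height of the chain itself, refreshed from `NewBlock`.
    chain_latest_height: Option<Height>,
    /// `ClientState.latest_height` per client, refreshed from `ClientUpdate`.
    client_latest_heights: HashMap<ClientId, Height>,
}

impl HeightCache {
    /// Apply one websocket event to the cache.
    fn apply(&mut self, event: &Event) {
        match event {
            Event::NewBlock { height } => {
                self.chain_latest_height = Some(*height);
            }
            Event::ClientUpdate { client_id, consensus_height } => {
                self.client_latest_heights
                    .insert(client_id.clone(), *consensus_height);
            }
            Event::Other => {}
        }
    }
}
```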

Is this a problem that was introduced with PR #1932? I think we only cache channels and connections once they are in Open state. I don't fully understand the scenario yet.

You are right. I just tested with an older gaia version and it's fine. I was hitting cosmos/ibc-go#601, which is not included in gaia v6.0.3.

The other case is a stale channel that Hermes thinks is open but that has been closed.

We debated the implications of this problem and reduced the TTL from 10 min to 1 min for channel ends. We decided that the cost of Hermes retrying packet submission on that (closed) channel for up to a minute before it finally detects the channel closure is a decent tradeoff for the moment.

Agreed, for the short term it's OK. My question was about the long term. We get the IBC events anyway and may as well use them for cache updates. We could reduce the queries to proofs only. The drawback is that we need to deal with the websocket reconnects.

@romac
Member

romac commented Mar 4, 2022

We get the IBC events anyway and may as well use them for cache updates. We could reduce the queries to proofs only. The drawback is that we need to deal with the websocket reconnects.

Yeah, maybe we can invalidate the cache when we get disconnected.
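
A minimal sketch of what that could look like, assuming hypothetical cache types (not the relayer's actual API): on a websocket disconnect, drop every entry that event-driven updates were keeping fresh, and let subsequent lookups repopulate from fresh queries.

```rust
use std::collections::HashMap;

/// Placeholder payload; stands in for whatever the cache stores per channel.
#[derive(Default, Clone)]
struct ChannelEnd;

/// Stand-ins for the per-chain caches discussed above (channel ends,
/// client heights, latest chain height); not the actual ibc-relayer types.
#[derive(Default)]
struct ChainCaches {
    channel_ends: HashMap<String, ChannelEnd>,
    client_heights: HashMap<String, u64>,
    latest_height: Option<u64>,
}

impl ChainCaches {
    /// Called when the websocket to the node drops: while disconnected we may
    /// have missed the IBC events that were keeping these entries in sync, so
    /// the safe option is to discard everything and let the next lookups
    /// repopulate the caches from fresh queries against the full node.
    fn invalidate_on_disconnect(&mut self) {
        self.channel_ends.clear();
        self.client_heights.clear();
        self.latest_height = None;
    }
}
```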
