Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metric updates when timeout occurs #2457

Merged
merged 14 commits into from
Jul 28, 2022
Merged

Conversation

AlianBenabdallah
Copy link
Contributor

@AlianBenabdallah AlianBenabdallah commented Jul 25, 2022

This PR fixes the following:

Pushed to follow-up work:

Both captured in #2479

Description

  1. Even after a channel is cleared, oldest_* metrics can remain to the same value (i.e., not reset to 0).
  1. I think we should clarify what the oldest_timestamp is, it seems this field is a local timestamp to the Hermes process, not an on-chain packet timestamp (when the packet was created), which is not clear from the telemetry help message, specifically:
  1. There is an inconsistency between send packets count and acknowledgements count in the displayed metrics.

Previously, oldest_* metrics were not updated when a timeout would occur. The oldest_* metrics take their values from the sequences_histories map which was updated only when an ACK was received.

This PR handles the case of a timeout by calling record_ack_history when a timeout occurs.

The description of the oldest_* metric is also updated to incorporate the fact that the oldest_* is not only updated when an ACK occurs.


PR author checklist:

  • Added changelog entry, using unclog.
  • Added tests: integration (for Hermes) or unit/mock tests (for modules).
  • Linked to GitHub issue.
  • Updated code comments and documentation (e.g., docs/).

Reviewer checklist:

  • Reviewed Files changed in the GitHub PR explorer.
  • Manually tested (in case integration/unit/mock tests are absent).

@AlianBenabdallah AlianBenabdallah self-assigned this Jul 25, 2022
@AlianBenabdallah AlianBenabdallah added the I: telemetry Internal: related to Telemetry & metrics label Jul 25, 2022
relayer/src/worker/packet.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@seanchen1991 seanchen1991 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than Romain's comments, this looks good to me! 👍

@adizere adizere self-assigned this Jul 27, 2022
Copy link
Member

@adizere adizere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work Ali!

In additional to the bugs attached to the PR, we did some work refining the metrics and try to make them more useful, specifically:

  • renamed oldest_sequence -> backlog_oldest_sequence
  • renamed oldest_timestamp -> backlog_oldest_timestamp
  • introduced backlog_size to complement the other backlog_* data, as a metric reporting how many packets are pending on a channel

What's left to do:

I think both of these should be done in a follow-up PR. The present PR is linked to 3 issues and will be difficult to track what change is fixing what issue.

Discovery phase

Not sure how we can implement the discovery phase, because Hermes does not start packet workers for any channel until there's packets to relay on that channel, so not sure how we can populate the channel-specific metrics. For the other (non-channel-specific) metrics, we can probably initialize them with zero.

Inconsistency

For the "inconsistency between send packets count and acknowledgements count" we need some more digging. We actually have a lot of complexity around these metrics, there's four of them:

  1. send_packet_count
  2. acknowledgement_count
  3. cleared_send_packet_count
  4. cleared_acknowledgment_count

and its not clear how operators can use them (aside from inconsistencies). At the very least, we should document what is the purpose and insights they offer. Do they indicate activity? Lack of activity? Problems relaying? I'm not sure I know what they do.

@AlianBenabdallah
Copy link
Contributor Author

I tested again and the bug is correctly fixed.

@AlianBenabdallah AlianBenabdallah merged commit 1d18818 into master Jul 28, 2022
@AlianBenabdallah AlianBenabdallah deleted the ali/fix_oldest_sequence branch July 28, 2022 18:18
romac pushed a commit that referenced this pull request Jul 29, 2022
* Update 2451-fix-oldest-sequence-timeout.md

* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md

Co-authored-by: Adi Seredinschi <adi@informal.systems>

* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md

Co-authored-by: Adi Seredinschi <adi@informal.systems>

* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md

Co-authored-by: Adi Seredinschi <adi@informal.systems>

* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md

Co-authored-by: Adi Seredinschi <adi@informal.systems>

Co-authored-by: Adi Seredinschi <adi@informal.systems>
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
* clears telemetry sequence history on timeout

* change description to incorporate the fact that oldest can also be waiting for a timeout

* Delete json_encoder.rs

* update guide

* replace match by if let and replace e by event

* fmt

* changelog entry

* Rename 2429-fix-oldest-sequence-timeout.md to 2451-fix-oldest-sequence-timeout.md

* Update 2451-fix-oldest-sequence-timeout.md

* Update 2451-fix-oldest-sequence-timeout.md

* Renamed oldest_* metrics to backlog_*, refactored assoc. methods

* Fix for informalsystems#2467 w/ Ali + refactor

Co-authored-by: Adi Seredinschi <adi@informal.systems>
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
…#2457 (informalsystems#2476)

* Update 2451-fix-oldest-sequence-timeout.md

* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md

Co-authored-by: Adi Seredinschi <adi@informal.systems>

* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md

Co-authored-by: Adi Seredinschi <adi@informal.systems>

* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md

Co-authored-by: Adi Seredinschi <adi@informal.systems>

* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md

Co-authored-by: Adi Seredinschi <adi@informal.systems>

Co-authored-by: Adi Seredinschi <adi@informal.systems>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I: telemetry Internal: related to Telemetry & metrics
Projects
None yet
4 participants