Metric updates when timeout occurs #2457
…ibc-rs into ali/fix_oldest_sequence
Other than Romain's comments, this looks good to me! 👍
Amazing work Ali!
In addition to the bugs attached to the PR, we did some work refining the metrics to make them more useful, specifically:
- renamed `oldest_sequence` -> `backlog_oldest_sequence`
- renamed `oldest_timestamp` -> `backlog_oldest_timestamp`
- introduced `backlog_size` to complement the other `backlog_*` data, as a metric reporting how many packets are pending on a channel
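To make the relationship between the three `backlog_*` metrics concrete, here is a minimal sketch, assuming the pending packets of one channel are kept in an ordered map from sequence number to timestamp. The `Backlog` type and its methods are hypothetical illustrations, not the Hermes telemetry API.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-channel backlog: pending sequence number -> timestamp
/// (for example, Unix seconds at which the send packet event was observed).
#[derive(Default)]
pub struct Backlog {
    pending: BTreeMap<u64, u64>,
}

impl Backlog {
    /// Record a packet that is now pending on the channel.
    pub fn insert(&mut self, sequence: u64, timestamp: u64) {
        self.pending.insert(sequence, timestamp);
    }

    /// Drop a packet that is no longer pending.
    pub fn remove(&mut self, sequence: u64) {
        self.pending.remove(&sequence);
    }

    /// Value one could report as `backlog_oldest_sequence`:
    /// the smallest pending sequence number, if any.
    pub fn oldest_sequence(&self) -> Option<u64> {
        self.pending.keys().next().copied()
    }

    /// Value one could report as `backlog_oldest_timestamp`:
    /// the timestamp stored for the smallest pending sequence number.
    pub fn oldest_timestamp(&self) -> Option<u64> {
        self.pending.values().next().copied()
    }

    /// Value one could report as `backlog_size`:
    /// how many packets are pending on the channel.
    pub fn size(&self) -> usize {
        self.pending.len()
    }
}
```

Deriving all three values from a single map keeps them consistent with each other: the oldest sequence and its timestamp always come from the same entry, and the size is simply the number of entries.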
What's left to do:
- "There is an inconsistency between send packets count and acknowledgements count in the displayed metrics."
  - This is Anca's comment from Telemetry: introduce `backlog_*` metrics, plus minor fixes #2451, need to investigate further
- Discovery phase as described in Telemetry: introduce `backlog_*` metrics, plus minor fixes #2451
I think both of these should be done in a follow-up PR. The present PR is already linked to 3 issues, and it will be difficult to track which change fixes which issue.
Discovery phase
I'm not sure how we can implement the discovery phase: Hermes does not start packet workers for a channel until there are packets to relay on that channel, so it is unclear how we could populate the channel-specific metrics. For the other (non-channel-specific) metrics, we can probably initialize them with zero.
Inconsistency
For the "inconsistency between send packets count and acknowledgements count" we need some more digging. We actually have a lot of complexity around these metrics; there are four of them:
- send_packet_count
- acknowledgement_count
- cleared_send_packet_count
- cleared_acknowledgment_count
and it's not clear how operators can use them (aside from inconsistencies). At the very least, we should document their purpose and the insights they offer. Do they indicate activity? Lack of activity? Problems relaying? I'm not sure I know what they do.
I tested again and the bug is correctly fixed.
* Update 2451-fix-oldest-sequence-timeout.md
* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md
  Co-authored-by: Adi Seredinschi <adi@informal.systems>
* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md
  Co-authored-by: Adi Seredinschi <adi@informal.systems>
* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md
  Co-authored-by: Adi Seredinschi <adi@informal.systems>
* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md
  Co-authored-by: Adi Seredinschi <adi@informal.systems>

Co-authored-by: Adi Seredinschi <adi@informal.systems>

* clears telemetry sequence history on timeout
* change description to incorporate the fact that oldest can also be waiting for a timeout
* Delete json_encoder.rs
* update guide
* replace match by if let and replace e by event
* fmt
* changelog entry
* Rename 2429-fix-oldest-sequence-timeout.md to 2451-fix-oldest-sequence-timeout.md
* Update 2451-fix-oldest-sequence-timeout.md
* Update 2451-fix-oldest-sequence-timeout.md
* Renamed oldest_* metrics to backlog_*, refactored assoc. methods
* Fix for informalsystems#2467 w/ Ali + refactor

Co-authored-by: Adi Seredinschi <adi@informal.systems>

…#2457 (informalsystems#2476)

* Update 2451-fix-oldest-sequence-timeout.md
* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md
  Co-authored-by: Adi Seredinschi <adi@informal.systems>
* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md
  Co-authored-by: Adi Seredinschi <adi@informal.systems>
* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md
  Co-authored-by: Adi Seredinschi <adi@informal.systems>
* Update .changelog/unreleased/bug-fixes/ibc-relayer-cli/2451-fix-oldest-sequence-timeout.md
  Co-authored-by: Adi Seredinschi <adi@informal.systems>

Co-authored-by: Adi Seredinschi <adi@informal.systems>
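This PR fixes the following:
- Telemetry: introduce `backlog_*` metrics, plus minor fixes #2451
- `oldest_timestamp` gets updated without `oldest_sequence` as more packets are pending #2469

Pushed to follow-up work:
- The inconsistency between send packets count and acknowledgements count (Anca's comment from Telemetry: introduce `backlog_*` metrics, plus minor fixes #2451), need to investigate further
- The discovery phase as described in Telemetry: introduce `backlog_*` metrics, plus minor fixes #2451

Both captured in #2479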
Description
Previously, `oldest_*` metrics were not updated when a timeout would occur. The `oldest_*` metrics take their values from the `sequences_histories` map, which was updated only when an ACK was received.

This PR handles the case of a timeout by calling `record_ack_history` when a timeout occurs.

The description of the `oldest_*` metric is also updated to incorporate the fact that the `oldest_*` is not only updated when an ACK occurs.

PR author checklist:
- `unclog`
- `docs/`

Reviewer checklist:
- Files changed in the GitHub PR explorer.
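The fix in the description can be illustrated with a minimal sketch. It assumes a per-channel history keyed by sequence number, from which the oldest pending entry is read. `PacketEvent`, `SequenceHistory` and `handle` are hypothetical names for illustration only; the actual signature of `record_ack_history` and the layout of `sequences_histories` are not shown in this PR description and are not reproduced here.

```rust
use std::collections::BTreeMap;

/// Hypothetical packet events seen for one channel.
#[derive(Debug, Clone, Copy)]
pub enum PacketEvent {
    Send { sequence: u64, timestamp: u64 },
    Ack { sequence: u64 },
    Timeout { sequence: u64 },
}

/// Hypothetical stand-in for one channel's entry in the sequence history:
/// pending sequence number -> timestamp of the send event.
#[derive(Default)]
pub struct SequenceHistory {
    pending: BTreeMap<u64, u64>,
}

impl SequenceHistory {
    /// Update the history for one event.
    pub fn handle(&mut self, event: PacketEvent) {
        match event {
            PacketEvent::Send { sequence, timestamp } => {
                self.pending.insert(sequence, timestamp);
            }
            // If only `Ack` were handled here, a timed-out packet would stay
            // in the map and keep being reported as the oldest one. Treating
            // `Timeout` the same way as `Ack` clears it.
            PacketEvent::Ack { sequence } | PacketEvent::Timeout { sequence } => {
                self.pending.remove(&sequence);
            }
        }
    }

    /// Oldest pending (sequence, timestamp) pair, if any.
    pub fn oldest(&self) -> Option<(u64, u64)> {
        self.pending.iter().next().map(|(s, t)| (*s, *t))
    }
}
```

With this shape, a sequence that times out leaves the history just as an acknowledged one does, so the oldest pending entry moves on to the next sequence instead of staying stuck on a packet that will never be acknowledged.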