Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some issues and wrinkles with telemetry metrics #2408

Closed
1 of 5 tasks
ancazamfir opened this issue Jul 14, 2022 · 16 comments · Fixed by #2409
Closed
1 of 5 tasks

Some issues and wrinkles with telemetry metrics #2408

ancazamfir opened this issue Jul 14, 2022 · 16 comments · Fixed by #2409
Milestone

Comments

@ancazamfir
Copy link
Collaborator

ancazamfir commented Jul 14, 2022

Summary of Bug

I was trying to make sense of the telemetry metrics and while I find some of them very useful (like the cache hits, number of queries, workers, etc) I am struggling with others. Here are some notes I took:

missing:

  • light client queries and latencies for queries, verification, etc.
  • there is Number of WriteAcknowledgement relayed but not Number of SendPacket relayed. Should probably fix wording as those are events, hermes acts on events but relays messages, e.g. on SendPacket event it relays a MsgRecvPacket message, on WriteAcknowledgement event it relays MsgAcknowledgement.
  • there is a Number of SendPacket relayed through ClearPendingPackets but not Number of WriteAcknowledgement relayed through ClearPendingPackets. Or is this supposed to count only once per packet lifetime? (i.e. potentially two message). Could this be clarified?
  • number of update client messages

confusing:

  • relayed per channel only shows with tx_confirmation = true, help should mention this. But since almost no one runs with this set, I am wondering if we should record the times we send a msg successfully (even if we don't have a confirmation that it was not a duplicate or deliverTx has not failed). Not sure how it will work with retries.
  • tx_latency_submitted_bucket always show le="+Inf"}. I see the comments below but not clear why we don't have some buckets. We should be able to configure something like 10, 50, 100, 200, 500, 1000, 2000, Inf no? At least explain there are no buckets or remove it altogether.
    // TODO: Once quantile sketches are supported, replace histograms with that.
    // For the moment, disable histogram buckets since no values make sense for all use-cases.
    

incorrect (?):
the oldest sequence/ timestamp. I send some packets and start hermes with clear_on_start = false. Hermes will not relay those packets and can be confirmed with query packet pending CLI. However the metrics show:

# HELP oldest_sequence The sequence number of the oldest pending SendPacket. If this value is 0, it means there are no pending SendPacket
# TYPE oldest_sequence gauge
oldest_sequence{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 0
# HELP oldest_timestamp The timestamp of the oldest sequence number in seconds
# TYPE oldest_timestamp gauge
oldest_timestamp{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 0

If hermes relays some packets the metrics above will flicker for a bit with the current sequence number/ timestamp and then go back to 0.

Version

master

Steps to Reproduce

  • run hermes, send packets, check metrics
  • send packets, start hermes, check metrics
  • send packets, relay some MsgRecvPacket (use tx raw), send more packets, start hermes
  • try with different configuration parameters

Acceptance Criteria

Better documentation on metrics, what they mean exactly, how they are counted (especially the number of messages, events, etc.). Fix an of the metric inaccuracies. Provide metrics for light client operations as we suspect them to be heavy weights and one of the places we can try improvements.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@ljoss17
Copy link
Contributor

ljoss17 commented Jul 14, 2022

missing:

  • light client queries and latencies for queries, verification, etc.
  • number of update client messages

As there have not been any requests for these metrics, what do you think about having a separate issue as a strech goal for these ?

  • there is Number of WriteAcknowledgement relayed but not Number of SendPacket relayed. Should probably fix wording as those are events, hermes acts on events but relays messages, e.g. on SendPacket event it relays a MsgRecvPacket message, on WriteAcknowledgement event it relays MsgAcknowledgement.

There is a send_packet_count for the Number of SendPacket relayed. The wording could indeed be improved, would it be more clear if with the following help message:

  • "# HELP send_packet_count Number of SendPacket events observed"
  • "# HELP acknowledgement_count Number of WriteAcknowledgment observed"
  • "# HELP cleared_count Number of SendPacket observed during ClearPendingPackets"
  • "# HELP cleared_acknowledgment_count Number of WriteAcknowledgment observed during ClearPending Packets"
  • there is a Number of SendPacket relayed through ClearPendingPackets but not Number of WriteAcknowledgement relayed through ClearPendingPackets. Or is this supposed to count only once per packet lifetime? (i.e. potentially two message). Could this be clarified?

The send_packet_count and acknowledgement_count are used together in order to know if Hermes observed the same number of SendPacket events and WriteAcknowledgment events. If everything is working correctly these number should be the same.
The cleared_count gives additional information on the send_packet_count, being of the number of SendPacket observed, how many where due to ClearPendingPackets. If this value increases it could point to an issue where Hermes is unstable.
But a "WriteAcknowledgment observed during ClearPending Packets" is indeed missing, as the current implementation could have a mismatch between send_packet_count and acknowledgement_count if there are WriteAcknowledgment pending before Hermes starts.

confusing:

  • relayed per channel only shows with tx_confirmation = true, help should mention this. But since almost no one runs with this set, I am wondering if we should record the times we send a msg successfully (even if we don't have a confirmation that it was not a duplicate or deliverTx has not failed). Not sure how it will work with retries.

What do you think about these help messages for the following three metrics concerned by tx_confirmed = true:

  • "# HELP ibc_acknowledgment_packets Number of confirmed acknowledgment packets relayed per channel"
  • "# HELP ibc_receive_packets Number of confirmed receive packets relayed per channel"
  • "# HELP ibc_timeout_packets Number of confirmed timeout packets relayed per channel"
    As well as updating the help message in the example conf.toml file to:
# Toggle the transaction confirmation mechanism.
# The tx confirmation mechanism periodically queries the `/tx_search` RPC
# endpoint to check that previously-submitted transactions
# (to any chain in this config file) have been successfully delivered.
# If they have not been, and `clear_interval = 0`, then those packets are
# queued up for re-submission.
# Experimental feature. If set to false this will disable telemetry
# `ibc_acknowledgment_packets`, `ibc_receive_packets` and `ibc_timeout_packets`.
# [Default: true]
  • tx_latency_submitted_bucket always show le="+Inf"}. I see the comments below but not clear why we don't have some buckets. We should be able to configure something like 10, 50, 100, 200, 500, 1000, 2000, Inf no? At least explain there are no buckets or remove it altogether.
    // TODO: Once quantile sketches are supported, replace histograms with that.
    // For the moment, disable histogram buckets since no values make sense for all use-cases.
    

Regarding the buckets the issue is that the bucket sizes are for absolute values in milliseconds. So metrics which would make sens for some cases won’t for others. We can include some buckets but I am not sure what values to choose. Would you have an idea ?

incorrect (?): the oldest sequence/ timestamp. I send some packets and start hermes with clear_on_start = false. Hermes will not relay those packets and can be confirmed with query packet pending CLI. However the metrics show:

# HELP oldest_sequence The sequence number of the oldest pending SendPacket. If this value is 0, it means there are no pending SendPacket
# TYPE oldest_sequence gauge
oldest_sequence{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 0
# HELP oldest_timestamp The timestamp of the oldest sequence number in seconds
# TYPE oldest_timestamp gauge
oldest_timestamp{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 0

If hermes relays some packets the metrics above will flicker for a bit with the current sequence number/ timestamp and then go back to 0.

The phrasing of the help message is confusing, the work pending is creating the confusion. The oldest_sequence is used in order to track if Hermes has observed a SendPacket event but not its corresponding WriteAcknowledgment event.
So pending packets are not taken into account as Hermes does not see the SendPacket event as long as they are pending.
Would the following help message be more clear:

  • "#HELP oldest_sequence The sequence number of the oldest SendPacket observed without its corresponding WriteAcknowledgement event."

@ancazamfir
Copy link
Collaborator Author

missing:

  • light client queries and latencies for queries, verification, etc.
  • number of update client messages

As there have not been any requests for these metrics, what do you think about having a separate issue as a strech goal for these ?

separate issue is ok. "stretch" not sure :) It is important to show where the relayer spends time. And we suspect it spends a lot of time the client operations. So the question is how much out of the full Tx latency do we spend with client operations? Are we doing a lot of client updates? But I guess I can use other profiling instead until this is implemented, if ever.

  • there is Number of WriteAcknowledgement relayed but not Number of SendPacket relayed. Should probably fix wording as those are events, hermes acts on events but relays messages, e.g. on SendPacket event it relays a MsgRecvPacket message, on WriteAcknowledgement event it relays MsgAcknowledgement.

There is a send_packet_count for the Number of SendPacket relayed.

Ah, I see it now (just at the bottom), thanks!

The wording could indeed be improved, would it be more clear if with the following help message:

  • "# HELP send_packet_count Number of SendPacket events observed"
  • "# HELP acknowledgement_count Number of WriteAcknowledgment observed"
  • "# HELP cleared_count Number of SendPacket observed during ClearPendingPackets"
  • "# HELP cleared_acknowledgment_count Number of WriteAcknowledgment observed during ClearPending Packets"

Sounds better. cleared_count -> clear_send_packet_count. And maybe
"..count Number of <event> events processed ..."
I believe the first includes the latter (the one with during ClearPendingPackets). If some real time event processing fails the events are re-generated and processed during clearing. Not sure how to make this clear.
btw, where is "ClearPendingPackets" coming from? And what would be the name for the realtime/ live event mechanism?

  • there is a Number of SendPacket relayed through ClearPendingPackets but not Number of WriteAcknowledgement relayed through ClearPendingPackets. Or is this supposed to count only once per packet lifetime? (i.e. potentially two message). Could this be clarified?

The send_packet_count and acknowledgement_count are used together in order to know if Hermes observed the same number of SendPacket events and WriteAcknowledgment events. If everything is working correctly these number should be the same.

Not necessarily. If some MsgRecvPacket-s have been sent but not the corresponding MsgAcknowledgment-s. For example, i start with 49 pending packets, and MsgRecvPacket sent for the first 30. I expect when I start the relayer, once it's done, to see 19 MsgRecvPacket sent (plus the client updates) and 49 MsgAcknowledgment-s. The metrics show:

acknowledgement_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 30
send_packet_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 30

# HELP cleared_count Number of SendPacket relayed through ClearPendingPackets
cleared_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 30

msg_num{chain="ibc-0"} 31
msg_num{chain="ibc-1"} 52

The msg_num above are correct (i assume they include the client updates)

The cleared_count gives additional information on the send_packet_count, being of the number of SendPacket observed, how many where due to ClearPendingPackets.

Not clear here. In my example above, what is the expected send_packet_count? Should it be 30 like it is? And in this case should there be a write_acknowledgment_count of 49?

If this value increases it could point to an issue where Hermes is unstable. But a "WriteAcknowledgment observed during ClearPending Packets" is indeed missing, as the current implementation could have a mismatch between send_packet_count and acknowledgement_count if there are WriteAcknowledgment pending before Hermes starts.

confusing:

  • relayed per channel only shows with tx_confirmation = true, help should mention this. But since almost no one runs with this set, I am wondering if we should record the times we send a msg successfully (even if we don't have a confirmation that it was not a duplicate or deliverTx has not failed). Not sure how it will work with retries.

What do you think about these help messages for the following three metrics concerned by tx_confirmed = true:

  • "# HELP ibc_acknowledgment_packets Number of confirmed acknowledgment packets relayed per channel"
  • "# HELP ibc_receive_packets Number of confirmed receive packets relayed per channel"
  • "# HELP ibc_timeout_packets Number of confirmed timeout packets relayed per channel"

Maybe also add a headline like:

  • "# HELP Number of confirmed packets relayed per channel. Available if relayer runs with Tx confirmation enabled"

As well as updating the help message in the example conf.toml file to:

# Toggle the transaction confirmation mechanism.
# The tx confirmation mechanism periodically queries the `/tx_search` RPC
# endpoint to check that previously-submitted transactions
# (to any chain in this config file) have been successfully delivered.
# If they have not been, and `clear_interval = 0`, then those packets are
# queued up for re-submission.
# Experimental feature. If set to false this will disable telemetry
# `ibc_acknowledgment_packets`, `ibc_receive_packets` and `ibc_timeout_packets`.
# [Default: true]

Sounds good! While at it let's make the default false here and in the code (if not already).

  • tx_latency_submitted_bucket always show le="+Inf"}. I see the comments below but not clear why we don't have some buckets. We should be able to configure something like 10, 50, 100, 200, 500, 1000, 2000, Inf no? At least explain there are no buckets or remove it altogether.
    // TODO: Once quantile sketches are supported, replace histograms with that.
    // For the moment, disable histogram buckets since no values make sense for all use-cases.
    

Regarding the buckets the issue is that the bucket sizes are for absolute values in milliseconds. So metrics which would make sens for some cases won’t for others. We can include some buckets but I am not sure what values to choose. Would you have an idea ?

Yes, the ones I mention above or similar. And we try them on local setups and real chains. They should give more info than what we have now. Then we can start tuning them.

incorrect (?): the oldest sequence/ timestamp. I send some packets and start hermes with clear_on_start = false. Hermes will not relay those packets and can be confirmed with query packet pending CLI. However the metrics show:

# HELP oldest_sequence The sequence number of the oldest pending SendPacket. If this value is 0, it means there are no pending SendPacket
# TYPE oldest_sequence gauge
oldest_sequence{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 0
# HELP oldest_timestamp The timestamp of the oldest sequence number in seconds
# TYPE oldest_timestamp gauge
oldest_timestamp{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 0

If hermes relays some packets the metrics above will flicker for a bit with the current sequence number/ timestamp and then go back to 0.

The phrasing of the help message is confusing, the work pending is creating the confusion. The oldest_sequence is used in order to track if Hermes has observed a SendPacket event but not its corresponding WriteAcknowledgment event. So pending packets are not taken into account as Hermes does not see the SendPacket event as long as they are pending. Would the following help message be more clear:

  • "#HELP oldest_sequence The sequence number of the oldest SendPacket observed without its corresponding WriteAcknowledgement event."

Yes, thanks for the clarification.

@ljoss17
Copy link
Contributor

ljoss17 commented Jul 15, 2022

Ok, I suggest that the adding the new metrics to monitor where the relayer spends time will be done in a separate PR to make it easier to review.

I will work on the other changes and link this issue to the PR once it's ready.

btw, where is "ClearPendingPackets" coming from? And what would be the name for the realtime/ live event mechanism?

I am not sure I understand the question. In the relayer code, relayer/src/worker/packet.rs line 107, it seems to be when a ClearPendingPackets command is received. Or in the Excalidraw https://app.excalidraw.com/l/4XqkU6POmGI/9X8Mv47sKuD it should be the Worker::ClearPendingPackets

Not necessarily. If some MsgRecvPacket-s have been sent but not the corresponding MsgAcknowledgment-s. For example, i start with 49 pending packets, and MsgRecvPacket sent for the first 30. I expect when I start the relayer, once it's done, to see 19 MsgRecvPacket sent (plus the client updates) and 49 MsgAcknowledgment-s. The metrics show:

I am not sure to understand the example for the send_packet_count and write_acknowledgment_count.
When Hermes sees any SendPacket event, it will increment send_packet_count, and additionally the cleared_count will show how many of these SendPacket events have been observed while relaying messages when the ClearPendingPackets command is received. So if we look only at the example where we have 20 pending MsgRecvPacket:

  • hermes tx ft-transfer .... --number-msgs 20

Then start Hermes with the clear_on_start = true option, at this point the metrics will be as following:

cleared_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 20
send_packet_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 20
acknowledgement_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 20

As 20 SendPacket events have been observed, and of these 20, 20 where observed when Hermes received the ClearPendingPackets command. If we now send 15 additional messages while Hermes is running:

  • hermes tx ft-transfer .... --number-msgs 15

Then the metrics will be:

cleared_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 20
send_packet_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 35
acknowledgement_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 35

Since the total number of SendPacket events seen is 35, but only the first 20 where observed during the ClearPendingPackets.
Regarding the Acknowledgments, at the moment there is a potential mismatch. If there are 15 pending MsgAcknowledgment, we start Hermes and send an additional 20 messages, the metrics will be:

cleared_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 0
send_packet_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 20
acknowledgement_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 35

I suggest to add an additional metric, cleared_acknowledgment_count, which would count only the WriteAcknowledgment observed during the ClearPendingPackets. So the previous example would yield the following metrics:

cleared_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 0
send_packet_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 20
acknowledgement_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 20
cleared_acknowledgment_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 15

But if we take the case where clear_on_start = false and run the following:

  1. hermes tx ft-transfer .... --number-msgs 20
  2. hermes start
  3. hermes tx ft-transfer .... --number-msgs 15

Then the pending packets will not be taken into account as Hermes has not seen the first 20 SendPacket and WriteAcknowledgment event, thus the metrics will be:

cleared_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 0
send_packet_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 15
acknowledgement_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 15

@ancazamfir
Copy link
Collaborator Author

I am not sure I understand the question. In the relayer code, relayer/src/worker/packet.rs line 107, it seems to be when a ClearPendingPackets command is received. Or in the Excalidraw https://app.excalidraw.com/l/4XqkU6POmGI/9X8Mv47sKuD it should be the Worker::ClearPendingPackets

Do we need to put the name used in the code here? I was asking because it looks out of place.

@ljoss17
Copy link
Contributor

ljoss17 commented Jul 15, 2022

I see, I put the ClearPendingPackets as it was in the Excalidraw, which seemed to be the name representing the event. But I can put something else if there is a more appropriate name.

@ancazamfir
Copy link
Collaborator Author

ancazamfir commented Jul 15, 2022

I am not sure to understand the example for the send_packet_count and write_acknowledgment_count...

I think we say the same thing :) Can we make the count names a bit more consistent:

cleared_send_packet_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 0
cleared_acknowledgment_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 15
total_send_packet_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 20
total_acknowledgement_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 20

(can drop total i guess)
Also the help strings.

@ljoss17
Copy link
Contributor

ljoss17 commented Jul 15, 2022

Yes good idea, I'll apply the changes to the PR

@ancazamfir
Copy link
Collaborator Author

I see, I put the ClearPendingPackets as it was in the Excalidraw, which seemed to be the name representing the event. But I can put something else if there is a more appropriate name.

Can we have a full template doc of the exact metrics and help strings? This would help with the review of user UX.

imho we shouldn't assume the user knows the internal architecture and code. We can just say something like:
# HELP cleared_count number of SendPacket relayed during the initial and periodic clearing.
The user should know about this functionality. We can add more to the help to say they show only if they are enabled in the configuration.

@ancazamfir
Copy link
Collaborator Author

Ok, I suggest that the adding the new metrics to monitor where the relayer spends time will be done in a separate PR to make it easier to review.

Sure!

@ancazamfir
Copy link
Collaborator Author

One more question, is there a way to not show the ibc_..._packets counters if confirmation is not set? Or if they are 0?

@ancazamfir
Copy link
Collaborator Author

Can we have a full template doc of the exact metrics and help strings? This would help with the review of user UX.

Actually it's fine, i see they are in the guide.

@ljoss17
Copy link
Contributor

ljoss17 commented Jul 15, 2022

Everything except for the new metrics to monitor where the relayer spends time have been added to the PR linked to this issue

@ancazamfir
Copy link
Collaborator Author

Everything except for the new metrics to monitor where the relayer spends time have been added to the PR linked to this issue

thanks @ljoss17 ! i will look at it and ask more questions there.

@ancazamfir
Copy link
Collaborator Author

missing:

  • light client queries and latencies for queries, verification, etc.
  • number of update client messages

As there have not been any requests for these metrics, what do you think about having a separate issue as a strech goal for these ?

Ok, I think what I ask for doesn't make a lot of sense wrt to the latency. It would be nice to have counters for how many times we get headers (something in the same category with the queries counters) and how many MsgUpdateClient messages we sent (in same group with ibc_..._packets counters).

@ljoss17
Copy link
Contributor

ljoss17 commented Jul 19, 2022

missing:

  • light client queries and latencies for queries, verification, etc.
  • number of update client messages

As there have not been any requests for these metrics, what do you think about having a separate issue as a strech goal for these ?

Ok, I think what I ask for doesn't make a lot of sense wrt to the latency. It would be nice to have counters for how many times we get headers (something in the same category with the queries counters) and how many MsgUpdateClient messages we sent (in same group with ibc_..._packets counters).

Are the following two metrics the ones you expect ?

  • A counter for how many LIghtBlocks we query, which are used to create the IBC headers.
  • A counter for how many MsgUpdateClient messages we send.

@adizere adizere added this to the v1.0.0 milestone Jul 19, 2022
@ancazamfir
Copy link
Collaborator Author

Are the following two metrics the ones you expect ?

  • A counter for how many LIghtBlocks we query, which are used to create the IBC headers.
  • A counter for how many MsgUpdateClient messages we send.

Let's start with the last one and you can open a new issue/PR for it. In the meantime i will try to find time to do profiling. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants