-
Notifications
You must be signed in to change notification settings - Fork 329
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some issues and wrinkles with telemetry metrics #2408
Comments
As there have not been any requests for these metrics, what do you think about having a separate issue as a strech goal for these ?
There is a
The
What do you think about these help messages for the following three metrics concerned by
Regarding the buckets the issue is that the bucket sizes are for absolute values in milliseconds. So metrics which would make sens for some cases won’t for others. We can include some buckets but I am not sure what values to choose. Would you have an idea ?
The phrasing of the help message is confusing, the work pending is creating the confusion. The oldest_sequence is used in order to track if Hermes has observed a SendPacket event but not its corresponding WriteAcknowledgment event.
|
separate issue is ok. "stretch" not sure :) It is important to show where the relayer spends time. And we suspect it spends a lot of time the client operations. So the question is how much out of the full Tx latency do we spend with client operations? Are we doing a lot of client updates? But I guess I can use other profiling instead until this is implemented, if ever.
Ah, I see it now (just at the bottom), thanks!
Sounds better.
Not necessarily. If some
The
Not clear here. In my example above, what is the expected
Maybe also add a headline like:
Sounds good! While at it let's make the default
Yes, the ones I mention above or similar. And we try them on local setups and real chains. They should give more info than what we have now. Then we can start tuning them.
Yes, thanks for the clarification. |
Ok, I suggest that the adding the new metrics to monitor where the relayer spends time will be done in a separate PR to make it easier to review. I will work on the other changes and link this issue to the PR once it's ready.
I am not sure I understand the question. In the relayer code,
I am not sure to understand the example for the
Then start Hermes with the
As 20
Then the metrics will be:
Since the total number of
I suggest to add an additional metric,
But if we take the case where
Then the pending packets will not be taken into account as Hermes has not seen the first 20
|
Do we need to put the name used in the code here? I was asking because it looks out of place. |
I see, I put the ClearPendingPackets as it was in the Excalidraw, which seemed to be the name representing the event. But I can put something else if there is a more appropriate name. |
I think we say the same thing :) Can we make the count names a bit more consistent:
(can drop |
Yes good idea, I'll apply the changes to the PR |
Can we have a full template doc of the exact metrics and help strings? This would help with the review of user UX. imho we shouldn't assume the user knows the internal architecture and code. We can just say something like: |
Sure! |
One more question, is there a way to not show the |
Actually it's fine, i see they are in the guide. |
Everything except for the new metrics to monitor where the relayer spends time have been added to the PR linked to this issue |
thanks @ljoss17 ! i will look at it and ask more questions there. |
Ok, I think what I ask for doesn't make a lot of sense wrt to the latency. It would be nice to have counters for how many times we get headers (something in the same category with the queries counters) and how many |
Are the following two metrics the ones you expect ?
|
Let's start with the last one and you can open a new issue/PR for it. In the meantime i will try to find time to do profiling. Thanks! |
Summary of Bug
I was trying to make sense of the telemetry metrics and while I find some of them very useful (like the cache hits, number of queries, workers, etc) I am struggling with others. Here are some notes I took:
missing:
Number of WriteAcknowledgement relayed
but notNumber of SendPacket relayed
. Should probably fix wording as those are events, hermes acts on events but relays messages, e.g. onSendPacket
event it relays aMsgRecvPacket
message, onWriteAcknowledgement
event it relaysMsgAcknowledgement
.Number of SendPacket relayed through ClearPendingPackets
but notNumber of WriteAcknowledgement relayed through ClearPendingPackets
. Or is this supposed to count only once per packet lifetime? (i.e. potentially two message). Could this be clarified?confusing:
tx_confirmation = true
, help should mention this. But since almost no one runs with this set, I am wondering if we should record the times we send a msg successfully (even if we don't have a confirmation that it was not a duplicate or deliverTx has not failed). Not sure how it will work with retries.tx_latency_submitted_bucket
always showle="+Inf"}
. I see the comments below but not clear why we don't have some buckets. We should be able to configure something like10, 50, 100, 200, 500, 1000, 2000, Inf
no? At least explain there are no buckets or remove it altogether.incorrect (?):
the oldest sequence/ timestamp. I send some packets and start hermes with
clear_on_start = false
. Hermes will not relay those packets and can be confirmed withquery packet pending
CLI. However the metrics show:If hermes relays some packets the metrics above will flicker for a bit with the current sequence number/ timestamp and then go back to 0.
Version
master
Steps to Reproduce
Acceptance Criteria
Better documentation on metrics, what they mean exactly, how they are counted (especially the number of messages, events, etc.). Fix an of the metric inaccuracies. Provide metrics for light client operations as we suspect them to be heavy weights and one of the places we can try improvements.
For Admin Use
The text was updated successfully, but these errors were encountered: