chore: adding kafka brokers metrics #196
Adding a note: this is a continuation of this PR. I could not make changes to that fork, so I needed to create a new one.
@jsuereth, I'm tagging you since it looks like you were the assignee on this one. We have a lot of eyes on our end wanting to get this one to the finish line, so please let me know if there is anything I can do to make sure it is good to go. Thanks!
Please move the metrics to another section ### Broker Metrics
Co-authored-by: Dmitrii Anoshin <anoshindx@gmail.com>
Hi @dmitryax, I'd like to make some progress on this while I'm waiting on approval for the semantic conventions PR. Can you tell me what the CI failure is? I'm not seeing any actual error code, so I don't know what needs to be fixed.
I'm not sure if this PR documents what's already done on Kafka and whether there are limitations on what can be changed, but I suggest staying as close to the general OTel metric requirements as possible.
I left several comments on this.
Also, I'd recommend using `messaging.kafka.broker` (not `brokers`) as a namespace.
docs/messaging/kafka.md
Outdated
| messaging.kafka.brokers.count | UpDownCounter | Int64 | brokers | `{broker}` | sum of brokers in the cluster | | |
| messaging.kafka.brokers.consumer.fetch.rate | Gauge | Double | fetches per second | `{fetch}/s` | Average consumer fetch Rate. | `state` | `in`, `out` |
| messaging.kafka.brokers.network.io | Counter | Int64 | bytes | `By` | The bytes received or sent by the broker. | | |
| messaging.kafka.brokers.requests.latency | Gauge | Double | ms | `{ms}` | Average Request latency in ms. | | |
Is it possible to report it as a histogram? Then we'll have percentiles, and `messaging.kafka.brokers.requests.rate` can also be derived from it.
Agreed. I would much prefer to see this as a histogram. Even better, in my opinion, would be an exponential histogram.
Hmm, I don't have any experience converting histograms from metrics.go into histograms for OTel. Are there other examples where someone has done this in the past? The Kafka consumer metrics, for example, already track `lag` as a `gauge` metric, so I thought the same would make sense here.
I could see it potentially being an upgrade; I'm just not sure how to apply it when updating the collector.
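To illustrate the reviewers' point, here is a minimal sketch in plain Python (not the OTel SDK and not the collector code) of why an explicit-bucket histogram subsumes both an average-latency gauge and a request-rate gauge. The class name `BucketHistogram`, the bucket bounds, and the sample latencies are all hypothetical, and buckets are assumed to be upper-inclusive.

```python
from bisect import bisect_left
from typing import List


class BucketHistogram:
    """Toy explicit-bucket histogram (illustrative only, not an OTel type)."""

    def __init__(self, bounds: List[float]) -> None:
        # Ascending upper bounds; one extra overflow bucket catches larger values.
        self.bounds = bounds
        self.bucket_counts = [0] * (len(bounds) + 1)
        self.count = 0
        self.sum = 0.0

    def record(self, value: float) -> None:
        # bisect_left places a value equal to a bound into that bound's bucket.
        self.bucket_counts[bisect_left(self.bounds, value)] += 1
        self.count += 1
        self.sum += value

    def percentile_upper_bound(self, q: float) -> float:
        """Upper bound of the bucket that contains the q-quantile (0 < q <= 1)."""
        target = q * self.count
        seen = 0
        for i, n in enumerate(self.bucket_counts):
            seen += n
            if seen >= target:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("inf")


# Hypothetical request latencies in milliseconds.
hist = BucketHistogram([5, 10, 25, 50, 100])
for latency_ms in [3, 4, 7, 8, 9, 12, 20, 40, 80, 250]:
    hist.record(latency_ms)

request_count = hist.count            # feeds a request count (and thus a rate)
mean_latency = hist.sum / hist.count  # what the average-latency gauge reported
p50 = hist.percentile_upper_bound(0.5)
p90 = hist.percentile_upper_bound(0.9)
```

The count and sum recover everything the gauges carried, while the bucket counts add a coarse view of the distribution that an average alone cannot provide.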
docs/messaging/kafka.md
Outdated
| Name | Instrument | Value type | Unit | Unit ([UCUM](/docs/general/metrics.md#instrument-units)) | Description | Attribute Key | Attribute Values |
| ---------------------------------------------| ------------- | ---------- | ------ | -------------------------------------------- | -------------- | ------------- | ---------------- |
| messaging.kafka.brokers.count | UpDownCounter | Int64 | brokers | `{broker}` | sum of brokers in the cluster | | |
Is this metric necessary?
Assuming brokers report a unique instance id (e.g. the standard `service.instance.id` attribute), it can be derived from other metrics in this list. If brokers also report standard metrics (CPU, memory, etc.), it can also be derived from them.
That might be true, but this is primarily here to replace the `kafka.brokers` metric, which already exists. Doing away with it completely might impact metrics that are being collected now, although I do see the point that it could be derived if we added the attribute.
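As a rough sketch of the derivation suggested above, the broker count can be computed as the number of distinct instance ids seen across data points of any other broker metric. The function name, the data-point shape (a plain attribute dictionary), and the sample ids below are assumptions made for illustration only.

```python
from typing import Dict, Iterable, Set


def distinct_brokers(
    data_points: Iterable[Dict[str, str]],
    key: str = "service.instance.id",
) -> int:
    """Count distinct values of `key` across the attribute sets of data points."""
    ids: Set[str] = {attrs[key] for attrs in data_points if key in attrs}
    return len(ids)


# Hypothetical attribute sets from data points of some other broker metric.
points = [
    {"service.instance.id": "broker-0"},
    {"service.instance.id": "broker-1"},
    {"service.instance.id": "broker-0"},
    {"service.instance.id": "broker-2"},
    {},  # a point without the attribute is ignored
]

broker_count = distinct_brokers(points)
```

This only works if every broker reports at least one data point with the attribute in the window being examined, which is the condition the reviewer's comment assumes.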
docs/messaging/kafka.md
Outdated
| Name | Instrument | Value type | Unit | Unit ([UCUM](/docs/general/metrics.md#instrument-units)) | Description | Attribute Key | Attribute Values |
| ---------------------------------------------| ------------- | ---------- | ------ | -------------------------------------------- | -------------- | ------------- | ---------------- |
| messaging.kafka.brokers.count | UpDownCounter | Int64 | brokers | `{broker}` | sum of brokers in the cluster | | |
| messaging.kafka.brokers.consumer.fetch.rate | Gauge | Double | fetches per second | `{fetch}/s` | Average consumer fetch Rate. | `state` | `in`, `out` |
If brokers can report `messaging.kafka.brokers.consumer.fetch.count`, it can provide more information and the rate can be derived from it - can it be changed?
Indeed. I thought that it was preferred not to have rate metrics but instead use a counter from which the rate can be derived for whatever time period is desired.
Hmm, yeah I think that is correct. This should be updated.
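For readers less familiar with this convention, the sketch below shows how a rate over any window can be derived from samples of a cumulative counter, which a pre-averaged rate gauge does not allow. The function name and sample values are hypothetical, and counter resets are deliberately ignored to keep the example short.

```python
from typing import List, Tuple


def rate_per_second(samples: List[Tuple[float, int]]) -> float:
    """Average rate between the first and last (timestamp_s, cumulative_count)."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (c1 - c0) / (t1 - t0)


# Hypothetical cumulative fetch counts, sampled every 10 seconds.
fetch_count_samples = [(0.0, 100), (10.0, 160), (20.0, 250), (30.0, 400)]

# The same counter answers questions about different time windows.
last_10s = rate_per_second(fetch_count_samples[-2:])
last_30s = rate_per_second(fetch_count_samples)
```

With these samples the last 10 seconds average 15 fetches per second while the full 30 seconds average 10, a distinction that is lost once only an averaged rate is exported.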
docs/messaging/kafka.md
Outdated
| messaging.kafka.brokers.consumer.fetch.rate | Gauge | Double | fetches per second | `{fetch}/s` | Average consumer fetch Rate. | `state` | `in`, `out` |
| messaging.kafka.brokers.network.io | Counter | Int64 | bytes | `By` | The bytes received or sent by the broker. | | |
| messaging.kafka.brokers.requests.latency | Gauge | Double | ms | `{ms}` | Average Request latency in ms. | | |
| messaging.kafka.brokers.requests.rate | Gauge | Double | requests per second | `{request}/s`| Average request rate per second. | | |
Same here: `messaging.kafka.brokers.request.count` is more flexible and the rate can be derived from it.
agreed.
docs/messaging/kafka.md
Outdated
| messaging.kafka.brokers.network.io | Counter | Int64 | bytes | `By` | The bytes received or sent by the broker. | | |
| messaging.kafka.brokers.requests.latency | Gauge | Double | ms | `{ms}` | Average Request latency in ms. | | |
| messaging.kafka.brokers.requests.rate | Gauge | Double | requests per second | `{request}/s`| Average request rate per second. | | |
| messaging.kafka.brokers.requsts.size | Gauge | Double | bytes | `By` | Average request size in bytes. | | |
This would probably be best represented with a histogram to allow a distribution, and then a rate counter is not needed.
Another possibility would be to only report `messaging.kafka.brokers.network.io` with a direction attribute, or to use a `messaging.kafka.brokers.request.bytes` counter counting all the bytes.
Hi @lmolkova , is there an example of a histogram metric available and how that would be defined?
You can find examples throughout this repo; for example, take a look at `http.client.request.duration`. There is a lot of information in the spec repo; also take a look at the metric docs on opentelemetry.io.
docs/messaging/kafka.md
Outdated
| messaging.kafka.brokers.requests.latency | Gauge | Double | ms | `{ms}` | Average Request latency in ms. | | |
| messaging.kafka.brokers.requests.rate | Gauge | Double | requests per second | `{request}/s`| Average request rate per second. | | |
| messaging.kafka.brokers.requsts.size | Gauge | Double | bytes | `By` | Average request size in bytes. | | |
| messaging.kafka.brokers.responses.rate | Gauge | Double | responses per second| `{response}/s`| Average response rate per second. | | |
same comment as on request rate
docs/messaging/kafka.md
Outdated
| messaging.kafka.brokers.requests.rate | Gauge | Double | requests per second | `{request}/s`| Average request rate per second. | | |
| messaging.kafka.brokers.requsts.size | Gauge | Double | bytes | `By` | Average request size in bytes. | | |
| messaging.kafka.brokers.responses.rate | Gauge | Double | responses per second| `{response}/s`| Average response rate per second. | | |
| messaging.kafka.brokers.response_size | Gauge | Double | bytes | `By` | Average response size in bytes. | | |
same comment as on request size
docs/messaging/kafka.md
Outdated
| messaging.kafka.brokers.requsts.size | Gauge | Double | bytes | `By` | Average request size in bytes. | | |
| messaging.kafka.brokers.responses.rate | Gauge | Double | responses per second| `{response}/s`| Average response rate per second. | | |
| messaging.kafka.brokers.response_size | Gauge | Double | bytes | `By` | Average response size in bytes. | | |
| messaging.kafka.brokers.requests.in.flight | Gauge | Int64 | requests | `{request}` | Requests in flight. | | |
Maybe `messaging.kafka.brokers.active_requests`, to stay consistent with the HTTP semantic conventions.
docs/messaging/kafka.md
Outdated
**Description:** Kafka Broker level metrics.

| Name | Instrument | Value type | Unit | Unit ([UCUM](/docs/general/metrics.md#instrument-units)) | Description | Attribute Key | Attribute Values |
I don't see any attributes - are there any? We should have at least the standard `service.*` ones.
I would suggest using the broker's id (which I would call "node id" instead of "broker id" following Kafka best practice).
docs/messaging/kafka.md
Outdated
| ---------------------------------------------| ------------- | ---------- | ------ | -------------------------------------------- | -------------- | ------------- | ---------------- |
| messaging.kafka.brokers.count | UpDownCounter | Int64 | brokers | `{broker}` | sum of brokers in the cluster | | |
| messaging.kafka.brokers.consumer.fetch.rate | Gauge | Double | fetches per second | `{fetch}/s` | Average consumer fetch Rate. | `state` | `in`, `out` |
| messaging.kafka.brokers.network.io | Counter | Int64 | bytes | `By` | The bytes received or sent by the broker. | | |
Wouldn't we want to measure bytes sent and bytes received separately?
That was the original thought, but it was changed. I think changing it back makes sense; the PR I have for the Kafka collector keeps them separate.
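One way to keep sent and received bytes separate without splitting the metric in two is a single counter partitioned by a direction attribute, as mentioned earlier in the review. The sketch below uses plain Python rather than the OTel SDK; the attribute values `sent` and `received`, the broker id, and the byte counts are hypothetical.

```python
from collections import Counter

# Hypothetical direction attribute values, chosen only for illustration.
DIRECTIONS = ("sent", "received")

# One logical counter, keyed by (broker_id, direction).
network_io: Counter = Counter()


def add_bytes(broker_id: str, direction: str, num_bytes: int) -> None:
    """Add bytes to the counter series for one broker and one direction."""
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction}")
    if num_bytes < 0:
        raise ValueError("a counter only accepts non-negative increments")
    network_io[(broker_id, direction)] += num_bytes


def total_bytes(broker_id: str) -> int:
    """The combined total remains available by summing over directions."""
    return sum(network_io[(broker_id, d)] for d in DIRECTIONS)


add_bytes("broker-0", "sent", 1500)
add_bytes("broker-0", "received", 700)
add_bytes("broker-0", "sent", 500)

sent = network_io[("broker-0", "sent")]
received = network_io[("broker-0", "received")]
total = total_bytes("broker-0")
```

The per-direction series answer the question raised in the review, and the combined figure that the original single counter reported can still be recovered by aggregation.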
Hi @AndrewJSchofield @lmolkova @dmitryax. I made some updates and pushed the commit. It looks correct to me now; let me know if I've missed anything.
One other follow-up: I don't think that node and broker are interchangeable, and since this is specific to broker metrics, I'm inclined to say we should leave it as broker.id. I did add the attribute to everything except the broker count.
Sorry for the delayed response to this. This is for a very specific integration which uses sarama metrics, and it has some limitations. Since that is the case, I'm not sure that making significant updates to these metrics makes sense, because they aren't possible with the one integration so far that will be using them. The broker related metrics for example don't use
In a semantic conventions SIG meeting an agreement was reached to remove Kafka broker metrics from the semantic conventions. See #338, which removes the PR and gives a list of reasons. @jcountsNR, if you have any opinions please weigh in on #338. Otherwise let's close this PR when/if #338 is merged.
@pyohannes I'm good to close it, I don't think it's valid anymore.