Extract consumer group member_count as a separate metric #105

Merged
merged 1 commit into from
Oct 12, 2021

@amuraru amuraru commented Jul 31, 2021

Added a new consumer_group_members metric to capture the number of members
in a consumer group, and removed the member_count label from consumer_group_info.

Fixes #103
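
The description does not spell out why the label was dropped. One likely reason is that Prometheus treats every distinct label set as its own time series, so a value that changes over time creates a new series each time it changes when it is carried as a label. The sketch below illustrates that effect with the standard library only; the `group_id` label, the `orders` group name, and the helper functions are hypothetical and are not taken from kminion.

```go
package main

import (
	"fmt"
	"sort"
)

// seriesKey builds a Prometheus-style series identity from a metric name
// and its labels (labels are sorted so the key is deterministic).
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	s := name + "{"
	for i, k := range keys {
		if i > 0 {
			s += ","
		}
		s += fmt.Sprintf("%s=%q", k, labels[k])
	}
	return s + "}"
}

// distinctSeries counts the series identities seen across scrapes for one
// group whose member count changes over time. The group_id label and the
// "orders" value are assumptions made for illustration.
func distinctSeries(memberCounts []int, countAsLabel bool) int {
	seen := map[string]struct{}{}
	for _, n := range memberCounts {
		labels := map[string]string{"group_id": "orders"}
		name := "consumer_group_members"
		if countAsLabel {
			name = "consumer_group_info"
			labels["member_count"] = fmt.Sprint(n)
		}
		seen[seriesKey(name, labels)] = struct{}{}
	}
	return len(seen)
}

func main() {
	scrapes := []int{3, 4, 3, 5, 0}
	fmt.Println(distinctSeries(scrapes, true))  // member count carried as a label
	fmt.Println(distinctSeries(scrapes, false)) // member count carried as the sample value
}
```

With the count as a label, each new value starts another series; with the count as the sample value, the group keeps a single series.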

@amuraru amuraru force-pushed the issue-103 branch 2 times, most recently from 06332a8 to d88a513 Compare August 1, 2021 22:56
group.Protocol,
group.ProtocolType,
group.State,
strconv.FormatInt(int64(coordinator), 10),
)
// total number of members in consumer groups
if len(group.Members) > 0 {
Contributor

Can you explain that condition? Why don't we report 0 for groups that don't have any active members?

Contributor Author

I thought we could reduce the number of metrics with this.
Empty groups can already be monitored with consumer_group_info{state="Empty"}

Contributor

What I'm afraid of is that when you graph a consumer group's member count, you may see gaps or inaccurate results because the series for these groups go stale. For instance, imagine a job that does certain things periodically (like catching up on the topic lag and then sleeping for 10 minutes before repeating). That graph would look weird or inaccurate if we suddenly stopped reporting a number for the group when it switches state.
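
A minimal sketch of the concern, using only the standard library: it mimics a few scrapes of a periodic job and counts how many of them expose no sample when empty groups are skipped. The types, function names, and the sequence of member counts are hypothetical; only the guard itself comes from the diff above.

```go
package main

import "fmt"

// sample is one scrape result; ok is false when no sample was exposed.
type sample struct {
	value int
	ok    bool
}

// exposeMembers mimics one scrape per entry in memberCounts.
// With skipEmpty set, scrapes where the group has no members expose nothing,
// which mirrors the `if len(group.Members) > 0` guard under review.
func exposeMembers(memberCounts []int, skipEmpty bool) []sample {
	out := make([]sample, 0, len(memberCounts))
	for _, n := range memberCounts {
		if skipEmpty && n == 0 {
			out = append(out, sample{}) // gap: nothing to plot for this scrape
			continue
		}
		out = append(out, sample{value: n, ok: true})
	}
	return out
}

// countGaps returns how many scrapes produced no sample.
func countGaps(samples []sample) int {
	gaps := 0
	for _, s := range samples {
		if !s.ok {
			gaps++
		}
	}
	return gaps
}

func main() {
	// Hypothetical periodic job: active, asleep for two scrapes, active again.
	counts := []int{2, 0, 0, 2}
	fmt.Println(countGaps(exposeMembers(counts, true)))  // guard in place
	fmt.Println(countGaps(exposeMembers(counts, false))) // zeros reported
}
```

With the guard in place, the two sleeping scrapes have no data point at all, so a graph has to interpolate or show a hole; without it, the series stays continuous at 0.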

Contributor Author

The core reason for not exposing 0 members for a group is to reduce the number of samples scraped by Prometheus, Martin.
As you know, Prometheus holds recently added samples for active time series in RAM, so its RAM usage depends heavily on the number of active time series. On the flip side, sparse time series can lead to high churn in Prometheus when scraping.
Now, a consumer group can have 0 members when it is in the Empty or Dead state, which usually lasts ~7 days after the application abandoned the group. Reporting metrics for such a long period is useless in my opinion and could put extra pressure on Prometheus (both scrape and memory), and that's why I preferred to avoid sending 0-valued metrics.

To your example of a recurring batch job using the consumer group: the graph (if using Grafana) can be "corrected" using null-to-0 functions.

wdyt

Contributor Author

Btw this is the same reasoning as in #102 :)
To give kminion users options, I could try to add a new config parameter, say activeGroupsOnly (similar to the allow/deny list), that controls whether these metrics are reported for all groups (the default) or only for active (Stable) groups.
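
As a rough illustration of how such a switch could behave, here is a sketch in plain Go. Everything in it is an assumption: `activeGroupsOnly` is only the name floated in this comment, and the config struct, group type, and filter function are hypothetical rather than part of kminion.

```go
package main

import "fmt"

// exporterConfig is a hypothetical config struct. ActiveGroupsOnly mirrors
// the parameter name proposed in the comment; false keeps the default of
// reporting every group.
type exporterConfig struct {
	ActiveGroupsOnly bool
}

// consumerGroup is a simplified stand-in for a described consumer group.
type consumerGroup struct {
	ID      string
	State   string
	Members int
}

// reportedGroups returns the groups whose member count would be exported.
// When ActiveGroupsOnly is set, only groups in the Stable state pass.
func reportedGroups(cfg exporterConfig, groups []consumerGroup) []consumerGroup {
	out := make([]consumerGroup, 0, len(groups))
	for _, g := range groups {
		if cfg.ActiveGroupsOnly && g.State != "Stable" {
			continue
		}
		out = append(out, g)
	}
	return out
}

func main() {
	groups := []consumerGroup{
		{ID: "orders", State: "Stable", Members: 3},
		{ID: "one-off-job", State: "Empty", Members: 0},
		{ID: "old-app", State: "Dead", Members: 0},
	}
	fmt.Println(len(reportedGroups(exporterConfig{}, groups)))
	fmt.Println(len(reportedGroups(exporterConfig{ActiveGroupsOnly: true}, groups)))
}
```

Keeping the zero value of the flag as "report everything" means existing setups would see no change unless they opt in.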

Contributor

What you say makes sense, and I'm also often worried about high RAM usage. In this case I believe it can be neglected, as the number of time series would only scale with the number of consumer groups. You said you have a lot of consumer groups, which I'd interpret as 1k-3k? I'd assume the majority of these are in a stable state then.

I think a few thousand additional metric series can be neglected, and I'd rather have kminion report the member count for all groups (especially because users would probably expect it for all consumer groups and not just for certain states).

If we don't want to make a trade-off, we could introduce further configuration options that let users choose what they are interested in. Some may not be interested in the consumer group member count at all. I'm fine with adding configuration options, as kminion is supposed to be a flexible and very feature-rich Prometheus exporter for Kafka.

Contributor Author

> You said you have a lot of consumer groups, which I'd interpret as 1k-3k? I'd assume the majority of these are in a stable state then.

In fact, in our environment I have ~30k consumer groups, and only ~10% are active/stable. The rest are Empty, as they are one-off consumer groups created by ephemeral jobs that generate unique consumer group names.

I'll remove the condition here and file a new issue to control this behaviour through a well-defined configuration parameter.
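
To put the figures from this comment side by side, the following sketch does the arithmetic. It assumes exactly one member-count series per group and takes the approximate numbers (~30k groups, ~10% active) at face value; the function name is hypothetical.

```go
package main

import "fmt"

// splitGroups divides a total number of groups into active and inactive
// ones, using an integer percentage of active groups.
func splitGroups(total, activePercent int) (active, inactive int) {
	active = total * activePercent / 100
	return active, total - active
}

func main() {
	// Approximate figures quoted in the comment above.
	active, inactive := splitGroups(30000, 10)
	fmt.Printf("active=%d inactive=%d\n", active, inactive)
}
```

Under those assumptions, removing the guard adds roughly 27,000 zero-valued series on top of about 3,000 series for active groups.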

Contributor Author

Done

Contributor Author

Filed #108


amuraru commented Sep 1, 2021

@weeco kind reminder to review this PR. thanks :)

Added a new consumer_group_members metric to capture number of members
in a consumer group and removed the `member_count` label from consumer_group_info

Fixes redpanda-data#103

amuraru commented Oct 11, 2021

@weeco Rebased on master branch and fixed the conflicts.
Would you please review this?


weeco commented Oct 12, 2021

Thanks for your PR and the patience @amuraru

@weeco weeco merged commit 1dd0306 into redpanda-data:master Oct 12, 2021
@amuraru amuraru deleted the issue-103 branch October 12, 2021 22:17