Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KMinion v2.2.1 is segfaulting when serving /metrics #177

Closed
hhromic opened this issue Dec 9, 2022 · 4 comments · Fixed by #180
Closed

KMinion v2.2.1 is segfaulting when serving /metrics #177

hhromic opened this issue Dec 9, 2022 · 4 comments · Fixed by #180

Comments

@hhromic
Copy link
Contributor

hhromic commented Dec 9, 2022

First of all, thank you very much for KMinion. It is truly a useful diagnostics tool!

I just noticed a new version of KMinion released (v2.2.1) and I went to test it to migrate our existing v2.2.0 deployment.
Unfortunately, the new version of KMinion is segfaulting whenever a client requests its /metrics endpoint.
I tested this with two different Confluent Kafka clusters running version 6.2.7 and get the same result.

In one terminal, KMinion is started like this:

$ docker run --rm -p 8080:8080 -e KAFKA_BROKERS=kafka:9092 vectorized/kminion:v2.2.1
{"level":"info","ts":1670591408.6039164,"caller":"app/config.go:62","msg":"the env variable 'CONFIG_FILEPATH' is not set, therefore no YAML config will be loaded"}
{"level":"info","ts":"2022-12-09T13:10:08.604Z","logger":"main","msg":"started kminion","version":"v2.2.1","built_at":"2022-11-24"}
{"level":"info","ts":"2022-12-09T13:10:08.604Z","logger":"main","msg":"connecting to Kafka seed brokers, trying to fetch cluster metadata","seed_brokers":"kafka:9092"}
{"level":"info","ts":"2022-12-09T13:10:10.248Z","logger":"main","msg":"successfully connected to kafka cluster"}

When the /metrics endpoint is requested in another terminal, KMinion crashes after few seconds with the below error:

$ curl localhost:8080/metrics
curl: (52) Empty reply from server

Error on the KMinion terminal:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x7de84d]

goroutine 73 [running]:
github.com/twmb/franz-go/pkg/kgo.(*describeGroupsSharder).shard(0x800?, {0xcf0750?, 0xc000460100?}, {0xcf4828?, 0xc00001a9c0}, {0x7f2688649fff?, 0x429ea5?})
        /go/pkg/mod/github.com/twmb/franz-go@v1.10.0/pkg/kgo/client.go:2544 +0xa2d
github.com/twmb/franz-go/pkg/kgo.(*Client).handleShardedReq.func2({0x0, {0xcf4828, 0xc00001a9c0}, {0x0, 0x0}})
        /go/pkg/mod/github.com/twmb/franz-go@v1.10.0/pkg/kgo/client.go:1730 +0x151
github.com/twmb/franz-go/pkg/kgo.(*Client).handleShardedReq(0xc000278000, {0xcf0750?, 0xc000460100}, {0xcf4828?, 0xc00001a9c0})
        /go/pkg/mod/github.com/twmb/franz-go@v1.10.0/pkg/kgo/client.go:1821 +0x9f9
github.com/twmb/franz-go/pkg/kgo.(*Client).shardedRequest(0xc000278000, {0xcf07f8?, 0xc00011ac60?}, {0xcf4828?, 0xc00001a9c0})
        /go/pkg/mod/github.com/twmb/franz-go@v1.10.0/pkg/kgo/client.go:978 +0x691
github.com/twmb/franz-go/pkg/kgo.(*Client).RequestSharded(0xc000494000?, {0xcf07f8?, 0xc00011ac60?}, {0xcf4828?, 0xc00001a9c0?})
        /go/pkg/mod/github.com/twmb/franz-go@v1.10.0/pkg/kgo/client.go:909 +0x3a
github.com/cloudhut/kminion/v2/minion.(*Service).DescribeConsumerGroups(0xc000494000, {0xcf07f8, 0xc00011ac60})
        /app/minion/describe_consumer_groups.go:70 +0x15b
github.com/cloudhut/kminion/v2/prometheus.(*Exporter).collectConsumerGroups(0xc000360000, {0xcf07f8?, 0xc00011ac60?}, 0xcea010?)
        /app/prometheus/collect_consumer_groups.go:18 +0x5e
github.com/cloudhut/kminion/v2/prometheus.(*Exporter).Collect(0xc000360000, 0xc000392f60?)
        /app/prometheus/exporter.go:235 +0x205
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
        /go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/registry.go:456 +0x10d
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
        /go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/registry.go:467 +0x55d

Please let me know of any information you might need to help debugging this issue.

@weeco
Copy link
Contributor

weeco commented Dec 9, 2022

Hey @hhromic ,
thanks for filing the issue. We are not able to reproduce this issue and we are using this KMinion version for several Redpanda clusters.

Could you share the configuration you are using? Are you able to provide a docker compose where this is reproducible? Thanks!

@hhromic
Copy link
Contributor Author

hhromic commented Dec 12, 2022

Hey @weeco , thanks for the fast reply and sorry for not replying earlier.
I tried to construct a minimum reproducible example in a simple Docker compose file but (of course) all is working fine on a fresh Kafka cluster deployment. This crash is likely a corner case in our current Kafka cluster state.

I will try to debug a local copy of kminion given on how easy is for me to reproduce the crash with our current Kafka clusters. The crash happens actually in franz-go and not in kminion itself, so I likely will start debugging there first.

Maybe a fix is due for franz-go and kminion would just need to bump the dependency on that library. Will keep you posted on what I manage to find. Cheers!

@hhromic
Copy link
Contributor Author

hhromic commented Dec 12, 2022

Good news, I found the problem now. Confluent's Control Center has the (bad) habit of using a consumer group with an empty-string as name/ID. franz-go is crashing on this for-loop when it encounters this empty-named group:

    for _, group := range req.Groups {
        berr := coordinators[group]
        var ke *kerr.Error
        switch {
        case berr.err == nil:
            brokerReq := brokerReqs[berr.b.meta.NodeID]   // crash is here
            if brokerReq == nil {
                brokerReq = newReq()
                brokerReqs[berr.b.meta.NodeID] = brokerReq
            }
            brokerReq.Groups = append(brokerReq.Groups, group)
        case errors.As(berr.err, &ke):
            kerrs[ke] = append(kerrs[ke], group)
        default:
            unkerrs = append(unkerrs, unkerr{berr.err, group})
        }
    }

I can confirm that when I delete the empty-string named consumer group, kminion does not crash anymore.
This is obviously a bug in franz-go and not kminion, therefore I will continue this issue in their repository.
I will leave this issue here open to notify you when this is fixed and released upstream.

@hhromic
Copy link
Contributor Author

hhromic commented Dec 13, 2022

@weeco upstream franz-go fixed the issue now (very quickly) :)
I re-built KMinion locally with latest franz-go version v1.10.4 and I can confirm no more crashes now with our cluster :)
All is working as expected again:

$ curl -s localhost:8080/metrics | grep -F 'group_id=""'
kminion_kafka_consumer_group_info{coordinator_id="2",group_id="",protocol="",protocol_type="",state="Empty"} 0
kminion_kafka_consumer_group_members{group_id=""} 0
kminion_kafka_consumer_group_topic_lag{group_id="",topic_name="test-topic"} 0
kminion_kafka_consumer_group_topic_offset_sum{group_id="",topic_name="test-topic"} 5.478263e+06
kminion_kafka_consumer_group_topic_partition_lag{group_id="",partition_id="0",topic_name="test-topic"} 0

So all is needed now should be for kminion to bump franz-go and this should be fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants