
ingester's compatibility with kafka 2.3 #1819

Closed
chopper-poet opened this issue Sep 26, 2019 · 14 comments
Labels: help wanted (Features that maintainers are willing to accept but do not have cycles to implement), storage/kafka

Comments

@chopper-poet

When the ingester runs with more than one instance, it continuously prints the messages below.
[screenshot: repeated rebalance log messages]

Environment info:
ingester:1.14.0
kafka:2.3

Error messages:
====ingester====
2019/09/21 10:42:18 maxprocs: Leaving GOMAXPROCS=1: CPU quota undefined
WARNING: found deprecated command line option --span-storage.type=elasticsearch, please use environment variable SPAN_STORAGE_TYPE instead
{"level":"info","ts":1569033738.0063226,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1569033738.006453,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1569033738.006529,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":14270}
{"level":"info","ts":1569033738.006546,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":14270,"health-status":"unavailable"}
{"level":"info","ts":1569033738.023475,"caller":"healthcheck/handler.go:129","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1569033738.023505,"caller":"consumer/consumer.go:78","msg":"Starting main loop"}
{"level":"info","ts":1569033738.151742,"caller":"consumer/consumer.go:169","msg":"Starting error handler","partition":0}
{"level":"info","ts":1569033738.1518486,"caller":"consumer/consumer.go:112","msg":"Starting message handler","partition":0}
{"level":"info","ts":1569033741.4189322,"caller":"consumer/consumer.go:180","msg":"Finished handling errors","partition":0}
{"level":"info","ts":1569033741.4189994,"caller":"consumer/consumer.go:138","msg":"Message channel closed. ","partition":0}
{"level":"info","ts":1569033741.4190173,"caller":"consumer/consumer.go:162","msg":"Closing partition consumer","partition":0}
{"level":"info","ts":1569033741.4191213,"caller":"consumer/consumer.go:165","msg":"Closed partition consumer","partition":0}
{"level":"info","ts":1569033741.8808475,"caller":"consumer/consumer.go:169","msg":"Starting error handler","partition":0}
{"level":"info","ts":1569033741.8809078,"caller":"consumer/consumer.go:112","msg":"Starting message handler","partition":0}
{"level":"info","ts":1569033744.2168658,"caller":"consumer/processor_factory.go:62","msg":"Creating new processors","partition":0}
{"level":"info","ts":1569033745.9766564,"caller":"consumer/consumer.go:180","msg":"Finished handling errors","partition":0}
{"level":"info","ts":1569033745.9767509,"caller":"consumer/consumer.go:138","msg":"Message channel closed. ","partition":0}
{"level":"info","ts":1569033745.9780123,"caller":"processor/parallel_processor.go:78","msg":"Completed shutdown of processor goroutines"}
{"level":"info","ts":1569033745.9780428,"caller":"consumer/consumer.go:162","msg":"Closing partition consumer","partition":0}
{"level":"info","ts":1569033745.9780908,"caller":"consumer/consumer.go:165","msg":"Closed partition consumer","partition":0}
{"level":"info","ts":1569033746.7944613,"caller":"consumer/consumer.go:169","msg":"Starting error handler","partition":0}
{"level":"info","ts":1569033746.7945256,"caller":"consumer/consumer.go:112","msg":"Starting message handler","partition":0}
{"level":"info","ts":1569033750.0735219,"caller":"consumer/consumer.go:180","msg":"Finished handling errors","partition":0}
{"level":"info","ts":1569033750.07358,"caller":"consumer/consumer.go:138","msg":"Message channel closed. ","partition":0}
{"level":"info","ts":1569033750.0735936,"caller":"consumer/consumer.go:162","msg":"Closing partition consumer","partition":0}
{"level":"info","ts":1569033750.0736668,"caller":"consumer/consumer.go:165","msg":"Closed partition consumer","partition":0}
{"level":"info","ts":1569033750.5355024,"caller":"consumer/consumer.go:169","msg":"Starting error handler","partition":0}
{"level":"info","ts":1569033750.535565,"caller":"consumer/consumer.go:112","msg":"Starting message handler","partition":0}
{"level":"info","ts":1569033753.813964,"caller":"consumer/consumer.go:180","msg":"Finished handling errors","partition":0}
{"level":"info","ts":1569033753.8140304,"caller":"consumer/consumer.go:138","msg":"Message channel closed. ","partition":0}
{"level":"info","ts":1569033753.8140442,"caller":"consumer/consumer.go:162","msg":"Closing partition consumer","partition":0}
{"level":"info","ts":1569033753.8141136,"caller":"consumer/consumer.go:165","msg":"Closed partition consumer","partition":0}
{"level":"info","ts":1569033754.2764068,"caller":"consumer/consumer.go:169","msg":"Starting error handler","partition":0}
{"level":"info","ts":1569033754.2764654,"caller":"consumer/consumer.go:112","msg":"Starting message handler","partition":0}
^C{"level":"info","ts":1569033755.513566,"caller":"flags/service.go:145","msg":"Shutting down"}
{"level":"info","ts":1569033755.5136282,"caller":"healthcheck/handler.go:129","msg":"Health Check state change","status":"unavailable"}
{"level":"info","ts":1569033755.5136454,"caller":"consumer/consumer.go:162","msg":"Closing partition consumer","partition":0}
{"level":"info","ts":1569033755.5326853,"caller":"consumer/consumer.go:180","msg":"Finished handling errors","partition":0}
{"level":"info","ts":1569033755.5327137,"caller":"consumer/consumer.go:138","msg":"Message channel closed. ","partition":0}
{"level":"info","ts":1569033755.532727,"caller":"consumer/consumer.go:162","msg":"Closing partition consumer","partition":0}
{"level":"info","ts":1569033755.5328152,"caller":"consumer/consumer.go:165","msg":"Closed partition consumer","partition":0}
{"level":"info","ts":1569033755.5328388,"caller":"consumer/consumer.go:165","msg":"Closed partition consumer","partition":0}
{"level":"info","ts":1569033755.53285,"caller":"consumer/consumer.go:107","msg":"Closing parent consumer"}
{"level":"info","ts":1569033755.5365846,"caller":"flags/service.go:153","msg":"Shutdown complete"}

====ingester====

====kafka server.log===
[2019-09-21 10:42:08,286] INFO [GroupCoordinator 188]: Preparing to rebalance group jaeger-ingester-20190906 in state PreparingRebalance with old generation 1689 (__consumer_offsets-38) (reason: Adding new member jaeger-ingester-e9c21519-a721-4745-aa80-c6d2aee64231 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:08,290] INFO [GroupCoordinator 188]: Stabilized group jaeger-ingester-20190906 generation 1690 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:08,293] INFO [GroupCoordinator 188]: Assignment received from leader for group jaeger-ingester-20190906 for generation 1690 (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:18,130] INFO [GroupCoordinator 188]: Preparing to rebalance group jaeger-ingester-20190906 in state PreparingRebalance with old generation 1690 (__consumer_offsets-38) (reason: Adding new member jaeger-ingester-0eb85261-1de8-4f6f-b53a-29a2944c5765 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:18,147] INFO [GroupCoordinator 188]: Stabilized group jaeger-ingester-20190906 generation 1691 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:18,148] INFO [GroupCoordinator 188]: Assignment received from leader for group jaeger-ingester-20190906 for generation 1691 (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:21,124] INFO [GroupCoordinator 188]: Preparing to rebalance group jaeger-ingester-20190906 in state PreparingRebalance with old generation 1691 (__consumer_offsets-38) (reason: Adding new member jaeger-ingester-a30be48c-7c7c-4011-9c83-071ee7140cb6 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:21,134] INFO [GroupCoordinator 188]: Stabilized group jaeger-ingester-20190906 generation 1692 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:21,136] INFO [GroupCoordinator 188]: Assignment received from leader for group jaeger-ingester-20190906 for generation 1692 (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:21,876] INFO [GroupCoordinator 188]: Preparing to rebalance group jaeger-ingester-20190906 in state PreparingRebalance with old generation 1692 (__consumer_offsets-38) (reason: Adding new member jaeger-ingester-0bb14181-b214-476f-a77e-cab9dbaa08d7 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:21,877] INFO [GroupCoordinator 188]: Stabilized group jaeger-ingester-20190906 generation 1693 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:21,877] INFO [GroupCoordinator 188]: Assignment received from leader for group jaeger-ingester-20190906 for generation 1693 (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:24,674] INFO [GroupCoordinator 188]: Preparing to rebalance group jaeger-ingester-20190906 in state PreparingRebalance with old generation 1693 (__consumer_offsets-38) (reason: Adding new member jaeger-ingester-e1043d6a-0134-4ad3-af1f-c0285b355e0d with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:24,675] INFO [GroupCoordinator 188]: Stabilized group jaeger-ingester-20190906 generation 1694 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:24,676] INFO [GroupCoordinator 188]: Assignment received from leader for group jaeger-ingester-20190906 for generation 1694 (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:26,790] INFO [GroupCoordinator 188]: Preparing to rebalance group jaeger-ingester-20190906 in state PreparingRebalance with old generation 1694 (__consumer_offsets-38) (reason: Adding new member jaeger-ingester-7231aa34-d430-42d7-bfd0-ffbcb87de626 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:26,791] INFO [GroupCoordinator 188]: Stabilized group jaeger-ingester-20190906 generation 1695 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:26,791] INFO [GroupCoordinator 188]: Assignment received from leader for group jaeger-ingester-20190906 for generation 1695 (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:28,408] INFO [GroupCoordinator 188]: Preparing to rebalance group jaeger-ingester-20190906 in state PreparingRebalance with old generation 1695 (__consumer_offsets-38) (reason: Adding new member jaeger-ingester-cc9fa2c5-cbbf-4b11-956a-a2b2baf51513 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:28,408] INFO [GroupCoordinator 188]: Stabilized group jaeger-ingester-20190906 generation 1696 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:28,437] INFO [GroupCoordinator 188]: Assignment received from leader for group jaeger-ingester-20190906 for generation 1696 (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:30,531] INFO [GroupCoordinator 188]: Preparing to rebalance group jaeger-ingester-20190906 in state PreparingRebalance with old generation 1696 (__consumer_offsets-38) (reason: Adding new member jaeger-ingester-56a13689-d051-4e97-8819-1dd028a2dff3 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:30,532] INFO [GroupCoordinator 188]: Stabilized group jaeger-ingester-20190906 generation 1697 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:30,532] INFO [GroupCoordinator 188]: Assignment received from leader for group jaeger-ingester-20190906 for generation 1697 (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:32,178] INFO [GroupCoordinator 188]: Preparing to rebalance group jaeger-ingester-20190906 in state PreparingRebalance with old generation 1697 (__consumer_offsets-38) (reason: Adding new member jaeger-ingester-05ed8b29-cda5-4fb4-a730-bb206d25f66a with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:32,179] INFO [GroupCoordinator 188]: Stabilized group jaeger-ingester-20190906 generation 1698 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:32,180] INFO [GroupCoordinator 188]: Assignment received from leader for group jaeger-ingester-20190906 for generation 1698 (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:34,270] INFO [GroupCoordinator 188]: Preparing to rebalance group jaeger-ingester-20190906 in state PreparingRebalance with old generation 1698 (__consumer_offsets-38) (reason: Adding new member jaeger-ingester-0b4bcbb0-5cbc-49f1-a93d-fc0b211e9000 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:34,271] INFO [GroupCoordinator 188]: Stabilized group jaeger-ingester-20190906 generation 1699 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:34,272] INFO [GroupCoordinator 188]: Assignment received from leader for group jaeger-ingester-20190906 for generation 1699 (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:35,534] INFO [GroupCoordinator 188]: Member jaeger-ingester-0b4bcbb0-5cbc-49f1-a93d-fc0b211e9000 in group jaeger-ingester-20190906 has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:35,534] INFO [GroupCoordinator 188]: Preparing to rebalance group jaeger-ingester-20190906 in state PreparingRebalance with old generation 1699 (__consumer_offsets-38) (reason: removing member jaeger-ingester-0b4bcbb0-5cbc-49f1-a93d-fc0b211e9000 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:35,534] INFO [GroupCoordinator 188]: Group jaeger-ingester-20190906 with generation 1700 is now empty (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:35,914] INFO [GroupCoordinator 188]: Preparing to rebalance group jaeger-ingester-20190906 in state PreparingRebalance with old generation 1700 (__consumer_offsets-38) (reason: Adding new member jaeger-ingester-1f04fc99-d91b-4559-a1d5-507bdaa0010d with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:35,915] INFO [GroupCoordinator 188]: Stabilized group jaeger-ingester-20190906 generation 1701 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:35,916] INFO [GroupCoordinator 188]: Assignment received from leader for group jaeger-ingester-20190906 for generation 1701 (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:39,191] INFO [GroupCoordinator 188]: Member jaeger-ingester-1f04fc99-d91b-4559-a1d5-507bdaa0010d in group jaeger-ingester-20190906 has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:39,192] INFO [GroupCoordinator 188]: Preparing to rebalance group jaeger-ingester-20190906 in state PreparingRebalance with old generation 1701 (__consumer_offsets-38) (reason: removing member jaeger-ingester-1f04fc99-d91b-4559-a1d5-507bdaa0010d on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:42:39,192] INFO [GroupCoordinator 188]: Group jaeger-ingester-20190906 with generation 1702 is now empty (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2019-09-21 10:47:53,329] INFO [GroupMetadataManager brokerId=188] Removed 0 expired offsets in 11 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
====kafka server.log===

The rebalances are unbelievably frequent.

P.S.: with Kafka 2.2.1 this problem does not occur.

@chopper-poet
Author

@yurishkuro @pavolloffay
please take a look at this, thanks.

@yurishkuro added the help wanted label (Features that maintainers are willing to accept but do not have cycles to implement) on Oct 13, 2019
@vprithvi
Contributor

This is a duplicate of #1774

@vprithvi
Contributor

Apart from the logs (which look normal), what user-facing issue does this cause?

@chopper-poet
Author

> Apart from the logs (which look normal), what user-facing issue does this cause?

The rebalances are unbelievably frequent.

@yurishkuro
Member

It looks to me that the rebalances are initiated by the Kafka server. The only condition I know of where the ingester itself would cause a rebalance is if its deadlock detector kicks in and closes a partition due to inactivity, but I don't see those logs in the output.
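
For context, the inactivity watchdog described here follows roughly this pattern. This is a minimal sketch, not Jaeger's actual code; the closePartition callback and all names are hypothetical:

```go
package ingester

import (
	"log"
	"sync/atomic"
	"time"
)

// deadlockDetector closes a partition consumer when no message has been
// processed within maxInactivity.
type deadlockDetector struct {
	msgCount       int64
	maxInactivity  time.Duration
	closePartition func() // hypothetical callback that closes the partition consumer
}

// recordMessage is called by the message handler for every message consumed.
func (d *deadlockDetector) recordMessage() {
	atomic.AddInt64(&d.msgCount, 1)
}

// start watches for inactivity until done is closed.
func (d *deadlockDetector) start(done <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(d.maxInactivity)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				// No messages since the last tick: assume the partition is
				// wedged and close it, which itself triggers a rebalance.
				if atomic.SwapInt64(&d.msgCount, 0) == 0 {
					log.Println("deadlock detected, closing partition")
					d.closePartition()
					return
				}
			}
		}
	}()
}
```

A detector firing would log a partition close before each rebalance, which is exactly what is missing from the output above.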

@yurishkuro
Member

I wonder if newer Kafka implements some heartbeat to the consumers that is not supported by the sarama driver; that might cause the server to rebalance consumers.
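
For reference, sarama does expose consumer-group heartbeat and session knobs; whether sarama-cluster wires them correctly against a 2.3 coordinator is the open question. A sketch of the relevant settings, assuming a recent sarama release:

```go
package ingester

import (
	"time"

	"github.com/Shopify/sarama"
)

// newConsumerConfig shows the group-liveness knobs sarama exposes.
func newConsumerConfig() *sarama.Config {
	cfg := sarama.NewConfig()
	// Speak the broker's protocol version explicitly; older defaults may
	// negotiate behavior the 2.3 group coordinator treats differently.
	cfg.Version = sarama.V2_3_0_0
	// If heartbeats stop arriving within the session timeout, the broker
	// evicts the member and rebalances the group.
	cfg.Consumer.Group.Heartbeat.Interval = 3 * time.Second
	cfg.Consumer.Group.Session.Timeout = 10 * time.Second
	return cfg
}
```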

@vprithvi
Contributor

Also, it would be useful to have the following information:

  1. How frequently do you see these rebalances?
  2. What is the read throughput?
  3. What is the resource utilization on your containers? Are they oversubscribed?
  4. What is the topic configuration?

@chopper-poet
Author

> Also, it would be useful to have the following information:
>
>   1. How frequently do you see these rebalances?
>   2. What is the read throughput?
>   3. What is the resource utilization on your containers? Are they oversubscribed?
>   4. What is the topic configuration?

A1: every ~3 seconds.
A2-4: container load is normal.

@kjetijor

We noticed the same with multiple instances of jaeger-ingester consuming from Kafka. Partition ownership seems to cycle through the individual ingesters, and this can happen before offsets have been committed, which may lead to "some amount" of re-consumption of the same messages. Dropping down to a single ingester consuming from Kafka has mostly solved this for us.

Also, sarama-cluster seems to have been deprecated in favor of features merged into sarama.
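
For anyone exploring that migration, sarama's native consumer-group API looks roughly like this. A sketch only, assuming a recent sarama release; the broker address, topic, and group ID are placeholders:

```go
package main

import (
	"context"
	"log"

	"github.com/Shopify/sarama"
)

// handler satisfies sarama.ConsumerGroupHandler, the API that replaced
// sarama-cluster's consumer.
type handler struct{}

func (handler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (handler) Cleanup(sarama.ConsumerGroupSession) error { return nil }

// ConsumeClaim processes one claimed partition until the next rebalance.
func (handler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for msg := range claim.Messages() {
		// ... process the span message here ...
		// Mark the offset for commit; until the commit flushes, a rebalance
		// re-delivers these messages to the partition's new owner.
		sess.MarkMessage(msg, "")
	}
	return nil
}

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_3_0_0
	group, err := sarama.NewConsumerGroup([]string{"kafka:9092"}, "jaeger-ingester", cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer group.Close()
	for {
		// Consume returns whenever the group rebalances; call it in a loop.
		if err := group.Consume(context.Background(), []string{"jaeger-spans"}, handler{}); err != nil {
			log.Fatal(err)
		}
	}
}
```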

@chopper-poet
Author

> We noticed the same with multiple instances of jaeger-ingester consuming from Kafka. Partition ownership seems to cycle through the individual ingesters, and this can happen before offsets have been committed, which may lead to "some amount" of re-consumption of the same messages. Dropping down to a single ingester consuming from Kafka has mostly solved this for us.
>
> Also, sarama-cluster seems to have been deprecated in favor of features merged into sarama.

@yurishkuro @vprithvi
please take note of this.

@yurishkuro
Member

Once again, the rebalancing of partitions is initiated by the Kafka brokers. The broker logs don't show why it might be happening, so we cannot diagnose it. It could be a driver issue (sarama-cluster), or the fact that ingesters do not commit offsets immediately (we have an outstanding ticket #1109 to look into that).

Upgrading to the latest sarama might help, if someone wants to give it a try.
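
On the offset-commit point, the commit flush interval is configurable in sarama, which would shrink (not eliminate) the re-consumption window after a rebalance. A sketch, assuming a recent sarama release; older releases expose Consumer.Offsets.CommitInterval instead of the AutoCommit fields:

```go
package ingester

import (
	"time"

	"github.com/Shopify/sarama"
)

// withFastCommits flushes marked offsets frequently so a rebalance loses
// at most about a second of progress.
func withFastCommits(cfg *sarama.Config) *sarama.Config {
	cfg.Consumer.Offsets.AutoCommit.Enable = true
	cfg.Consumer.Offsets.AutoCommit.Interval = time.Second
	return cfg
}
```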

@pavolloffay
Member

@rubenvp8510 can we close this issue?

Your PR #2009 (comment) notes that the issue was fixed in Kafka 2.3.1:

> It seems that the bug with Kafka was fixed in 2.3.1. I prefer that we go with option (3) for now.

@rubenvp8510
Contributor

@pavolloffay Yes, we can close it.

@pavolloffay
Member

thanks @rubenvp8510
