[SPARK-26350][SS]Allow to override group id of the Kafka consumer #23301
Conversation
cc @tdas
Test build #100048 has finished for PR 23301 at commit
Test build #100050 has finished for PR 23301 at commit
Looks good overall. Some nits, and it might be better to have a stronger test if possible.
s"Kafka option '${ConsumerConfig.GROUP_ID_CONFIG}' is not supported as " +
s"user-specified consumer groups are not used to track offsets.")
logWarning(
s"It is not recommended to set Kafka option 'kafka.${ConsumerConfig.GROUP_ID_CONFIG}'. " +
The long string for the log message is duplicated 3 times. Could we move it out and reuse it, like KafkaSourceProvider.INSTRUCTION_FOR_FAIL_ON_DATA_LOSS_FALSE and KafkaSourceProvider.INSTRUCTION_FOR_FAIL_ON_DATA_LOSS_TRUE?
Good point. Updated.
set the prefix of the automatically generated group.id's via the optional source option `groupIdPrefix`, default value
is "spark-kafka-source".
set the prefix of the automatically generated group.id's via the optional source option `groupIdPrefix`,
default value is "spark-kafka-source". You can also set "kafka.group.id" to force Spark to use a special
nit: same here
@@ -379,7 +379,25 @@ The following configurations are optional:
<td>string</td>
<td>spark-kafka-source</td>
<td>streaming and batch</td>
<td>Prefix of consumer group identifiers (`group.id`) that are generated by structured streaming queries</td>
<td>Prefix of consumer group identifiers (`group.id`) that are generated by structured streaming
queries. If "kafka.group.id" is set, this option will be ignored.</td>
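The documented precedence can be sketched in Python. This is a hypothetical stand-in for the Scala logic in KafkaSourceProvider, not the actual implementation; the exact format of the generated id, and the lower-cased option keys (mirroring how DataSourceOptions exposes options), are assumptions.

```python
import uuid

def resolve_group_id(options):
    # An explicit "kafka.group.id" wins; otherwise a unique group id is
    # generated from the optional prefix, defaulting to "spark-kafka-source".
    if "kafka.group.id" in options:
        return options["kafka.group.id"]
    prefix = options.get("groupidprefix", "spark-kafka-source")
    return f"{prefix}-{uuid.uuid4()}"
```

When "kafka.group.id" is present, `groupIdPrefix` has no effect, which is what the doc change describes.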
nit: Given that the other option is wrapped with backticks, it might be better to follow the same rule for consistency.
We don't have such a rule. See the doc of failOnDataLoss.
Yup, I think I chose my words incorrectly. Many options are wrapped with backticks, so I felt we had an implicit rule on that. Please ignore this if the representation is already inconsistent.
unexpected behavior. Concurrently running queries (both batch and streaming) or sources with the
same group id are likely to interfere with each other, causing each query to read only part of the
data. This may also occur when queries are started/restarted in quick succession. To minimize such
issues, set the Kafka consumer session timeout (by setting option "kafka.session.timeout.ms") to
nit: same here and below line
.as[String]
.map(_.toInt)

testStream(dsKafka)(
Looks like we just run the query and see whether it works with a fixed group id, and I guess the result is not affected by whether the option is applied or not.
Is there any way to verify whether the group.id value is properly set in the Kafka parameters? We can ignore this if there's no way to get it.
Yeah, we don't have an API to check this.
Thanks for explaining. Then looks OK to me.
testUtils.sendMessages(topic, (21 to 30).map(_.toString).toArray, Some(2))

val df = createDF(topic, withOptions = Map("kafka.group.id" -> "custom"))
checkAnswer(df, (1 to 30).map(_.toString).toDF())
Same here.
LGTM
Test build #100115 has finished for PR 23301 at commit
reportDataLoss(s"$deletedPartitions are gone. Some data may have been missed")
val message =
if (kafkaOffsetReader.driverKafkaParams.containsKey(ConsumerConfig.GROUP_ID_CONFIG)) {
s"$deletedPartitions are gone. " + KafkaSourceProvider.CUSTOM_GROUP_ID_ERROR_MESSAGE
nit: I would use string interpolation, though.
Please ignore this if other changes are ready. It just bugged me while reading the code.
Personally either is fine for me (using + or string interpolation), but in this case it might be better to append the error message to the former string, since the former string already uses string interpolation.
Test build #100256 has finished for PR 23301 at commit
retest this please
Test build #100367 has finished for PR 23301 at commit
@@ -534,7 +546,7 @@ private[kafka010] object KafkaSourceProvider extends Logging {
parameters: Map[String, String],
metadataPath: String): String = {
val groupIdPrefix = parameters
.getOrElse("groupIdPrefix", "spark-kafka-source")
.getOrElse(GROUP_ID_PREFIX, "spark-kafka-source")
Here the `parameters` map is not lowercased, but GROUP_ID_PREFIX is lowercase.
Yeah, this is actually a fix. org.apache.spark.sql.sources.v2.DataSourceOptions.asMap returns a map in which all keys are lower case.
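A minimal Python sketch of why the original lookup failed (hypothetical helper names; in Spark the lower-cased map comes from DataSourceOptions.asMap):

```python
def as_map(options):
    # DataSourceOptions.asMap-style behavior: all keys come back lower-cased.
    return {k.lower(): v for k, v in options.items()}

params = as_map({"groupIdPrefix": "my-prefix"})

# Before the fix: a mixed-case lookup misses the lower-cased key,
# so the user-supplied prefix is silently ignored.
broken = params.get("groupIdPrefix", "spark-kafka-source")

# After the fix: look up the lower-cased constant instead.
GROUP_ID_PREFIX = "groupidprefix"
fixed = params.get(GROUP_ID_PREFIX, "spark-kafka-source")
```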
I see now, it was not clear from the PR description.
Good point. I updated the description.
LGTM.
source has its own consumer group that does not face interference from any other consumer, and
therefore can read all of the partitions of its subscribed topics. In some scenarios (for example,
Kafka group-based authorization), you may want to use a specific authorized group id to read data.
You can optionally set the group ID. However, do this with extreme caution as it can cause
nit: ID -> id
offsetReader.driverKafkaParams.containsKey(ConsumerConfig.GROUP_ID_CONFIG)) {
s"$deletedPartitions are gone. ${KafkaSourceProvider.CUSTOM_GROUP_ID_ERROR_MESSAGE}"
} else {
s"$deletedPartitions are gone. Some data may have been missed"
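The branching above can be sketched in Python. These are hypothetical stand-ins: the real constant is KafkaSourceProvider.CUSTOM_GROUP_ID_ERROR_MESSAGE, whose exact wording is not shown in this thread, so the text below is an assumption.

```python
# Stand-in text; the real message lives in KafkaSourceProvider.
CUSTOM_GROUP_ID_ERROR_MESSAGE = (
    "Some data may have been missed; a custom group id may be shared "
    "with other queries or consumers."
)

def deleted_partitions_message(deleted_partitions, has_custom_group_id):
    # With a custom group id, data loss may stem from group id sharing,
    # so the message points the user at that possibility.
    if has_custom_group_id:
        return f"{deleted_partitions} are gone. {CUSTOM_GROUP_ID_ERROR_MESSAGE}"
    return f"{deleted_partitions} are gone. Some data may have been missed."
```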
nit: add a period after "missed".
Test build #101094 has finished for PR 23301 at commit
LGTM.
Thanks! Merging to master.
Hi team, sorry for intruding into this discussion, but I have a query regarding 'groupIdPrefix'. The version I currently work with doesn't support its usage: even though the property is set during the creation of the Kafka read stream, I still see 'spark-kafka-source' as the group id prefix. Versions of the libraries I use: spark-streaming-kafka-0-10_2.11: 2.3.1. Kindly let me know which version should have this latest fix for 'groupIdPrefix' usage. Again, sorry if I misused this thread for posting the query. Thanks,
@joykrishna This PR looks like it landed only on the master branch, which will be released as 3.0.0.
@HeartSaVioR Thanks for the clarification. Any idea when we can expect this release?
@joykrishna Good question: roughly 5-6 months, but no promises. It's always faster if this PR is backported.
@gaborgsomogyi Thanks for the quick reply. Kindly let me know if there is any alternative for using a custom prefix until then.
@joykrishna I don't think it can be done without this change. I would like to help you, but backporting is really up to a committer.
@gaborgsomogyi I understand. Thanks for the response. Looking forward to seeing this fix at the earliest, as custom Kafka consumer group prefixes are needed when we work with third-party Kafka providers and want to distinguish our consumer group patterns.
@joykrishna Just to be clear: this is a new feature rather than a bug fix. We don't backport new features like this to maintenance branches. Hence, the next available version for this will be 3.0.0. If you cannot wait for the next release, you can try to backport the related patches yourself and build your own Kafka connector.
@zsxwing Ah, my bad. Understood that it is a new feature, and sure, I will take a stab at the option you mentioned. Thanks.
## What changes were proposed in this pull request?
This PR allows the user to override `kafka.group.id` for better monitoring or security. The user needs to make sure there are not multiple queries or sources using the same group id. It also fixes a bug that the `groupIdPrefix` option cannot be retrieved.
## How was this patch tested?
The newly added unit tests.
Closes apache#23301 from zsxwing/SPARK-26350.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>