KAFKA-12374: Add missing config sasl.mechanism.controller.protocol #10199
Conversation
@abbccdda The ZooKeeper version of this test is failing because the producer is never seeing the error it expects.
    self.kafka.restart_cluster()
    raise RuntimeError("Kafka restarted successfully but should not have!"
                       " Endpoint validation did not fail with invalid hostname")
except TimeoutError:
How long does this test have to wait for the timeout?
The test burns 120 seconds waiting because the restart and shutdown timeouts are both 60 seconds by default. We could adjust the restart timeout to 30 seconds if we wanted to, but the shutdown timeout is set by ducktape during teardown and can't be changed. So we burn 120 seconds but could probably drop that to 90 seconds. The test currently reports: run time: 2 minutes 43.263 seconds
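For reference, the restart wait could in principle be capped from the test side as well. Here is a minimal, self-contained sketch of that idea -- the run_with_deadline helper and the 30-second figure are illustrative assumptions, not ducktape API:

import threading

def run_with_deadline(fn, timeout_sec):
    # Run fn in a worker thread and raise TimeoutError if it has not finished
    # within timeout_sec. The worker keeps running in the background after a
    # timeout, which is acceptable in a test that is about to fail anyway.
    result = {}

    def worker():
        try:
            fn()
        except Exception as e:  # propagate the callable's own failure
            result["error"] = e

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_sec)
    if t.is_alive():
        raise TimeoutError("still running after %d seconds" % timeout_sec)
    if "error" in result:
        raise result["error"]

# Hypothetical usage inside the test, capping the restart wait at 30 seconds:
#     run_with_deadline(self.kafka.restart_cluster, 30)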
else:
    # Raft-based metadata quorum with SSL communication between quorum and broker
    # will simply fail to work due to TLS hostname mismatch
    self.kafka.remote_controller_quorum.restart_cluster()
As I understand it, this test wants to mess up the inter-broker listener and see that clients get the correct error. However, in Raft mode, we can't load any metadata if the inter-broker listener is non-functional, since we cannot connect to the Raft quorum. This is analogous to breaking the connection to ZK in ZK mode.
Maybe we should just take the Raft mode out of this test and write a new test that exercises similar behavior? Perhaps we could let the inter-broker listener stay properly configured, but mis-configure the controller listener. WDYT?
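To make the suggestion concrete, here is a hypothetical set of broker property overrides for that variant -- the listener names and values are illustrative assumptions, not the configuration this PR's test actually uses:

# Inter-broker traffic stays on a working PLAINTEXT listener, while the
# controller listener uses SSL with hostname verification left on, so a
# certificate whose SAN does not match the controller host fails validation.
controller_listener_misconfig = {
    "inter.broker.listener.name": "INTERNAL",
    "controller.listener.names": "CONTROLLER",
    "listener.security.protocol.map": "INTERNAL:PLAINTEXT,CONTROLLER:SSL",
    "ssl.endpoint.identification.algorithm": "HTTPS",
}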
Yeah, I agree that mucking with the broker-controller connectivity is analogous to mucking with the non-Raft brokers' connection to ZooKeeper. We currently don't check that Kafka will not start if hostname verification fails to a TLS-enabled ZooKeeper quorum, but that type of test -- Raft and ZooKeeper -- would fit in a "security system test" as we are doing here. I think there's room for checking to make sure we can't start with hostname verification failing to the Raft controller quorum. I don't know if it's worth adding a ZooKeeper case given that TLS to ZooKeeper only became supported in AK 2.7 (meaning adoption is light) and we are moving away from ZooKeeper. Reactions to these thoughts?
Aside from the above, we could consider leaving broker-to-Raft-controller connectivity unperturbed/functional and making hostname verification fail for the inter-broker security protocol as you suggest. We could add that after we get the ZooKeeper version working again. I could "fix" that test and add the Raft equivalent, but Boyang needs to take a look to see whether the auto topic creation manager changes are causing something to happen that should not.
Thanks for the PR. I'm not sure there is a flow change from LEADER_NOT_AVAILABLE towards INVALID_REPLICATION_FACTOR; instead we are returning UNKNOWN_TOPIC_OR_PARTITION to replace LEADER_NOT_AVAILABLE for the metadata auto topic creation case in general. Could you give me more explanation of why this case has to be connected with the inter-broker SSL communication? It should only affect client-broker communication IMHO.
Hi @abbccdda. We used to see the LEADER_NOT_AVAILABLE error message in the producer log. Now we are instead seeing UNKNOWN_TOPIC_OR_PARTITION in the log file, and of course now the test fails. It is a simple matter to make the test pass -- I have confirmed that it passes if we update the error the test expects -- but I think we just need to decide if this change in behavior is acceptable or not.
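For illustration only, a tiny sketch of how the expectation could be flipped from the old message to the new one -- the file path and helper below are hypothetical, not the ducktape code the test actually uses:

OLD_ERROR = "LEADER_NOT_AVAILABLE"
NEW_ERROR = "UNKNOWN_TOPIC_OR_PARTITION"

def log_mentions(path, token):
    # True if any line of the log file at `path` contains the given token.
    with open(path) as f:
        return any(token in line for line in f)

# Hypothetical assertion against a producer log captured by the test:
#     assert log_mentions("/tmp/producer.log", NEW_ERROR)
#     assert not log_mentions("/tmp/producer.log", OLD_ERROR)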
@mumrah I separated out the communication-to-quorum failure cases into a separate test -- so now we confirm that hostname verification failure to both ZK and the Raft controller causes Kafka to be unable to start. I did not add the test where communication to the Raft quorum is okay but the inter-broker protocol is SSL and failing due to hostname verification. Let me know if you think it is necessary and I can add it. @abbccdda I "fixed" the ZooKeeper test as mentioned above -- we can revert if you decide you need to fix something related to the auto topic creation manager instead.
@@ -675,6 +676,7 @@ object KafkaConfig {
    "KIP-500. If it is not set, the metadata log is placed in the first log directory from log.dirs."
  val ControllerListenerNamesDoc = "A comma-separated list of the names of the listeners used by the KIP-500 controller. This is required " +
    "if this process is a KIP-500 controller. The ZK-based controller will not use this configuration."
  val SaslMechanismControllerProtocolDoc = "SASL mechanism used for broker-to-raft-controller communication. Default is GSSAPI."
Should probably say what KIP-631 says, which is "SASL mechanism used for communication with controllers. Default is GSSAPI." It is used by Raft controllers to talk to other controllers, after all (assuming we are using SASL for inter-controller communication). Will fix.
# using version.vstring (distutils.version.LooseVersion) is a tricky way of ensuring
# that this check works with DEV_BRANCH
# When running VerifiableProducer 0.8.X, both the current branch version and 0.8.X should show up because of the
# way verifiable producer pulls in some development directories into its classpath
#
# If the test fails here because 'ps .. | grep' couldn't find the process it means
# the login and grep that is_version() performs is slower than
# the time it takes the producer to produce its messages.
# Easy fix is to decrease throughput= above, the good fix is to make the producer
# not terminate until explicitly killed in this case.
if node.version <= LATEST_0_8_2:
    assert is_version(node, [node.version.vstring, DEV_BRANCH.vstring], logger=self.logger)
else:
This stuff can probably be removed. Remote Raft isn't supported on older versions, and the comment appears above anyway.
            }
        )
    else:
        jaas_conf = self.properties['sasl.jaas.config']

    if self.static_jaas_conf:
        node.account.create_file(SecurityConfig.JAAS_CONF_PATH, jaas_conf)
        print(jaas_conf)
Will revert this unintentional change
:return: enabled_sasl_mechanisms plus any mechanisms that are used for Raft communication
"""
retval = set([self.client_sasl_mechanism, self.interbroker_sasl_mechanism] + self.raft_sasl_mechanisms)
print(retval)
will revert
LGTM
KAFKA-12374: Add missing config sasl.mechanism.controller.protocol (#10199)

Fix some cases where we were erroneously using the configuration of the inter-broker listener instead of the controller listener. Add the sasl.mechanism.controller.protocol configuration key specified by KIP-631. Add some ducktape tests.

Reviewers: Colin P. McCabe <cmccabe@apache.org>, David Arthur <mumrah@gmail.com>, Boyang Chen <boyang@confluent.io>
I filed https://issues.apache.org/jira/browse/KAFKA-12381 to follow up on the auto-topic-creation behavior change discussion.
* apache-github/trunk: (37 commits)
  KAFKA-10357: Extract setup of changelog from Streams partition assignor (apache#10163)
  KAFKA-10251: increase timeout for consuming records (apache#10228)
  KAFKA-12394; Return `TOPIC_AUTHORIZATION_FAILED` in delete topic response if no describe permission (apache#10223)
  MINOR: Disable transactional/idempotent system tests for Raft quorums (apache#10224)
  KAFKA-10766: Unit test cases for RocksDBRangeIterator (apache#9717)
  KAFKA-12289: Adding test cases for prefix scan in InMemoryKeyValueStore (apache#10052)
  KAFKA-12268: Implement task idling semantics via currentLag API (apache#10137)
  MINOR: Time and log producer state recovery phases (apache#10241)
  MINOR: correct the error message of validating uint32 (apache#10193)
  MINOR: Format the revoking active log output in `StreamsPartitionAssignor` (apache#10242)
  KAFKA-12323 Follow-up: Refactor the unit test a bit (apache#10205)
  MINOR: Remove stack trace of the lock exception in a debug log4j (apache#10231)
  MINOR: Word count should account for extra whitespaces between words (apache#10229)
  MINOR; Small refactor in `GroupMetadata` (apache#10236)
  KAFKA-10340: Proactively close producer when cancelling source tasks (apache#10016)
  KAFKA-12329; kafka-reassign-partitions command should give a better error message when a topic does not exist (apache#10141)
  KAFKA-12254: Ensure MM2 creates topics with source topic configs (apache#10217)
  MINOR: fix kafka-metadata-shell.sh (apache#10226)
  KAFKA-12374: Add missing config sasl.mechanism.controller.protocol (apache#10199)
  KAFKA-10101: Fix edge cases in Log.recoverLog and LogManager.loadLogs (apache#8812)
  ...
ZooKeeper-related system tests in zookeeper_security_upgrade_test.py and zookeeper_tls_test.py broke due to #10199. That patch changed the logic of SecurityConfig.enabled_sasl_mechanisms() to only add the inter-broker SASL mechanism when the inter-broker protocol was SASL_{PLAINTEXT,SSL}. The inter-broker protocol is left to default to PLAINTEXT for the SecurityConfig instance associated with ZooKeeper since that value doesn't apply to ZooKeeper, so the default inter-broker SASL mechanism of GSSAPI was not being added into the set returned by enabled_sasl_mechanisms(). This is actually correct -- GSSAPI shouldn't be added since inter-broker communication is a Kafka concept and doesn't apply to ZooKeeper. GSSAPI should be added when ZooKeeper uses it, though -- which is the case in these tests.

So the prior patch referred to above uncovered a bug: we were relying on the default inter-broker SASL mechanism to signal that Kerberos was being used by ZooKeeper, even though the inter-broker protocol has nothing to do with that determination in such cases. This patch explicitly includes GSSAPI in the list of enabled SASL mechanisms when SASL is enabled for use by ZooKeeper.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
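A rough sketch of the selection logic described above -- the attribute names are modeled on kafkatest's SecurityConfig but should be treated as assumptions rather than a verbatim copy of the fix:

def enabled_sasl_mechanisms(self):
    # Client mechanism is always of interest in these tests (simplified here).
    mechanisms = {self.client_sasl_mechanism}
    # The inter-broker SASL mechanism only matters when the inter-broker
    # protocol actually is a SASL protocol.
    if self.interbroker_security_protocol in ("SASL_PLAINTEXT", "SASL_SSL"):
        mechanisms.add(self.interbroker_sasl_mechanism)
    # Mechanisms used for broker-to-Raft-controller communication.
    mechanisms.update(self.raft_sasl_mechanisms)
    # ZooKeeper SASL (Kerberos) is independent of the inter-broker protocol,
    # so GSSAPI must be added explicitly when ZooKeeper uses SASL.
    if self.zk_sasl:
        mechanisms.add("GSSAPI")
    return mechanisms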
This patch adds the sasl.mechanism.controller.protocol configuration key specified by KIP-631.

It changes KafkaRaftManager to use the first entry in controller.listener.names to determine the listener name; that listener name's mapped value in the listener.security.protocol.map (if such a mapping exists, otherwise the listener name itself) for the security protocol; and the value of sasl.mechanism.controller.protocol for the SASL mechanism (a sketch of this resolution order appears below). Prior to this patch it was incorrectly using inter-broker security information when it connects to the Raft controller quorum.

It changes RaftControllerNodeProvider to use the value of sasl.mechanism.controller.protocol instead of the inter-broker SASL mechanism (it was already determining the listener name and security protocol correctly).

It updates tests/kafkatest/tests/core/security_test.py to correctly test various TLS hostname verification scenarios for Raft controller quorums, and includes new tests to confirm that hostname verification failures to both a Raft metadata quorum and a ZooKeeper quorum prevent Kafka from starting. The system tests now exercise the changes to KafkaRaftManager and RaftControllerNodeProvider described above.

It updates tests/kafkatest/sanity_checks/test_verifiable_producer.py to check for various Raft-related security protocol/SASL mechanism combinations to make sure they work.

Committer Checklist (excluded from commit message)
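To make the resolution order in the summary above concrete, here is a minimal sketch in plain Python -- a hypothetical helper over a config dict, not the actual KafkaRaftManager/KafkaConfig code:

def controller_security_settings(config):
    # First entry of controller.listener.names gives the controller listener name.
    listener_name = config["controller.listener.names"].split(",")[0]
    # Map the listener name through listener.security.protocol.map; if no
    # mapping exists, the listener name itself is taken as the security protocol.
    protocol_map = dict(
        entry.split(":") for entry in
        config.get("listener.security.protocol.map", "").split(",") if entry
    )
    security_protocol = protocol_map.get(listener_name, listener_name)
    # SASL mechanism comes from sasl.mechanism.controller.protocol (default GSSAPI).
    sasl_mechanism = config.get("sasl.mechanism.controller.protocol", "GSSAPI")
    return listener_name, security_protocol, sasl_mechanism

# Example with hypothetical values:
#     controller_security_settings({
#         "controller.listener.names": "CONTROLLER",
#         "listener.security.protocol.map": "CONTROLLER:SASL_SSL,INTERNAL:PLAINTEXT",
#         "sasl.mechanism.controller.protocol": "PLAIN",
#     })  # -> ("CONTROLLER", "SASL_SSL", "PLAIN")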