
KAFKA-12374: Add missing config sasl.mechanism.controller.protocol #10199

Merged: 11 commits into apache:trunk on Feb 27, 2021

Conversation

@rondagostino (Contributor) commented Feb 23, 2021

  • Implements the KIP-631 config sasl.mechanism.controller.protocol
  • Updates KafkaRaftManager to use the first entry in controller.listener.names as the listener name; that name's mapped value in listener.security.protocol.map (or the listener name itself if no mapping exists) as the security protocol; and the value of sasl.mechanism.controller.protocol as the SASL mechanism (see the sketch after this list). Prior to this patch it incorrectly used inter-broker security information when connecting to the Raft controller quorum.
  • Updates RaftControllerNodeProvider to use the value of sasl.mechanism.controller.protocol instead of the inter-broker SASL mechanism (it was already determining the listener name and security protocol correctly).
  • Updates the system tests in tests/kafkatest/tests/core/security_test.py to correctly test the TLS hostname verification scenarios for Raft controller quorums, and adds new tests confirming that a hostname verification failure to either a Raft metadata quorum or a ZooKeeper quorum prevents Kafka from starting. These system tests now exercise the KafkaRaftManager and RaftControllerNodeProvider changes described above.
  • Adds system tests in tests/kafkatest/sanity_checks/test_verifiable_producer.py to verify that various Raft-related security protocol/SASL mechanism combinations work.
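
For illustration, here is a rough Python sketch of the lookup order described in the first bullet above. The config keys are the real ones; the helper itself is hypothetical and not part of the patch.

def controller_connection_security(config):
    """Resolve how a broker connects to the Raft controller quorum.

    `config` is assumed to be a plain dict of Kafka config names to string values.
    """
    # The first entry in controller.listener.names is the controller listener name.
    listener_name = config["controller.listener.names"].split(",")[0].strip()

    # listener.security.protocol.map entries look like "CONTROLLER:SASL_PLAINTEXT".
    protocol_map = dict(
        entry.strip().split(":", 1)
        for entry in config.get("listener.security.protocol.map", "").split(",")
        if entry.strip()
    )
    # If no mapping exists, the listener name itself is used as the security protocol.
    security_protocol = protocol_map.get(listener_name, listener_name)

    # The SASL mechanism comes from the new config; KIP-631 defaults it to GSSAPI.
    sasl_mechanism = config.get("sasl.mechanism.controller.protocol", "GSSAPI")
    return listener_name, security_protocol, sasl_mechanism

For example, with controller.listener.names=CONTROLLER, listener.security.protocol.map=CONTROLLER:SASL_PLAINTEXT, and sasl.mechanism.controller.protocol=PLAIN, this returns ("CONTROLLER", "SASL_PLAINTEXT", "PLAIN").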

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@rondagostino (Contributor Author)

@abbccdda The ZooKeeper version of this test is failing because the producer never sees LEADER_NOT_AVAILABLE. The issue seems to be that when the broker sees a METADATA request for test_topic after it restarts, it determines that the topic needs to be created, and the auto topic creation manager then determines that there aren't enough brokers available, so it returns INVALID_REPLICATION_FACTOR for that topic in the Metadata response. In other words, the flow has changed, and the issue no longer manifests as it did before.

@cmccabe added the kraft label Feb 24, 2021
try:
    self.kafka.restart_cluster()
    raise RuntimeError("Kafka restarted successfully but should not have!"
                       " Endpoint validation did not fail with invalid hostname")
except TimeoutError:
Member

How long does this test have to wait for the timeout?

Contributor Author

The test burns 120 seconds waiting because the restart and shutdown timeouts each default to 60 seconds. We could lower the restart timeout to 30 seconds if we wanted, but the shutdown timeout is applied by ducktape during teardown and can't be changed, so the best we could do is drop the 120 seconds to 90. The test currently runs in about 2 minutes 43 seconds.

else:
    # Raft-based metadata quorum with SSL communication between quorum and broker
    # will simply fail to work due to TLS hostname mismatch
    self.kafka.remote_controller_quorum.restart_cluster()
Member

As I understand it, this test wants to mess up the inter-broker listener and verify that clients get the correct error. However, in Raft mode we can't load any metadata if the inter-broker listener is non-functional, since we cannot connect to the Raft quorum. This is analogous to breaking the connection to ZK in ZK mode.

Maybe we should just take the Raft mode out of this test and write a new test that exercises similar behavior? Perhaps we could leave the inter-broker listener properly configured but mis-configure the controller listener. WDYT?

Contributor Author

Yeah, I agree that mucking with broker-to-controller connectivity is analogous to mucking with a non-Raft broker's connection to ZooKeeper. We currently don't check that Kafka refuses to start when hostname verification fails against a TLS-enabled ZooKeeper quorum, but that type of test -- for both Raft and ZooKeeper -- would fit in a "security system test" like this one. I think there's value in checking that we can't start when hostname verification fails against the Raft controller quorum. I don't know if it's worth adding a ZooKeeper case, given that TLS to ZooKeeper only became supported in AK 2.7 (so adoption is light) and we are moving away from ZooKeeper. Reactions to these thoughts?

Aside from the above, we could leave broker-to-Raft-controller connectivity unperturbed and make hostname verification fail for the inter-broker security protocol, as you suggest. We could add that after we get the ZooKeeper version working again. I could "fix" that test and add the Raft equivalent, but Boyang needs to take a look to see whether the auto topic creation manager changes are causing something to happen that should not.
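
For concreteness, a rough ducktape-style sketch of the controller-quorum startup-failure check being discussed; the test name and setup are illustrative, not the actual code added to security_test.py.

def test_controller_quorum_ssl_endpoint_validation_failure(self):
    # Assumes the controller listener uses SSL and its certificates were generated
    # with a hostname that will not pass verification.
    try:
        self.kafka.restart_cluster()
        raise RuntimeError("Kafka restarted successfully but should not have!"
                           " Endpoint validation did not fail with invalid hostname")
    except TimeoutError:
        # Expected: the brokers never finish starting because they cannot
        # reach the Raft controller quorum.
        pass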

@abbccdda (Contributor) left a comment

Thanks for the PR. I'm not sure there is a flow change from LEADER_NOT_AVAILABLE to INVALID_REPLICATION_FACTOR; rather, we now return UNKNOWN_TOPIC_OR_PARTITION in place of LEADER_NOT_AVAILABLE for the metadata auto topic creation case in general. Could you explain further why this case is connected to the inter-broker SSL communication? It should only affect client-broker communication IMHO.

@rondagostino (Contributor Author)

Hi @abbccdda. We used to see this error message in verifiable_producer.log when security_protocol='PLAINTEXT', interbroker_security_protocol='SSL':

WARN [Producer clientId=producer-1] Error while fetching metadata with correlation id 1 : {test_topic=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)

In this case the test greps the log for LEADER_NOT_AVAILABLE, and it used to pass.

Now we are instead seeing this in the log file:

WARN [Producer clientId=producer-1] Error while fetching metadata with correlation id 1 : {test_topic=INVALID_REPLICATION_FACTOR} (org.apache.kafka.clients.NetworkClient)

And of course now the test fails.

The INVALID_REPLICATION_FACTOR is coming from the auto topic creation manager as I described above.

It is a simple matter to make the test pass -- I have confirmed that it passes if we grep for INVALID_REPLICATION_FACTOR in the log file instead of LEADER_NOT_AVAILABLE.

I think we just need to decide if this change in behavior is acceptable or not.
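
For context, the check in question boils down to grepping verifiable_producer.log for the expected error code; a minimal, self-contained sketch (the helper and path are illustrative, not the exact test code):

def saw_metadata_error(log_path, error_code="INVALID_REPLICATION_FACTOR"):
    # Returns True if the producer log contains the expected
    # "Error while fetching metadata" warning carrying the given error code.
    with open(log_path) as log_file:
        return any("Error while fetching metadata" in line and error_code in line
                   for line in log_file)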

@rondagostino (Contributor Author)

@mumrah I separated the communication-to-quorum failure cases out into a separate test -- so we now confirm that a hostname verification failure to either ZK or the Raft controller causes Kafka to be unable to start.

I did not add the test where communication to the Raft quorum is okay but the inter-broker protocol is SSL and failing due to hostname verification. Let me know if you think it is necessary and I can add it.

@abbccdda I "fixed" the ZooKeeper test as mentioned above -- we can revert if you decide you need to fix something related to the auto topic creation manager instead.

@rondagostino force-pushed the fix_security_tesxt_for_raft_case branch from aab4d1e to 29327ba on February 24, 2021 22:31
@rondagostino changed the title from "MINOR: Fix security_test system test for Raft case" to "KAFKA-12374: Add missing config sasl.mechanism.controller.protocol" on Feb 24, 2021
@@ -675,6 +676,7 @@ object KafkaConfig {
"KIP-500. If it is not set, the metadata log is placed in the first log directory from log.dirs."
val ControllerListenerNamesDoc = "A comma-separated list of the names of the listeners used by the KIP-500 controller. This is required " +
"if this process is a KIP-500 controller. The ZK-based controller will not use this configuration."
val SaslMechanismControllerProtocolDoc = "SASL mechanism used for broker-to-raft-controller communication. Default is GSSAPI."
Contributor Author

Should probably say what KIP-631 says, which is "SASL mechanism used for communication with controllers. Default is GSSAPI." It is used by Raft controllers to talk to other controllers, after all (assuming we are using SASL for inter-controller communication). Will fix.

Comment on lines 135 to 147
# using version.vstring (distutils.version.LooseVersion) is a tricky way of ensuring
# that this check works with DEV_BRANCH
# When running VerifiableProducer 0.8.X, both the current branch version and 0.8.X should show up because of the
# way verifiable producer pulls in some development directories into its classpath
#
# If the test fails here because 'ps .. | grep' couldn't find the process it means
# the login and grep that is_version() performs is slower than
# the time it takes the producer to produce its messages.
# Easy fix is to decrease throughput= above, the good fix is to make the producer
# not terminate until explicitly killed in this case.
if node.version <= LATEST_0_8_2:
    assert is_version(node, [node.version.vstring, DEV_BRANCH.vstring], logger=self.logger)
else:
Contributor Author

This stuff can probably be removed. Remote Raft isn't supported on older versions, and the comment appears above anyway.

            }
        )
    else:
        jaas_conf = self.properties['sasl.jaas.config']

    if self.static_jaas_conf:
        node.account.create_file(SecurityConfig.JAAS_CONF_PATH, jaas_conf)
        print(jaas_conf)
Contributor Author

Will revert this unintentional change

        :return: enabled_sasl_mechanisms plus any mechanisms that are used for Raft communication
        """
        retval = set([self.client_sasl_mechanism, self.interbroker_sasl_mechanism] + self.raft_sasl_mechanisms)
        print(retval)
Contributor Author

Will revert.

@rondagostino force-pushed the fix_security_tesxt_for_raft_case branch from a3220cc to e58e5c9 on February 26, 2021 23:34
@cmccabe (Contributor) left a comment

LGTM

@cmccabe cmccabe merged commit 5d37901 into apache:trunk Feb 27, 2021
cmccabe pushed a commit that referenced this pull request Feb 27, 2021
KAFKA-12374: Add missing config sasl.mechanism.controller.protocol (#10199)

Fix some cases where we were erroneously using the configuration of the inter-broker
listener instead of the controller listener. Add the sasl.mechanism.controller.protocol
configuration key specified by KIP-631. Add some ducktape tests.

Reviewers: Colin P. McCabe <cmccabe@apache.org>, David Arthur <mumrah@gmail.com>, Boyang Chen <boyang@confluent.io>
@cmccabe (Contributor) commented Feb 27, 2021

I filed https://issues.apache.org/jira/browse/KAFKA-12381 to follow up on the auto-topic-creation behavior change discussion.

ijuma added a commit to ijuma/kafka that referenced this pull request Mar 2, 2021
* apache-github/trunk: (37 commits)
  KAFKA-10357: Extract setup of changelog from Streams partition assignor (apache#10163)
  KAFKA-10251: increase timeout for consuming records (apache#10228)
  KAFKA-12394; Return `TOPIC_AUTHORIZATION_FAILED` in delete topic response if no describe permission (apache#10223)
  MINOR: Disable transactional/idempotent system tests for Raft quorums (apache#10224)
  KAFKA-10766: Unit test cases for RocksDBRangeIterator (apache#9717)
  KAFKA-12289: Adding test cases for prefix scan in InMemoryKeyValueStore (apache#10052)
  KAFKA-12268: Implement task idling semantics via currentLag API (apache#10137)
  MINOR: Time and log producer state recovery phases (apache#10241)
  MINOR: correct the error message of validating uint32 (apache#10193)
  MINOR: Format the revoking active log output in `StreamsPartitionAssignor` (apache#10242)
  KAFKA-12323 Follow-up: Refactor the unit test a bit (apache#10205)
  MINOR: Remove stack trace of the lock exception in a debug log4j (apache#10231)
  MINOR: Word count should account for extra whitespaces between words (apache#10229)
  MINOR; Small refactor in `GroupMetadata` (apache#10236)
  KAFKA-10340: Proactively close producer when cancelling source tasks (apache#10016)
  KAFKA-12329; kafka-reassign-partitions command should give a better error message when a topic does not exist (apache#10141)
  KAFKA-12254: Ensure MM2 creates topics with source topic configs (apache#10217)
  MINOR: fix kafka-metadata-shell.sh (apache#10226)
  KAFKA-12374: Add missing config sasl.mechanism.controller.protocol (apache#10199)
  KAFKA-10101: Fix edge cases in Log.recoverLog and LogManager.loadLogs (apache#8812)
  ...
cmccabe pushed a commit that referenced this pull request Mar 17, 2021
ZooKeeper-related system tests in zookeeper_security_upgrade_test.py and
zookeeper_tls_test.py broke due to #10199. That patch changed the logic of
SecurityConfig.enabled_sasl_mechanisms() to only add the inter-broker SASL
mechanism when the inter-broker protocol was SASL_{PLAINTEXT,SSL}. The
inter-broker protocol is left to default to PLAINTEXT for the SecurityConfig
instance associated with Zookeeper since that value doesn't apply to ZooKeeper,
so the default inter-broker SASL mechanism of GSSAPI was not being added into
the set returned by enabled_sasl_mechanisms(). This is actually correct --
GSSAPI shouldn't be added since inter-broker communication is a Kafka concept
and doesn't apply to ZooKeeper. GSSAPI should be added when ZooKeeper uses it,
though -- which is the case in these tests. So the prior patch referred to
above uncovered a bug: we were relying on the default inter-broker SASL
mechanism to signal that Kerberos was being used by ZooKeeper even though the
inter-broker protocol has nothing to do with that determination in such cases.
This patch explicitly includes GSSAPI in the list of enabled SASL mechanisms
when SASL is enabled for use by ZooKeeper.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
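
A rough Python sketch of the enabled_sasl_mechanisms() logic that results from this fix; the attribute names loosely follow kafkatest's SecurityConfig but are approximations, not the exact patch.

SASL_PROTOCOLS = {"SASL_PLAINTEXT", "SASL_SSL"}

def enabled_sasl_mechanisms(self):
    mechanisms = set()
    if self.security_protocol in SASL_PROTOCOLS:
        mechanisms.add(self.client_sasl_mechanism)
    # Only count the inter-broker mechanism when the inter-broker protocol is SASL.
    if self.interbroker_security_protocol in SASL_PROTOCOLS:
        mechanisms.add(self.interbroker_sasl_mechanism)
    # Mechanisms used for Raft controller communication are always included.
    mechanisms.update(self.raft_sasl_mechanisms)
    # ZooKeeper SASL is independent of the inter-broker protocol: include GSSAPI
    # explicitly when ZooKeeper itself is configured to use SASL (Kerberos).
    if self.zk_sasl:
        mechanisms.add("GSSAPI")
    return mechanisms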