KAFKA-10723: Fix LogManager shutdown error handling #9596

kowshik · 2020-11-15T01:59:13Z

The asynchronous shutdown in LogManager has a shortcoming: if any of the internal futures fails during shutdown, we do not always ensure that all futures have completed before LogManager.shutdown returns. This is because a line in the finally clause shuts down the thread pools asynchronously. As a result, even after the "shut down completed" message from KafkaServer appears in the error logs, some futures continue to run inside LogManager, attempting to close some logs. This is misleading during debugging. It also sometimes introduces avoidable post-shutdown activity in the broker, where resources (such as file handles) are released or persistent state is checkpointed.

In this PR, we fix the above behavior so that threads do not leak. If any of the futures throws an error, we skip creating the checkpoint and clean shutdown files only for the affected log directory. We continue to wait for all futures to complete for all directories.
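As a rough illustration of the behavior described above, here is a minimal sketch and not the actual LogManager code. The object `ShutdownSketch`, the method `cleanDirs`, and the directory names are hypothetical. It waits for every future of every directory and reports only the directories whose jobs all succeeded, which are the ones that would get a checkpoint and clean shutdown file.

```scala
import java.util.concurrent.{Callable, Executors, Future, TimeUnit}
import scala.util.Try

object ShutdownSketch {
  // Hypothetical stand-in for LogManager.shutdown: each job represents closing one log.
  // Returns the directories whose jobs all completed without error.
  def cleanDirs(jobsByDir: Map[String, Seq[() => Unit]]): Set[String] = {
    val pool = Executors.newFixedThreadPool(2)
    try {
      // Submit every job up front so all directories make progress concurrently.
      val futures: Map[String, Seq[Future[Unit]]] = jobsByDir.map { case (dir, jobs) =>
        dir -> jobs.map { job =>
          pool.submit(new Callable[Unit] { override def call(): Unit = job() })
        }
      }
      // map (strict on a List) waits on every future before forall inspects the results,
      // so one failure does not stop us from waiting on the remaining futures.
      futures.collect {
        case (dir, dirFutures) if dirFutures.map(f => Try(f.get())).forall(_.isSuccess) => dir
      }.toSet
    } finally {
      // All futures have already been awaited, so the pool is idle at this point.
      pool.shutdown()
      pool.awaitTermination(10, TimeUnit.SECONDS)
    }
  }
}
```

A caller would then write the checkpoint and clean shutdown file only for the returned directories and log the failures for the rest.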

Test plan:

Added a new unit test: LogManagerTest.testHandlingExceptionsDuringShutdown.

kowshik · 2020-11-15T06:30:14Z

cc @dhruvilshah3 @junrao @hachikuji for review

lqjack · 2020-11-16T09:56:43Z

core/src/test/scala/unit/kafka/log/LogManagerTest.scala

+    log2.appendAsLeader(TestUtils.singletonRecords("test2".getBytes()), leaderEpoch = 0)
+
+    // This should cause log1.close() to fail during LogManger shutdown sequence.
+    FileUtils.deleteDirectory(logFile1)


If the end user deletes the log files manually, the server cannot be stopped, and they cannot start it up again? In this case, how do they resolve it?

Sorry I do not understand the question.

What if an error occurs during the shutdown of the broker? Should we log the error info to the log or just throw the exception?

It depends on the kind of error, but we do log the error information to the log today from within KafkaServer.shutdown().

junrao

@kowshik : Thanks for the PR. A couple of comments below.

junrao · 2020-11-16T22:22:04Z

core/src/main/scala/kafka/log/LogManager.scala

-      case e: ExecutionException =>
-        error(s"There was an error in one of the threads during LogManager shutdown: ${e.getCause}")
-        throw e.getCause
+      firstExceptionOpt.foreach{ e => throw e}


Hmm, since we are about to shut down the JVM, should we just log a WARN here instead of throwing the exception?

Great point. I've changed the code to do the same.
My understanding is that the exception-swallowing safety net exists inside KafkaServer.shutdown() today, but it makes sense to also log a warning here instead of relying on the safety net:

kafka/core/src/main/scala/kafka/server/KafkaServer.scala

Line 732 in bb34c5c

CoreUtils.swallow(logManager.shutdown(), this)
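To make the "log a warning instead of throwing" idea concrete, here is a small hedged sketch. The helper below is hypothetical and is not Kafka's CoreUtils.swallow; it only illustrates the pattern of running an action and routing any non-fatal failure to a warning callback.

```scala
import scala.util.control.NonFatal

object SwallowSketch {
  // Hypothetical helper: runs `action`; on a non-fatal failure it hands a message
  // to `warn` instead of letting the exception propagate to the caller.
  def swallow(action: => Unit)(warn: String => Unit): Unit =
    try action
    catch {
      case NonFatal(e) => warn(s"Swallowed during shutdown: ${e.getMessage}")
    }
}
```

With this shape, a failure in one shutdown step is recorded but does not prevent the remaining shutdown steps from running.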

junrao · 2020-11-16T22:22:49Z

core/src/test/scala/unit/kafka/log/LogManagerTest.scala

+   */
+  @Test
+  def testHandlingExceptionsDuringShutdown(): Unit = {
+    logManager.shutdown()


Hmm, do we need this given that we do this in tearDown() already?

Yeah, this explicit shutdown is needed to:

1. Re-create a new LogManager instance with multiple logDirs for this test. This is different from the default one provided in setUp().

2. Help do some additional checks post shutdown (towards the end of this test).

Thinking about it again, you are right. I have eliminated the need for the shutdown() now by using a LogManager instance specific to the test.

kowshik · 2020-11-17T03:00:46Z

Thanks for the review @junrao! I have addressed the comments in f917f0c.

ijuma · 2020-11-18T06:06:57Z

core/src/main/scala/kafka/log/LogManager.scala

@@ -479,25 +479,33 @@ class LogManager(logDirs: Seq[File],

    try {
      for ((dir, dirJobs) <- jobs) {
-        dirJobs.foreach(_.get)
+        val hasErrors = dirJobs.exists {
+          future =>


Nit: this should be in the previous line.

ijuma · 2020-11-18T06:08:16Z

core/src/main/scala/kafka/log/LogManager.scala

@@ -479,25 +479,33 @@ class LogManager(logDirs: Seq[File],

    try {
      for ((dir, dirJobs) <- jobs) {
-        dirJobs.foreach(_.get)
+        val hasErrors = dirJobs.exists {


This looks wrong. exists short-circuits. I think you want map followed by exists.

That's a really good point. Done.
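The short-circuit concern can be seen in a small standalone sketch. The names below are hypothetical and each job simply returns whether it succeeded; the counters show how many jobs each approach actually evaluates, assuming a strict collection such as List.

```scala
object ShortCircuitSketch {
  // Returns (number of jobs evaluated, whether any job failed).
  // exists stops at the first failing job, so later jobs are never evaluated.
  def withExists(jobs: List[() => Boolean]): (Int, Boolean) = {
    var ran = 0
    val hasErrors = jobs.exists { job => ran += 1; !job() }
    (ran, hasErrors)
  }

  // map on a strict List evaluates every job first; exists then only inspects results.
  def withMapThenExists(jobs: List[() => Boolean]): (Int, Boolean) = {
    var ran = 0
    val results = jobs.map { job => ran += 1; job() }
    (ran, results.exists(ok => !ok))
  }
}
```

For three jobs where the second one fails, withExists evaluates two jobs while withMapThenExists evaluates all three; both report that an error occurred.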

ijuma · 2020-11-18T06:09:09Z

core/src/main/scala/kafka/log/LogManager.scala

-        dirJobs.foreach(_.get)
+        val hasErrors = dirJobs.exists {
+          future =>
+            try {


You can use scala.util.Try to wrap the call and get a Success or Failure.

Good idea, done.
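For reference, wrapping a blocking Future.get in scala.util.Try could look roughly like the sketch below. The object and method names are hypothetical, and the futures are plain CompletableFuture instances rather than LogManager's real shutdown jobs.

```scala
import java.util.concurrent.{CompletableFuture, ExecutionException, Future}
import scala.util.{Failure, Success, Try}

object TrySketch {
  // Blocks until the future completes and turns the outcome into a description.
  // A failed job surfaces as an ExecutionException whose cause is the original error.
  def describe(future: Future[String]): String =
    Try(future.get()) match {
      case Success(value)                 => s"ok: $value"
      case Failure(e: ExecutionException) => s"failed: ${e.getCause.getMessage}"
      case Failure(e)                     => s"failed: ${e.getMessage}"
    }
}
```

Note that Try only captures non-fatal throwables, so an InterruptedException thrown by get would still propagate out of describe.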

kowshik · 2020-11-18T06:52:06Z

Thanks for the review @ijuma! I have addressed the comments in 8716429.

junrao

@kowshik : Thanks for the latest PR. LGTM

KAFKA-10723: Fix LogManager shutdown error handling

3a36b54

kowshik force-pushed the KAFKA-10723_LogManager_shutdown_fix branch from 89243a3 to 3a36b54 on November 15, 2020 01:59

lqjack reviewed Nov 16, 2020

junrao reviewed Nov 16, 2020

Address comments from Jun

f917f0c

kowshik force-pushed the KAFKA-10723_LogManager_shutdown_fix branch from 71cbf59 to f917f0c on November 17, 2020 02:52

kowshik requested a review from junrao November 17, 2020 03:00

ijuma reviewed Nov 18, 2020

Address comments from Ismael

8716429

junrao approved these changes Nov 19, 2020

junrao merged commit dcbd28d into apache:trunk Nov 19, 2020

kowshik mentioned this pull request Dec 10, 2020

MINOR: a small refactor for LogManage#shutdown #9680

Merged
