Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager #11161
Conversation
Signed-off-by: Chong Gao <res_life@163.com>
Codecov Report
@@            Coverage Diff            @@
##           branch-22.08    #11161   +/- ##
===============================================
  Coverage          ?        86.29%
===============================================
  Files             ?           144
  Lines             ?         22698
  Branches          ?             0
===============================================
  Hits              ?         19588
  Misses            ?          3110
  Partials          ?             0
===============================================

Continue to review the full report at Codecov.
IMO: using sleep in tests is an anti-pattern, and using sleep in production code to make the test code work is a worse anti-pattern. The ShutdownHookManager in hadoop-common predates Spark's. We can either switch hadoop-common from test to compile scope or simply copy the idea to cudf-java.
I agree a sleep is not what we want.
Signed-off-by: Chong Gao <res_life@163.com>
Good idea, thanks.
Introduced a compile scope dependency; does this impact other projects? If cuDF (Java) is only used by the RAPIDS Accelerator it's OK, because the Spark runtime classpath definitely contains Hadoop ShutdownHookManager, which is used here to make sure leak checking is run after the Spark hooks.
java/pom.xml (Outdated)
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>3.2.3</version>
-     <scope>test</scope>
+     <scope>compile</scope>
I prefer provided to compile, according to your comment #11161 (comment).
Agree about 'provided' to avoid pulling in Hadoop in the shade plug-in
The Databricks classpath may not contain org.apache.hadoop.util.ShutdownHookManager. @gerashegalov
May not? We should see this at compile time during the Databricks build. Please add [databricks] to the PR title to double-check.
Done.
@@ -214,10 +214,24 @@ static void setDefaultGpu(int defaultGpuId) {
    if (defaultGpu >= 0) {
      Cuda.setDevice(defaultGpu);
    }

    for (CleanerWeakReference cwr : all) {
I just thought of another possible solution. Cleaner has an isClean API; you can call it to do some checking before the real cleaning in CleanerWeakReference.clean to avoid duplicate cleaning.
This bug was caused by a leak, not by duplicate cleaning.
The resources should be closed by cleaner.clean(logErrorIfNotClean = false), triggered by the withResource clause, before this leak checking:
// leak checking in the shutdown hook:
for (CleanerWeakReference cwr : all) {
  // this invokes cleaner.clean(logErrorIfNotClean = true):
  // if the resource was not closed before, it closes it and then prints an error log.
  cwr.clean();
}
This doesn't seem to work. The Spark hooks and this hook should reside in the same ShutdownHookManager.
rerun tests
Tested.
Build failed.
// Here we also use Hadoop's `ShutdownHookManager` to add a lower-priority hook.
// Priority 20 is low enough that this hook will run after the Spark hooks.
// Note: `ShutdownHookManager.get()` is a singleton; Spark and JNI both use the same instance.
org.apache.hadoop.util.ShutdownHookManager.get().addShutdownHook(hook, 20);
I really don't like using the Hadoop ShutdownHookManager here. To solve the problem, all of the shutdown hooks that might conflict with each other must use this manager too. In some cases they do, but in other cases they will not. This is a library and it is not Hadoop specific.
I don't want to over-engineer this, but I personally would like to see a way for the user of the library to provide the way for the shutdown code to run, and if none is provided then the regular Java Runtime is used. But this is hard because all of the code is running inside a static block. We might need to first register the hook with the Java Runtime and then remove it if/when someone tells us to use a different shutdown mechanism.
1/ We can create a Runnable hook without automatically adding it to Runtime.
2/ We can add a Boolean system property for whether to add the shutdown hook directly to the Runtime.
3/ Then have a contract MemoryCleaner.getCleanerHook() that returns non-null from 1 if the Boolean from 2 is false (see the sketch below).
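A rough sketch of what this three-point contract could look like; the property name ai.rapids.memory.addDefaultShutdownHook, the class name, and the layout here are illustrative assumptions, not the actual cudf API:

// Hypothetical sketch of the proposed contract; names are illustrative only.
public class MemoryCleanerSketch {
  // 1/ A Runnable hook that is not automatically registered anywhere.
  private static final Runnable CLEANER_HOOK = () -> {
    // best-effort leak check at shutdown time would go here
  };

  // 2/ A Boolean system property controlling whether to self-register with the JVM Runtime.
  private static final boolean ADD_DEFAULT_HOOK =
      Boolean.parseBoolean(System.getProperty("ai.rapids.memory.addDefaultShutdownHook", "true"));

  static {
    if (ADD_DEFAULT_HOOK) {
      Runtime.getRuntime().addShutdownHook(new Thread(CLEANER_HOOK));
    }
  }

  // 3/ Expose the hook so an external shutdown manager can run it instead.
  public static Runnable getCleanerHook() {
    return ADD_DEFAULT_HOOK ? null : CLEANER_HOOK;
  }
}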
That sounds like a great solution to me.
We already have one Boolean, REF_COUNT_DEBUG:
https://github.com/rapidsai/cudf/blob/branch-22.08/java/src/main/java/ai/rapids/cudf/MemoryCleaner.java#L203
if (REF_COUNT_DEBUG) {
  // If we are debugging things do a best effort to check for leaks at the end
  Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    // ...
    for (CleanerWeakReference cwr : all) {
      cwr.clean();
    }
  }));
}
If REF_COUNT_DEBUG is disabled, there are no fake error logs.
We could create a new one like REF_COUNT_DEBUG_WHEN_CLOSE to replace it.
But we would still have this problem if the boolean is enabled.
In my interpretation, ai.rapids.refcount.debug is here to determine whether to check for leaks at shutdown time, not how to achieve this. So either add another Boolean switch along the lines of ai.rapids.refcount.debug.addShutdownHook, or change ai.rapids.refcount.debug to a String property with valid values along the lines of "none", "withInternalShutdownHook", "withExternalShutdownHook". We can use an Enum to manage them.
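For illustration, a minimal way the enum-backed property could be parsed; the helper name fromProperty and the default value are assumptions, not part of the proposal:

// Illustrative only: map the proposed String property values onto an enum.
enum LeakCheckMode {
  NONE, WITH_INTERNAL_SHUTDOWN_HOOK, WITH_EXTERNAL_SHUTDOWN_HOOK;

  static LeakCheckMode fromProperty() {
    String value = System.getProperty("ai.rapids.refcount.debug", "none");
    switch (value) {
      case "withInternalShutdownHook": return WITH_INTERNAL_SHUTDOWN_HOOK;
      case "withExternalShutdownHook": return WITH_EXTERNAL_SHUTDOWN_HOOK;
      default: return NONE; // covers "none" and any unrecognized value
    }
  }
}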
Signed-off-by: Chong Gao <res_life@163.com>
Pushed a new commit: RAPIDS Accelerator side change:
      cwr.clean();
    }
  }));
  Runtime.getRuntime().addShutdownHook(DEFAULT_SHUTDOWN_THREAD);
pro: we are getting away without adding more switches
con: we are risking race conditions depending on the upper layer
* Runnable used to be added to Java default shutdown hook.
* It checks the leaks at shutdown time.
* If you want to register the Runnable in a custom shutdown hook manager instead of Java default shutdown hook,
* should first remove it and then add it.
Link to the API for removing it:
- * should first remove it and then add it.
+ * should first remove it using {@link #removeDefaultShutdownHook()} and then add it.
Done
  }
}

public static void removeDefaultShutdownHook() {
Javadoc for the public method?
To reduce the risk of race conditions, we could change the signature of this method to public static Runnable removeDefaultShutdownHook(). After removing DEFAULT_SHUTDOWN_THREAD it would return DEFAULT_SHUTDOWN_RUNNABLE, which the caller can register in their shutdown manager. DEFAULT_SHUTDOWN_RUNNABLE can then be private.
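A condensed sketch of how that suggestion could look inside MemoryCleaner, reusing the all collection and clean() call from the snippets above; the field layout and javadoc wording are an approximation, not the verbatim merged code:

// Sketch only: shape of the suggested API, pieced together from this review thread.
private static final Runnable DEFAULT_SHUTDOWN_RUNNABLE = () -> {
  for (CleanerWeakReference cwr : all) {
    cwr.clean(); // logs an error if the resource was never closed
  }
};

private static final Thread DEFAULT_SHUTDOWN_THREAD = new Thread(DEFAULT_SHUTDOWN_RUNNABLE);

/**
 * Removes the default leak-checking hook from the JVM Runtime and returns the
 * underlying Runnable so the caller can register it in a custom shutdown hook manager.
 */
public static Runnable removeDefaultShutdownHook() {
  Runtime.getRuntime().removeShutdownHook(DEFAULT_SHUTDOWN_THREAD);
  return DEFAULT_SHUTDOWN_RUNNABLE;
}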
Done
Hadoop ShutdownHookManager to make sure leak checking is run after the Spark hooks
@gpucibot merge
Contributes to NVIDIA/spark-rapids#5854

Problem
A RapidsHostMemoryStore.pool leaked error log is printed when running Rapids Accelerator test cases.

Root cause
RapidsHostMemoryStore.pool is not closed before MemoryCleaner checks for leaks.
It's actually not a leak; it's caused by hook execution order.
RapidsHostMemoryStore.pool is closed in the Spark executor plugin hook. The close path is:
Rapids Accelerator JNI also checks leaks in a shutdown hook.
Shutdown hooks are executed concurrently; there is no execution order guarantee.
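As a standalone illustration (not cudf code), two hooks registered with the plain JVM Runtime are started in parallel at shutdown, so either one can finish first:

public class HookOrderDemo {
  public static void main(String[] args) {
    // The JVM starts both hook threads concurrently at shutdown and gives
    // no guarantee about which one finishes first.
    Runtime.getRuntime().addShutdownHook(new Thread(() ->
        System.out.println("plugin hook: closing the host memory pool (simulated)")));
    Runtime.getRuntime().addShutdownHook(new Thread(() ->
        System.out.println("leak-check hook: scanning for unclosed resources (simulated)")));
  }
}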
Solution 1 - Not recommended
Just wait one second before checking for leaks in MemoryCleaner.
It modifies debug code and closing code, and it has no impact on production code.
Solution 2 - Not recommended
Spark has a util class ShutdownHookManager, which is a ShutdownHook wrapper. It can addShutdownHook with a priority via Hadoop ShutdownHookManager.
Leveraging Hadoop ShutdownHookManager as Spark does is feasible.
Solution 3 - Recommended
Provide a method for the user to remove the hook and re-register it in a custom shutdown hook manager.
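Assuming the method returns the removed Runnable as agreed in the review above, an embedder such as the RAPIDS Accelerator could re-register it roughly like this; the wrapper class, the null check, and the priority value 20 (mirroring the earlier snippet) are only examples:

import org.apache.hadoop.util.ShutdownHookManager;

public class RegisterLeakCheckHook {
  public static void register() {
    // Sketch only: take the leak-check hook back from the JVM Runtime and
    // re-register it with a lower priority so it runs after Spark's own hooks.
    Runnable leakCheck = ai.rapids.cudf.MemoryCleaner.removeDefaultShutdownHook();
    if (leakCheck != null) { // defensive; the hook may already have been removed
      ShutdownHookManager.get().addShutdownHook(leakCheck, 20);
    }
  }
}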
Signed-off-by: Chong Gao <res_life@163.com>