Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager #11161
Conversation
Signed-off-by: Chong Gao <res_life@163.com>
Codecov Report
@@            Coverage Diff            @@
##           branch-22.08    #11161   +/- ##
===============================================
  Coverage          ?        86.29%
===============================================
  Files             ?           144
  Lines             ?         22698
  Branches          ?             0
===============================================
  Hits              ?         19588
  Misses            ?          3110
  Partials          ?             0
===============================================

Continue to review the full report at Codecov.
IMO: using sleep in tests is an anti-pattern, and using sleep in production code to make the test code work is a worse anti-pattern. The ShutdownHookManager in hadoop-common predates Spark's. We can either switch hadoop-common from test to compile scope or simply copy the idea to cudf-java.
I agree a sleep is not what we want.
Signed-off-by: Chong Gao <res_life@163.com>
Good idea, thanks.
Introduced a compile scope dependency; does this impact other projects? If cuDF (Java) is only used by the RAPIDS Accelerator it's OK, because the Spark runtime classpath definitely contains Hadoop ShutdownHookManager, which is used here to make sure leak checking is run after the Spark hooks.
java/pom.xml (Outdated)
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>3.2.3</version>
-     <scope>test</scope>
+     <scope>compile</scope>
I prefer provided to compile, according to your comment #11161 (comment).
Agree about 'provided' to avoid pulling in Hadoop in the shade plug-in
The Databricks classpath may not contain org.apache.hadoop.util.ShutdownHookManager. @gerashegalov
May not? We should see this at compile time during the Databricks build. Please add [databricks] to the PR title to double-check.
Done.
@@ -214,10 +214,24 @@ static void setDefaultGpu(int defaultGpuId) {
    if (defaultGpu >= 0) {
      Cuda.setDevice(defaultGpu);
    }

    for (CleanerWeakReference cwr : all) {
I just thought of another possible solution. Cleaner has an isClean API; you can call it to do some checking before the real cleaning in CleanerWeakReference.clean to avoid duplicate cleaning.
This bug was caused by a leak, not by duplicate cleaning.
The resources should be closed by cleaner.clean(logErrorIfNotClean = false), triggered by the withResource clause, before this leak checking:
// leak checking in the shutdown hook:
for (CleanerWeakReference cwr : all) {
  // this invokes cleaner.clean(logErrorIfNotClean = true):
  // if the resource was not closed before, it closes it and then prints an error log.
  cwr.clean();
}
This doesn't seem to work. The Spark hooks and this hook should reside in the same ShutdownHookManager.
rerun tests
Tested.
Build failed.
// Here we also use Hadoop's `ShutdownHookManager` to add a lower-priority hook.
// Priority 20 is low enough that this hook will run after the Spark hooks.
// Note: `ShutdownHookManager.get()` is a singleton; Spark and JNI both use the same instance.
org.apache.hadoop.util.ShutdownHookManager.get().addShutdownHook(hook, 20);
I really don't like using the Hadoop ShutdownHookManager here. To solve the problem, all of the shutdown hooks that might conflict with each other must use this manager too. In some cases they do, but in other cases they will not. This is a library and it is not Hadoop specific.
I don't want to over-engineer this, but I personally would like to see a way for the user of the library to provide the way for the shutdown code to run, and if none is provided then the regular Java Runtime is used. But this is hard because all of the code is running inside a static block. We might need to first register the hook with the Java Runtime and then remove it if/when someone tells us to use a different shutdown mechanism.
1/ We can create a Runnable hook without automatically adding it to Runtime.
2/ We can add a Boolean system property for whether to add the shutdown hook directly to the Runtime.
3/ Then have a contract MemoryCleaner.getCleanerHook() that returns non-null from 1 if the Boolean from 2 is false (see the sketch below).
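A rough sketch of what this three-point contract could look like; the property name ai.rapids.memory.addDefaultShutdownHook, the class name, and the layout here are illustrative assumptions, not the actual cudf API:

// Hypothetical sketch of the proposed contract; names are illustrative only.
public class MemoryCleanerSketch {
  // 1/ A Runnable hook that is not automatically registered anywhere.
  private static final Runnable CLEANER_HOOK = () -> {
    // best-effort leak check at shutdown time would go here
  };

  // 2/ A Boolean system property controlling whether to self-register with the JVM Runtime.
  private static final boolean ADD_DEFAULT_HOOK =
      Boolean.parseBoolean(System.getProperty("ai.rapids.memory.addDefaultShutdownHook", "true"));

  static {
    if (ADD_DEFAULT_HOOK) {
      Runtime.getRuntime().addShutdownHook(new Thread(CLEANER_HOOK));
    }
  }

  // 3/ Expose the hook so an external shutdown manager can run it instead.
  public static Runnable getCleanerHook() {
    return ADD_DEFAULT_HOOK ? null : CLEANER_HOOK;
  }
}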
That sounds like a great solution to me.
We already have one Boolean, REF_COUNT_DEBUG:
https://github.com/rapidsai/cudf/blob/branch-22.08/java/src/main/java/ai/rapids/cudf/MemoryCleaner.java#L203
if (REF_COUNT_DEBUG) {
  // If we are debugging things do a best effort to check for leaks at the end
  Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    // ...
    for (CleanerWeakReference cwr : all) {
      cwr.clean();
    }
  }));
}
If REF_COUNT_DEBUG is disabled, there are no fake error logs.
We could create a new one like REF_COUNT_DEBUG_WHEN_CLOSE to replace it.
But we would still have this problem if the boolean is enabled.
In my interpretation, ai.rapids.refcount.debug is here to determine whether to check for leaks at shutdown time, not how to achieve this. So either add another Boolean switch along the lines of ai.rapids.refcount.debug.addShutdownHook, or change ai.rapids.refcount.debug to a String property with valid values along the lines of "none", "withInternalShutdownHook", "withExternalShutdownHook". We can use an Enum to manage them.
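For illustration, a minimal way the enum-backed property could be parsed; the helper name fromProperty and the default value are assumptions, not part of the proposal:

// Illustrative only: map the proposed String property values onto an enum.
enum LeakCheckMode {
  NONE, WITH_INTERNAL_SHUTDOWN_HOOK, WITH_EXTERNAL_SHUTDOWN_HOOK;

  static LeakCheckMode fromProperty() {
    String value = System.getProperty("ai.rapids.refcount.debug", "none");
    switch (value) {
      case "withInternalShutdownHook": return WITH_INTERNAL_SHUTDOWN_HOOK;
      case "withExternalShutdownHook": return WITH_EXTERNAL_SHUTDOWN_HOOK;
      default: return NONE; // covers "none" and any unrecognized value
    }
  }
}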
Signed-off-by: Chong Gao <res_life@163.com>
Pushed a new commit: RAPIDS Accelerator side change:
      cwr.clean();
    }
  }));
  Runtime.getRuntime().addShutdownHook(DEFAULT_SHUTDOWN_THREAD);
pro: we are getting away without adding more switches
con: we are risking race conditions depending on the upper layer
* Runnable used to be added to Java default shutdown hook.
* It checks the leaks at shutdown time.
* If you want to register the Runnable in a custom shutdown hook manager instead of Java default shutdown hook,
* should first remove it and then add it.
Link to the API for removing it:
- * should first remove it and then add it.
+ * should first remove it using {@link #removeDefaultShutdownHook()} and then add it.
Done
  }
}

public static void removeDefaultShutdownHook() {
Javadoc for the public method?
To reduce the risk of race conditions, we could change the signature of this method to public static Runnable removeDefaultShutdownHook(). After removing DEFAULT_SHUTDOWN_THREAD it would return DEFAULT_SHUTDOWN_RUNNABLE, which the caller can register in their shutdown manager. DEFAULT_SHUTDOWN_RUNNABLE can then be private.
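A condensed sketch of how that suggestion could look inside MemoryCleaner, reusing the all collection and clean() call from the snippets above; the field layout and javadoc wording are an approximation, not the verbatim merged code:

// Sketch only: shape of the suggested API, pieced together from this review thread.
private static final Runnable DEFAULT_SHUTDOWN_RUNNABLE = () -> {
  for (CleanerWeakReference cwr : all) {
    cwr.clean(); // logs an error if the resource was never closed
  }
};

private static final Thread DEFAULT_SHUTDOWN_THREAD = new Thread(DEFAULT_SHUTDOWN_RUNNABLE);

/**
 * Removes the default leak-checking hook from the JVM Runtime and returns the
 * underlying Runnable so the caller can register it in a custom shutdown hook manager.
 */
public static Runnable removeDefaultShutdownHook() {
  Runtime.getRuntime().removeShutdownHook(DEFAULT_SHUTDOWN_THREAD);
  return DEFAULT_SHUTDOWN_RUNNABLE;
}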
Done
Hadoop ShutdownHookManager to make sure leak checking is run after the Spark hooks
@gpucibot merge
Contributes to NVIDIA/spark-rapids#5854

Problem
A RapidsHostMemoryStore.pool leaked error log is printed when running Rapids Accelerator test cases.

Root cause
RapidsHostMemoryStore.pool is not closed before MemoryCleaner checks for leaks.
It's actually not a leak; it's caused by hook execution order.
RapidsHostMemoryStore.pool is closed in the Spark executor plugin hook. The close path is:
Rapids Accelerator JNI also checks leaks in a shutdown hook.
Shutdown hooks are executed concurrently; there is no execution order guarantee.
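As a standalone illustration (not cudf code), two hooks registered with the plain JVM Runtime are started in parallel at shutdown, so either one can finish first:

public class HookOrderDemo {
  public static void main(String[] args) {
    // The JVM starts both hook threads concurrently at shutdown and gives
    // no guarantee about which one finishes first.
    Runtime.getRuntime().addShutdownHook(new Thread(() ->
        System.out.println("plugin hook: closing the host memory pool (simulated)")));
    Runtime.getRuntime().addShutdownHook(new Thread(() ->
        System.out.println("leak-check hook: scanning for unclosed resources (simulated)")));
  }
}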
Solution 1 - Not recommended
Just wait one second before checking for leaks in MemoryCleaner.
It modifies debug code and closing code, and it has no impact on production code.
Solution 2 - Not recommended
Spark has a util class ShutdownHookManager, which is a ShutdownHook wrapper. It can addShutdownHook with a priority via Hadoop ShutdownHookManager.
Leveraging Hadoop ShutdownHookManager as Spark does is feasible.
Solution 3 - Recommended
Provide a method for the user to remove the hook and re-register it in a custom shutdown hook manager.
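Assuming the method returns the removed Runnable as agreed in the review above, an embedder such as the RAPIDS Accelerator could re-register it roughly like this; the wrapper class, the null check, and the priority value 20 (mirroring the earlier snippet) are only examples:

import org.apache.hadoop.util.ShutdownHookManager;

public class RegisterLeakCheckHook {
  public static void register() {
    // Sketch only: take the leak-check hook back from the JVM Runtime and
    // re-register it with a lower priority so it runs after Spark's own hooks.
    Runnable leakCheck = ai.rapids.cudf.MemoryCleaner.removeDefaultShutdownHook();
    if (leakCheck != null) { // defensive; the hook may already have been removed
      ShutdownHookManager.get().addShutdownHook(leakCheck, 20);
    }
  }
}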
Signed-off-by: Chong Gao <res_life@163.com>