[SPARK-7689] Deprecate spark.cleaner.ttl #6220
Conversation
/cc @tdas
Test FAILed.
Jenkins, retest this please.
Merged build started.
Merged build finished. Test FAILed.
Test FAILed.
Test build #818 has started for PR 6220 at commit
Jenkins, retest this please.
Merged build triggered.
Merged build started.
Test build #32968 has started for PR 6220 at commit
Test build #32968 has finished for PR 6220 at commit
Merged build finished. Test FAILed.
Test FAILed.
Jenkins, retest this please.
Merged build triggered.
Merged build started.
Test build #32971 has started for PR 6220 at commit
This is tricky, actually. The reference-tracking mechanism that we have is not bullet-proof, because it depends on GC behavior in the driver. This is a problem I have seen in Spark Streaming programs as well. For drivers with large heaps, an RDD that is no longer in scope may not get dereferenced for a long time, until a full GC occurs, and in the meantime nothing will get cleaned. The solution to that is to call System.gc() at some interval, say one hour. So I am wondering whether this deprecation message should cover that aspect or not.
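A minimal sketch of the periodic-GC workaround described above, assuming it runs in the driver JVM; the `PeriodicDriverGc` object and thread name are illustrative, not part of Spark:

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

object PeriodicDriverGc {
  // Daemon thread so the timer never blocks JVM shutdown.
  private val scheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "driver-periodic-gc") // illustrative name
      t.setDaemon(true)
      t
    }
  })

  /** Request a full GC every `intervalMinutes` so that out-of-scope RDD and
    * broadcast references get collected (and thus become eligible for
    * ContextCleaner cleanup) even on large, rarely-collected heaps. */
  def start(intervalMinutes: Long): Unit = {
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = System.gc()
    }, intervalMinutes, intervalMinutes, TimeUnit.MINUTES)
  }
}

// In the driver, e.g. after creating the SparkContext:
// PeriodicDriverGc.start(intervalMinutes = 60)
```

Note that `System.gc()` is only a request: whether a full collection actually runs depends on JVM settings (it is ignored under `-XX:+DisableExplicitGC`, for example).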
Test build #32971 has finished for PR 6220 at commit
Merged build finished. Test PASSed.
Test PASSed.
It's probably worth documenting the driver GC workaround. The TTL-based mechanism won't work in many cases, such as streaming jobs that join streaming and historical data. Given that there are so many corner cases where TTL might not work as expected, I'm in favor of removing the documentation. There's probably only a handful of power users who would be able to use this safely while understanding all of the corner cases. We can still leave the setting in, but I'd like to avoid having a documented setting that's so unsafe to use. If you feel strongly that it should be documented, then I can see about updating its doc to give more warnings about the corner cases.
I wonder if we should do the periodic System.gc() ourselves.
An internal periodic GC timer, perhaps. If the driver fills up with too much in-memory metadata, then the GC will kick in and clean it up, so I guess we're only worried about cases where we run out of a non-memory resource, such as disk space, because GC wasn't run on the driver. You can probably back-of-the-envelope calculate the right GC interval based on your disk capacity and the maximum write throughput of your disks: if you have 100 gigabytes of temporary space for shuffle files and can only write at a maximum speed of 100 MB/s, then running GC at least once every 15 minutes should be sufficient to prevent the disks from filling up (since 100 gigabytes / (100 megabytes / second) ≈ 1000 seconds, about 16.7 minutes to fill the disks).
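That sizing rule as a runnable snippet; the capacity and throughput figures are just the ones from the comment above:

```scala
// Back-of-the-envelope GC-interval sizing from the comment above.
val tempSpaceBytes: Long = 100L * 1000 * 1000 * 1000 // 100 GB of shuffle scratch space
val writeBytesPerSec: Long = 100L * 1000 * 1000      // 100 MB/s peak write throughput

val secondsToFill = tempSpaceBytes.toDouble / writeBytesPerSec // = 1000 s
val minutesToFill = secondsToFill / 60                         // ≈ 16.7 min

// Any GC interval comfortably below that bound (e.g. 15 minutes)
// keeps orphaned shuffle files from filling the disks.
println(f"Disks fill in ~$minutesToFill%.1f minutes; a 15-minute GC interval fits.")
```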
Yeah, since it's a best-effort thing maybe it makes sense to do it more frequently. 15 minutes sounds fine to me.
Sounds good to me. Question is: do we add it to Spark 1.4? TD
We might be able to add this to 1.4 if we feature-flag it as off-by-default. We can recommend this as a replacement for spark.cleaner.ttl.
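How such an off-by-default flag might be wired, as a hedged sketch; the `spark.cleaner.periodicGC.*` keys here are hypothetical, not established Spark settings:

```scala
import org.apache.spark.SparkConf

// Hypothetical config keys, shown only to illustrate the off-by-default idea.
val conf = new SparkConf()
if (conf.getBoolean("spark.cleaner.periodicGC.enabled", false)) {
  val intervalSec = conf.getTimeAsSeconds("spark.cleaner.periodicGC.interval", "15min")
  // Reuse the timer sketched earlier in the thread.
  PeriodicDriverGc.start(intervalMinutes = intervalSec / 60)
}
```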
I am not sure if it is a good idea to completely remove MetadataCleaner.
Ok, but we should have ugly paragraph warnings that explain why it's a bad idea.
@@ -478,7 +478,12 @@ private[spark] object SparkConf extends Logging {
     DeprecatedConfig("spark.kryoserializer.buffer.mb", "1.4",
       "Please use spark.kryoserializer.buffer instead. The default value for " +
       "spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values " +
-      "are no longer accepted. To specify the equivalent now, one may use '64k'.")
+      "are no longer accepted. To specify the equivalent now, one may use '64k'."),
+    DeprecatedConfig("spark.cleaner.ttl", "1.4",
oops looks like this needs to be 1.5 now
Yeah, this patch is out of date; I was waiting until I had time to do the periodic GC timer feature. Feel free to work-steal if you want to pick this up :)
nope, it's all yours...
I'm not actively working on this and won't have time to get to it for a while, so I'm going to close this PR and unassign the JIRA in the hopes that someone else can take over.
With the introduction of ContextCleaner (in #126), I think there's no longer any reason for users to enable the MetadataCleaner / spark.cleaner.ttl. This patch removes the last remaining documentation for spark.cleaner.ttl and logs a deprecation warning if it is used.

I think that this configuration used to be relevant for Spark Streaming jobs, but that's no longer the case since the latest Streaming docs have removed all mentions of spark.cleaner.ttl (see https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817, for example). The TTL-based cleaning is not safe and may prematurely clean resources that are still being used, leading to confusing errors (such as https://issues.apache.org/jira/browse/SPARK-5594), so it generally should not be enabled (see http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html for an old, related discussion).

The only use case that I can think of is super-long-lived Spark REPLs, where you're worried about orphaning RDDs or broadcast variables in your REPL history and having them never get cleaned up, but I don't know that anyone uses spark.cleaner.ttl for this in practice.
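For that long-lived-session case, the safer alternative is explicit cleanup plus ContextCleaner's reference tracking; a hedged sketch (the app name and RDD/broadcast values are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Instead of spark.cleaner.ttl, release resources explicitly as soon as they
// are no longer needed; ContextCleaner reclaims whatever you forget, once the
// driver-side references are garbage-collected.
val conf = new SparkConf().setAppName("explicit-cleanup-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
val cached = sc.parallelize(1 to 1000)
  .map(i => (i, lookup.value.getOrElse(i, "?")))
  .cache()

cached.count()     // materialize the cached RDD

cached.unpersist() // drop the cached blocks explicitly
lookup.destroy()   // release the broadcast's blocks on the driver and executors
```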