spark-729: predictable closure capture #1322

Closed · wants to merge 10 commits

Conversation

@willb (Contributor) commented Jul 7, 2014

SPARK-729 concerns when free variables in closure arguments to transformations are captured. Currently, it is possible for closures to get the environment in which they are serialized (not the environment in which they are created). This PR causes free variables in closure arguments to RDD transformations to be captured at closure creation time by modifying ClosureCleaner to serialize and deserialize its argument.

This PR is based on #189 (which is closed) but has fixes to work with some changes in 1.0. In particular, it ensures that the cloned Broadcast objects produced by closure capture are registered with ContextCleaner so that broadcast variables won't become invalid simply because variable capture (implemented this way) causes strong references to the original broadcast variables to go away.

(See #189 for additional discussion and background.)
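
To make the capture-timing problem concrete, here is a minimal sketch (not code from this PR; `buildRdd` and the constants are illustrative) of a free variable changing between the time a transformation is defined and the time its closure is serialized for a job:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CaptureTimingSketch {
  // `factor` is a free variable of the closure passed to map(). Scala captures
  // vars by reference, so if the closure's environment is only snapshotted when
  // the closure is serialized (at job submission), tasks can see 10 rather than
  // the value 2 that `factor` held when map() was called. Capturing at creation
  // time, as this PR proposes, freezes the value 2 instead.
  def buildRdd(sc: SparkContext): RDD[Int] = {
    var factor = 2
    val scaled = sc.parallelize(1 to 4).map(_ * factor)
    factor = 10 // mutated after the transformation is defined, before any job runs
    scaled
  }
}
```

Under capture-at-creation semantics, `scaled.collect()` would yield 2, 4, 6, 8 regardless of the later mutation.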

willb added 10 commits July 4, 2014 11:26
This method allows code that needs access to the currently-active
ContextCleaner to access it via a DynamicVariable.
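
(A hedged sketch of the `DynamicVariable` pattern this commit message describes; `Cleaner`, `withCleaner`, and `activeCleaner` are illustrative stand-ins, not the PR's actual API.)

```scala
import scala.util.DynamicVariable

// Stand-in for Spark's internal ContextCleaner type.
trait Cleaner

object CleanerContext {
  private val current = new DynamicVariable[Option[Cleaner]](None)

  // Code evaluated inside `body` (including code several stack frames down)
  // can read the currently-active cleaner without it being passed as a parameter.
  def withCleaner[T](cleaner: Cleaner)(body: => T): T =
    current.withValue(Some(cleaner))(body)

  def activeCleaner: Option[Cleaner] = current.value
}
```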
The two tests added to ClosureCleanerSuite ensure that variable values
are captured at RDD definition time, not at job-execution time.
The environments of serializable closures are now captured as
part of closure cleaning. Since we already proactively check most
closures for serializability, ClosureCleaner.clean now returns
the result of deserializing the serialized version of the cleaned
closure.

Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
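
(A standalone sketch of the serialize-then-deserialize capture this commit message describes. It uses plain Java serialization for illustration; Spark's actual ClosureCleaner goes through its configured closure serializer and performs additional cleaning first.)

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object FreezeSketch {
  // Serializing and immediately deserializing a closure produces an independent
  // copy whose environment is fixed at this point; later mutations to variables
  // captured by the original no longer affect the copy.
  def freeze[T <: Serializable](closure: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(closure)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    in.readObject().asInstanceOf[T]
  }
}
```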
There are two possible cases for runJob calls: either they are called
by RDD action methods from inside Spark or they are called from client
code. There's no need to proactively check the closure argument to
runJob for serializability or force variable capture in either case:

1. if they are called by RDD actions, their closure arguments consist
of mapping an already-serializable closure (with an already-frozen
environment) to each element in the RDD;

2. in both cases, the closure is about to execute and thus the benefit
of proactively checking for serializability (or ensuring immediate
variable capture) is nonexistent.

(Note that ensuring capture via serializability on closure arguments to
runJob also causes pyspark accumulators to fail to update.)

Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
This splits the test identifying expected failures due to
closure serializability into three cases.
Conflicts:
	core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala
Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16383/

@mateiz (Contributor) commented Jul 23, 2014

Jenkins, add to whitelist and test this please

@mateiz (Contributor) commented Jul 23, 2014

@rxin you may want to look at this with your broadcast change

@willb (Contributor, Author) commented Jul 23, 2014

@mateiz, I think there's a memory blowup somewhere in this patch as it is and am trying to track it down. Coincidentally, it's what I was switching context back to when I saw this comment.

@rxin, can you point me to the broadcast change so I can track it?

@SparkQA commented Jul 23, 2014

QA tests have started for PR 1322. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17050/consoleFull

@rxin (Contributor) commented Jul 23, 2014

Matei was referring to #1498

@willb (Contributor, Author) commented Jul 23, 2014

Thanks, @rxin

@mateiz (Contributor) commented Jul 23, 2014

Alright, just ping me (with @mateiz) when you think it's ready.

@mateiz (Contributor) commented Aug 27, 2014

BTW @willb, if this is not ready, do you mind closing the PR and resending when it is? We'd like to minimize the number of open PRs that aren't actively being reviewed.

@willb (Contributor, Author) commented Aug 29, 2014

@mateiz sure; I've tracked down the problem but am a bit stumped by how to fix it. I'll reopen when I have a solution.

@willb closed this Aug 29, 2014
@mateiz (Contributor) commented Aug 30, 2014

Alright, feel free to describe this on the JIRA too if you'd like input.

kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
Co-authored-by: Russell Spitzer <russell.spitzer@gmail.com>