
Patch for SPARK-942 #50

Closed
wants to merge 29 commits

Conversation

kellrott
Contributor

@kellrott kellrott commented Mar 1, 2014

This is a port of a pull request originally targeted at incubator-spark: https://github.com/apache/incubator-spark/pull/180

Essentially, if a user returns a generative iterator (from a flatMap operation), then when trying to persist the data, Spark would first unroll the iterator into an ArrayBuffer and then try to figure out whether it could store the data. In cases where the user provided an iterator that generated more data than available memory, this would cause a crash. With this patch, if the user requests a persist with 'StorageLevel.DISK_ONLY', the iterator is unrolled as it is fed into the serializer.
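For illustration, a minimal sketch of the failure mode; the object name, data sizes, and local master setting are illustrative, not taken from the patch or its test suite:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LargeIteratorExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("large-iterator").setMaster("local[1]")
    val sc = new SparkContext(conf)

    // Each input element lazily expands into ~1M 1KB records; nothing is
    // materialized until the iterator is consumed.
    val huge = sc.parallelize(1 to 10, 1).flatMap { _ =>
      Iterator.range(0, 1000000).map(_ => new Array[Byte](1024))
    }

    // Without the patch, persisting unrolls the whole iterator into an
    // ArrayBuffer first and can crash with OOM; with the patch, DISK_ONLY
    // streams the iterator straight into the serializer.
    huge.persist(StorageLevel.DISK_ONLY)
    println(huge.count())
    sc.stop()
  }
}
```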

To do this, two changes were made:

  1. The type of the 'values' argument in the putValues method of the BlockStore interface was changed from ArrayBuffer to Iterator (and all code interfacing with this method was updated to match).
  2. The JavaSerializer now calls the ObjectOutputStream 'reset' method every 1000 objects (sketched below). This was done because the ObjectOutputStream caches objects (thus preventing them from being GC'd) in order to write more compact serialized output. If reset is never called, memory eventually fills up; if it is called too often, the serialized streams become much larger because of redundant class descriptions.
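For illustration, a standalone sketch of the reset idea from change 2; this mirrors the behavior described above, not the actual JavaSerializer code, and `resetEvery` stands in for the interval (1000 in the description, later made configurable via `spark.serializer.objectStreamReset`):

```scala
import java.io.{FileOutputStream, ObjectOutputStream}

// ObjectOutputStream keeps back-references to every object it has written
// so it can emit compact handles instead of re-serializing; reset() drops
// that table, making the referenced objects eligible for GC again.
class ResettingObjectStream(oos: ObjectOutputStream, resetEvery: Int = 1000) {
  private var counter = 0

  def writeObject(obj: AnyRef): Unit = {
    oos.writeObject(obj)
    counter += 1
    if (counter >= resetEvery) {
      oos.reset() // clear the cache; later writes repeat class descriptors
      counter = 0
    }
  }

  def close(): Unit = oos.close()
}

// Usage sketch:
//   val out = new ResettingObjectStream(
//     new ObjectOutputStream(new FileOutputStream("/tmp/block")))
//   bigIterator.foreach(out.writeObject)
//   out.close()
```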

…the serializer when a 'DISK_ONLY' persist is called.

This is in response to SPARK-942.
…ffer objects. This was previously done higher up the stack.
Conflicts:
	core/src/main/scala/org/apache/spark/CacheManager.scala
… system variable 'spark.serializer.objectStreamReset'; default is now 10000.
…Buffer (rather than an Iterator).

This will allow BlockStores to have slightly different behaviors depending on whether they get an
Iterator or an ArrayBuffer. In the case of the MemoryStore, it needs to duplicate and cache an Iterator
into an ArrayBuffer, but if handed an ArrayBuffer it can skip the duplication (see the sketch below).
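A hedged sketch of the shape this suggests; the trait and case class names are assumptions based on the commit messages, not copied from the patch:

```scala
import scala.collection.mutable.ArrayBuffer

// Assumed shape: a small 'Values' trait lets each BlockStore see whether it
// was handed an already materialized ArrayBuffer or a lazy Iterator.
sealed trait Values
case class ArrayBufferValues(buffer: ArrayBuffer[Any]) extends Values
case class IteratorValues(iterator: Iterator[Any]) extends Values

object MemoryStoreSketch {
  // Stand-in for the MemoryStore's actual caching logic.
  private def cache(buf: ArrayBuffer[Any]): Unit =
    println(s"cached ${buf.length} values")

  def putValues(values: Values): Unit = values match {
    case ArrayBufferValues(buf) =>
      cache(buf)            // already materialized: skip the duplication
    case IteratorValues(it) =>
      val buf = new ArrayBuffer[Any]()
      buf ++= it            // must unroll the iterator before caching
      cache(buf)
  }
}
```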
…5 seconds. Confirmed that it still crashes an unpatched copy of Spark.
…rs. It doesn't try to invoke an OOM error any more
…. Now using trait 'Values'. Also modified BlockStore.putBytes call to return PutResult, so that it behaves like putValues.
…k into iterator-to-disk

Conflicts:
	core/src/test/scala/org/apache/spark/storage/LargeIteratorSuite.scala
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12985/

@kellrott
Contributor Author

kellrott commented Mar 6, 2014

I think I've covered all the formatting requests. Any other issues?

@pwendell
Contributor

pwendell commented Mar 6, 2014

Thanks @kellrott for this patch - sorry it took us a long time to review it. I'm going to merge this now.

@pwendell
Contributor

pwendell commented Mar 6, 2014

I've created SPARK-1201 (https://spark-project.atlassian.net/browse/SPARK-1201) to cover optimizations in cases other than DISK_ONLY.

@asfgit asfgit closed this in 40566e1 Mar 6, 2014
jhartlaub referenced this pull request in jhartlaub/spark May 27, 2014
Fix race condition in SparkListenerSuite (fixes SPARK-908).

(cherry picked from commit 215238c)
Signed-off-by: Reynold Xin <rxin@apache.org>
vlad17 pushed a commit to vlad17/spark that referenced this pull request Aug 23, 2016
## What changes were proposed in this pull request?
In Databricks, `SPARK_DIST_CLASSPATH` is used for the driver classpath and `SPARK_JARS_DIR` is empty. So, we need to add `SPARK_DIST_CLASSPATH` to the `LAUNCH_CLASSPATH`. We cannot remove `SPARK_JARS_DIR` because Spark unit tests actually use it.

Author: Yin Huai <yhuai@databricks.com>

Closes apache#50 from yhuai/Add-SPARK_DIST_CLASSPATH-toLAUNCH_CLASSPATH.
clockfly pushed a commit to clockfly/spark that referenced this pull request Aug 30, 2016
## What changes were proposed in this pull request?
In Databricks, `SPARK_DIST_CLASSPATH` is used for the driver classpath and `SPARK_JARS_DIR` is empty. So, we need to add `SPARK_DIST_CLASSPATH` to the `LAUNCH_CLASSPATH`. We cannot remove `SPARK_JARS_DIR` because Spark unit tests actually use it.

Author: Yin Huai <yhuai@databricks.com>

Closes apache#50 from yhuai/Add-SPARK_DIST_CLASSPATH-toLAUNCH_CLASSPATH.
ash211 added a commit to ash211/spark that referenced this pull request Jan 31, 2017
* Create README to better describe project purpose

* Add links to usage guide and dev docs

* Minor changes
lins05 pushed a commit to lins05/spark that referenced this pull request Apr 23, 2017
* Create README to better describe project purpose

* Add links to usage guide and dev docs

* Minor changes
erikerlandson pushed a commit to erikerlandson/spark that referenced this pull request Jul 28, 2017
* Create README to better describe project purpose

* Add links to usage guide and dev docs

* Minor changes
jlopezmalla pushed a commit to jlopezmalla/spark that referenced this pull request Sep 13, 2017
marcosdotps pushed a commit to marcosdotps/spark that referenced this pull request Sep 13, 2017
* Refactor and Test of ConfigSecurity

* [SPK-64] removed ssl tricks on spark-env (apache#50)
jlopezmalla pushed a commit to jlopezmalla/spark that referenced this pull request Nov 3, 2017
* removed ssl tricks on spark-env

* test phase activated

* added changes requested from jlopez-malla

* changed properties and fixed typos

* changed signature for methods
gcz2022 pushed a commit to gcz2022/spark that referenced this pull request Jul 30, 2018
Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018
luzhonghao pushed a commit to luzhonghao/spark that referenced this pull request Dec 11, 2018
cloud-fan pushed a commit to cloud-fan/spark that referenced this pull request Jan 16, 2019
mccheah pushed a commit to mccheah/spark that referenced this pull request Feb 14, 2019
hejian991 pushed a commit to growingio/spark that referenced this pull request Jun 24, 2019
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
Enable Octavia in LBaaS test of terraform-openstack-provider
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 19, 2019
…-spark:bump_lineage_logging_211 to netflix/2.1.1-unstable

Squashed commit of the following:

commit 347c0be48e6613b07d67b6efa9247e116b3a99b2
Author: Daniel Watson <dwatson@netflix.com>
Date:   Tue Oct 8 09:55:43 2019 -0700

    NETFLIX-BUILD: Bump lineage-logging to 0.1.20
fishcus pushed a commit to fishcus/spark that referenced this pull request Jul 8, 2020
* apache#49 add more metrics to application-source

* upgrade hadoop to 2.7.1

* apache#49 add request_cores to master json

* Revert "upgrade hadoop to 2.7.1"

This reverts commit 2db019d.

* upgrade kylin to 2.4.1-kylin-r38

* fix ut
microbearz added a commit to microbearz/spark that referenced this pull request Dec 15, 2020
* apache#49 add more metrics to application-source

* upgrade hadoop to 2.7.1

* apache#49 add request_cores to master json

* Revert "upgrade hadoop to 2.7.1"

This reverts commit 2db019d.

* upgrade kylin to 2.4.1-kylin-r38

* fix ut