
Patch for SPARK-942 #50

Closed
wants to merge 29 commits

Conversation

kellrott
Contributor

@kellrott kellrott commented Mar 1, 2014

This is a port of a pull request originally targeted at incubator-spark: https://github.com/apache/incubator-spark/pull/180

Essentially, if a user returns a generative iterator (from a flatMap operation), then when trying to persist the data, Spark would first unroll the iterator into an ArrayBuffer and then try to figure out whether it could store the data. In cases where the user provided an iterator that generated more data than available memory, this would cause a crash. With this patch, if the user requests a persist with 'StorageLevel.DISK_ONLY', the iterator is unrolled as it is fed into the serializer.
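For illustration, a minimal sketch of the failure mode; the object name, data sizes, and local master setting are illustrative, not taken from the patch or its test suite:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LargeIteratorExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("large-iterator").setMaster("local[1]")
    val sc = new SparkContext(conf)

    // Each input element lazily expands into ~1M 1KB records; nothing is
    // materialized until the iterator is consumed.
    val huge = sc.parallelize(1 to 10, 1).flatMap { _ =>
      Iterator.range(0, 1000000).map(_ => new Array[Byte](1024))
    }

    // Without the patch, persisting unrolls the whole iterator into an
    // ArrayBuffer first and can crash with OOM; with the patch, DISK_ONLY
    // streams the iterator straight into the serializer.
    huge.persist(StorageLevel.DISK_ONLY)
    println(huge.count())
    sc.stop()
  }
}
```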

To do this, two changes were made:

  1. The type of the 'values' argument in the putValues method of the BlockStore interface was changed from ArrayBuffer to Iterator (and all code interfacing with this method was updated to match).
  2. The JavaSerializer now calls the ObjectOutputStream 'reset' method every 1000 objects (sketched below). This was done because the ObjectOutputStream caches objects (thus preventing them from being GC'd) in order to write more compact serialized output. If reset is never called, memory eventually fills up; if it is called too often, the serialized streams become much larger because of redundant class descriptions.
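For illustration, a standalone sketch of the reset idea from change 2; this mirrors the behavior described above, not the actual JavaSerializer code, and `resetEvery` stands in for the interval (1000 in the description, later made configurable via `spark.serializer.objectStreamReset`):

```scala
import java.io.{FileOutputStream, ObjectOutputStream}

// ObjectOutputStream keeps back-references to every object it has written
// so it can emit compact handles instead of re-serializing; reset() drops
// that table, making the referenced objects eligible for GC again.
class ResettingObjectStream(oos: ObjectOutputStream, resetEvery: Int = 1000) {
  private var counter = 0

  def writeObject(obj: AnyRef): Unit = {
    oos.writeObject(obj)
    counter += 1
    if (counter >= resetEvery) {
      oos.reset() // clear the cache; later writes repeat class descriptors
      counter = 0
    }
  }

  def close(): Unit = oos.close()
}

// Usage sketch:
//   val out = new ResettingObjectStream(
//     new ObjectOutputStream(new FileOutputStream("/tmp/block")))
//   bigIterator.foreach(out.writeObject)
//   out.close()
```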

…the serializer when a 'DISK_ONLY' persist is called.

This is in response to SPARK-942.
…ffer objects. This was previously done higher up the stack.
Conflicts:
	core/src/main/scala/org/apache/spark/CacheManager.scala
… system variable 'spark.serializer.objectStreamReset'; default is now 10000.
…Buffer (rather than an Iterator).

This will allow BlockStores to have slightly different behaviors depending on whether they get an
Iterator or an ArrayBuffer. In the case of the MemoryStore, it needs to duplicate and cache an Iterator
into an ArrayBuffer, but if handed an ArrayBuffer it can skip the duplication (see the sketch below).
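A hedged sketch of the shape this suggests; the trait and case class names are assumptions based on the commit messages, not copied from the patch:

```scala
import scala.collection.mutable.ArrayBuffer

// Assumed shape: a small 'Values' trait lets each BlockStore see whether it
// was handed an already materialized ArrayBuffer or a lazy Iterator.
sealed trait Values
case class ArrayBufferValues(buffer: ArrayBuffer[Any]) extends Values
case class IteratorValues(iterator: Iterator[Any]) extends Values

object MemoryStoreSketch {
  // Stand-in for the MemoryStore's actual caching logic.
  private def cache(buf: ArrayBuffer[Any]): Unit =
    println(s"cached ${buf.length} values")

  def putValues(values: Values): Unit = values match {
    case ArrayBufferValues(buf) =>
      cache(buf)            // already materialized: skip the duplication
    case IteratorValues(it) =>
      val buf = new ArrayBuffer[Any]()
      buf ++= it            // must unroll the iterator before caching
      cache(buf)
  }
}
```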
…5 seconds. Confirmed that it still crashes an unpatched copy of Spark.
…rs. It doesn't try to invoke an OOM error any more
…. Now using trait 'Values'. Also modified BlockStore.putBytes call to return PutResult, so that it behaves like putValues.
…k into iterator-to-disk

Conflicts:
	core/src/test/scala/org/apache/spark/storage/LargeIteratorSuite.scala
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12985/

@kellrott
Contributor Author

kellrott commented Mar 6, 2014

I think I've covered all the formatting requests. Any other issues?

@pwendell
Contributor

pwendell commented Mar 6, 2014

Thanks @kellrott for this patch - sorry it took us a long time to review it. I'm going to merge this now.

@pwendell
Contributor

pwendell commented Mar 6, 2014

I've created SPARK-1201 (https://spark-project.atlassian.net/browse/SPARK-1201) to cover optimizations in cases other than DISK_ONLY.

@asfgit asfgit closed this in 40566e1 Mar 6, 2014
jhartlaub referenced this pull request in jhartlaub/spark May 27, 2014
Fix race condition in SparkListenerSuite (fixes SPARK-908).

(cherry picked from commit 215238c)
Signed-off-by: Reynold Xin <rxin@apache.org>
vlad17 pushed a commit to vlad17/spark that referenced this pull request Aug 23, 2016
## What changes were proposed in this pull request?
In Databricks, `SPARK_DIST_CLASSPATH` is used for the driver classpath and `SPARK_JARS_DIR` is empty. So, we need to add `SPARK_DIST_CLASSPATH` to the `LAUNCH_CLASSPATH`. We cannot remove `SPARK_JARS_DIR` because Spark unit tests actually use it.

Author: Yin Huai <yhuai@databricks.com>

Closes apache#50 from yhuai/Add-SPARK_DIST_CLASSPATH-toLAUNCH_CLASSPATH.
clockfly pushed a commit to clockfly/spark that referenced this pull request Aug 30, 2016
## What changes were proposed in this pull request?
In Databricks, `SPARK_DIST_CLASSPATH` is used for the driver classpath and `SPARK_JARS_DIR` is empty. So, we need to add `SPARK_DIST_CLASSPATH` to the `LAUNCH_CLASSPATH`. We cannot remove `SPARK_JARS_DIR` because Spark unit tests actually use it.

Author: Yin Huai <yhuai@databricks.com>

Closes apache#50 from yhuai/Add-SPARK_DIST_CLASSPATH-toLAUNCH_CLASSPATH.
ash211 added a commit to ash211/spark that referenced this pull request Jan 31, 2017
* Create README to better describe project purpose

* Add links to usage guide and dev docs

* Minor changes
lins05 pushed a commit to lins05/spark that referenced this pull request Apr 23, 2017
* Create README to better describe project purpose

* Add links to usage guide and dev docs

* Minor changes
erikerlandson pushed a commit to erikerlandson/spark that referenced this pull request Jul 28, 2017
* Create README to better describe project purpose

* Add links to usage guide and dev docs

* Minor changes
jlopezmalla pushed a commit to jlopezmalla/spark that referenced this pull request Sep 13, 2017
marcosdotps pushed a commit to marcosdotps/spark that referenced this pull request Sep 13, 2017
* Refactor and Test of ConfigSecurity

* [SPK-64] removed ssl tricks on spark-env (apache#50)
jlopezmalla pushed a commit to jlopezmalla/spark that referenced this pull request Nov 3, 2017
* removed ssl tricks on spark-env

* test phase activated

* added changes requested from jlopez-malla

* changed properties and fixed typos

* changed signature for methods
gcz2022 pushed a commit to gcz2022/spark that referenced this pull request Jul 30, 2018
Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018
luzhonghao pushed a commit to luzhonghao/spark that referenced this pull request Dec 11, 2018
cloud-fan pushed a commit to cloud-fan/spark that referenced this pull request Jan 16, 2019
mccheah pushed a commit to mccheah/spark that referenced this pull request Feb 14, 2019
hejian991 pushed a commit to growingio/spark that referenced this pull request Jun 24, 2019
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
Enable Octavia in LBaaS test of terraform-openstack-provider
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 19, 2019
…-spark:bump_lineage_logging_211 to netflix/2.1.1-unstable

Squashed commit of the following:

commit 347c0be48e6613b07d67b6efa9247e116b3a99b2
Author: Daniel Watson <dwatson@netflix.com>
Date:   Tue Oct 8 09:55:43 2019 -0700

    NETFLIX-BUILD: Bump lineage-logging to 0.1.20
fishcus pushed a commit to fishcus/spark that referenced this pull request Jul 8, 2020
* apache#49 add more metrics to application-source

* upgrade hadoop to 2.7.1

* apache#49 add request_cores to master json

* Revert "upgrade hadoop to 2.7.1"

This reverts commit 2db019d.

* upgrade kylin to 2.4.1-kylin-r38

* fix ut
microbearz added a commit to microbearz/spark that referenced this pull request Dec 15, 2020
* apache#49 add more metrics to application-source

* upgrade hadoop to 2.7.1

* apache#49 add request_cores to master json

* Revert "upgrade hadoop to 2.7.1"

This reverts commit 2db019d.

* upgrade kylin to 2.4.1-kylin-r38

* fix ut