
[SPARK-5654] Integrate SparkR #5096

Closed
wants to merge 940 commits into from

Conversation

shivaram
Contributor

This pull request integrates SparkR, an R frontend for Spark. The SparkR package contains both RDD and DataFrame APIs in R and is integrated with Spark's submission scripts to work with different cluster managers.

Some integration points that would be great to get feedback on:

  1. Build procedure: SparkR requires R to be installed on the build machine. Right now we have a new Maven profile, -PsparkR, that can be used to enable SparkR builds.
  2. YARN cluster mode: The R package that is built needs to be present on the driver and all the worker nodes during execution. The R package location is currently set using SPARK_HOME, but this might not work in YARN cluster mode.
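The build-procedure point above can be sketched as a small shell helper. Only the `-PsparkR` profile name comes from this PR; the function wrapper, the fallback command, and the use of `Rscript` as the R availability probe are illustrative assumptions, not Spark's actual build logic:

```shell
# Hypothetical helper: choose Maven flags depending on whether R is installed
# on the build machine. Checks for Rscript, since SparkR needs R to build.
sparkr_build_cmd() {
  if command -v Rscript >/dev/null 2>&1; then
    # R is available: enable the SparkR build via the new Maven profile
    echo "mvn -PsparkR -DskipTests clean package"
  else
    # R is missing: fall back to a build without SparkR
    echo "mvn -DskipTests clean package"
  fi
}

sparkr_build_cmd
```

In a real build one would run the printed command directly rather than echoing it; the echo here just makes the profile selection visible.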

The SparkR package represents the work of many contributors; below is a list of people along with the areas they worked on:

edwardt (@edwart) - Documentation improvements
Felix Cheung (@felixcheung) - Documentation improvements
Hossein Falaki (@falaki) - Documentation improvements
Chris Freeman (@cafreeman) - DataFrame API, Programming Guide
Todd Gao (@7c00) - R worker Internals
Ryan Hafen (@hafen) - SparkR Internals
Qian Huang (@hqzizania) - RDD API
Hao Lin (@hlin09) - RDD API, Closure cleaner
Evert Lammerts (@evertlammerts) - DataFrame API
Davies Liu (@davies) - DataFrame API, R worker internals, Merging with Spark
Yi Lu (@lythesia) - RDD API, Worker internals
Matt Massie (@massie) - Jenkins build
Harihar Nahak (@hnahak87) - SparkR examples
Oscar Olmedo (@oscaroboto) - Spark configuration
Antonio Piccolboni (@piccolbo) - SparkR examples, Namespace bug fixes
Dan Putler (@dputler) - DataFrame API, SparkR Install Guide
Ashutosh Raina (@ashutoshraina) - Build improvements
Josh Rosen (@JoshRosen) - Travis CI build
Sun Rui (@sun-rui) - RDD API, JVM Backend, Shuffle improvements
Shivaram Venkataraman (@shivaram) - RDD API, JVM Backend, Worker Internals
Zongheng Yang (@concretevitamin) - RDD API, Pipelined RDDs, Examples and EC2 guide

davies and others added 30 commits March 2, 2015 13:47
define generic for 'first' in RDD API
[SPARKR-189] [SPARKR-190] Column and expression
… group

Conflicts:
	pkg/NAMESPACE
	pkg/R/DataFrame.R
	pkg/R/utils.R
	pkg/inst/tests/test_sparkSQL.R
Updated column to use `functions` instead of `Dsl` in accordance with the new API changes.

Also created separate classes for `asc` and `desc`.
New 1.3 repo and updates to `column.R`
New DataFrame methods:

- `join`
- `sort`
- `orderBy`
- `filter`
- `where`
[SparkR-209] `join`, `sort`, `filter` methods for DataFrame
@shivaram
Contributor Author

shivaram commented Apr 8, 2015

Jenkins, retest this please (is the fourth time lucky?)

@SparkQA

SparkQA commented Apr 8, 2015

Test build #29870 has started for PR 5096 at commit 59266d1.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29870/
Test FAILed.

@brennonyork

@shivaram a few things after looking at the build code some more...

  1. The timeout value comes from the line here in dev/run-tests-jenkins. It's currently set at 120 minutes, and it doesn't include the time it takes for PRs to be tested against the master branch (i.e. for dependencies). We could certainly up that value, but since I'm assuming the dev/run-tests script on this PR runs all the new SparkR tests (plus any additional tests for core Spark you've added), I'd ask that you run dev/run-tests locally and update the timeout in dev/run-tests-jenkins for this PR with whatever additional time is needed. The impetus for running locally first is that I'd much rather get a baseline for what it takes for all the new tests to run, and then add 15-ish minutes of headroom, rather than throw a number into the wind.
  2. Completely agree we should get some timing metrics for the various PR tests (thanks for the idea!). I'll generate a JIRA for that and take a look soon. That said, just to reiterate, those tests are not holding up the actual Spark test suite from finishing, unless Jenkins has some deeper timing hooks than I know about. I assume it's merely a factor of the large corpus of tests that were likely added in this PR.
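The "time it locally, then pad" suggestion in point 1 can be sketched as a tiny helper. The function name and the rounding scheme are illustrative; only the ~15-minute headroom figure comes from the comment above:

```shell
# Hypothetical helper: derive a padded Jenkins timeout (in minutes) from a
# measured dev/run-tests wall-clock time in seconds, adding 15 minutes of
# headroom as suggested above.
padded_timeout_min() {
  secs=$1
  # round the measured time up to whole minutes, then add the headroom
  echo $(( (secs + 59) / 60 + 15 ))
}

# e.g. a 117-minute (7020 s) local run would suggest a 132-minute timeout
padded_timeout_min 7020
```

In practice the seconds value would come from timing the suite, e.g. `start=$(date +%s); ./dev/run-tests; padded_timeout_min $(( $(date +%s) - start ))`.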


def readBoolean(in: DataInputStream): Boolean = {
  val intVal = in.readInt()
  if (intVal == 0) false else true
}
Contributor
can be simplified to `intVal != 0`

@andrewor14
Contributor

SparkSubmit parts LGTM. We should merge this soon so people can start testing this well in advance of the release window.

@shivaram
Contributor Author

shivaram commented Apr 8, 2015

@brennonyork The overall Jenkins build runner has a timeout of 130 minutes right now (cc @shaneknapp). So all the RAT tests, Mima checks, style checks, new dependencies plus all the unit tests have to run within 130 minutes and this PR seems to be failing that.

@shaneknapp can we increase the 130 min timeout to, say, 140 minutes?

@shaneknapp
Contributor

i'll up it to 180, just so we have some headroom.


@shivaram
Contributor Author

shivaram commented Apr 8, 2015

Thanks @shaneknapp! Could you re-trigger this build once it's upped?

@shaneknapp
Contributor

jenkins, test this please

@shaneknapp
Contributor

also, i can't believe how long this build is... sad panda etc.

@SparkQA

SparkQA commented Apr 8, 2015

Test build #29894 has started for PR 5096 at commit 59266d1.

@JoshRosen
Contributor

also, i can't believe how long this build is... sad panda etc.

Test parallelization is going to be a lot of work, but I think we could see huge speedups for the pull request builders if we didn't run all tests for every PR. Most PRs touch the higher-level libraries and not core, so it should be safe to skip most of the tests if core hasn't been modified.
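The module-skipping idea above can be sketched roughly as follows. The function, the `core/` prefix check, and the "one suite per top-level directory" mapping are all illustrative assumptions, not Spark's actual test-runner logic:

```shell
# Hypothetical sketch: decide which test suites to run from a list of
# changed file paths. If core/ was touched, run everything; otherwise run
# only the suites for the top-level modules that actually changed.
suites_for_changes() {
  changed="$1"                       # newline-separated changed paths
  if printf '%s\n' "$changed" | grep -q '^core/'; then
    echo "all"                       # core changed: run the full suite
  else
    # map each top-level directory to its own suite, de-duplicated
    printf '%s\n' "$changed" | cut -d/ -f1 | sort -u
  fi
}

suites_for_changes "sql/a.scala
mllib/b.scala"
```

In a PR builder the changed-path list would come from something like `git diff --name-only master...HEAD` before invoking the helper.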

@SparkQA

SparkQA commented Apr 8, 2015

Test build #29894 has finished for PR 5096 at commit 59266d1.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29894/
Test FAILed.

@SparkQA

SparkQA commented Apr 9, 2015

Test build #29908 has started for PR 5096 at commit bac3a6b.

@pwendell
Contributor

pwendell commented Apr 9, 2015

@shivaram - hey one thing I forgot to ask, how much time do the SparkR tests add to the overall Spark tests?

@shivaram
Contributor Author

shivaram commented Apr 9, 2015

@pwendell It's around 2 minutes on my laptop. Here is the output on my machine:

time ./run-tests.sh
....
....
./run-tests.sh  1:56.96 total

@SparkQA

SparkQA commented Apr 9, 2015

Test build #29908 has finished for PR 5096 at commit bac3a6b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29908/
Test FAILed.

@SparkQA

SparkQA commented Apr 9, 2015

Test build #29919 has started for PR 5096 at commit da64742.

@SparkQA

SparkQA commented Apr 9, 2015

Test build #29919 has finished for PR 5096 at commit da64742.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29919/
Test PASSed.

@shivaram
Contributor Author

shivaram commented Apr 9, 2015

Thanks @andrewor14 @pwendell for the reviews. Now that Jenkins is happy I am going to merge this in, and I'll file follow-up issues for things like YARN cluster mode which we didn't get to in this PR.

@concretevitamin
Contributor

👍

@asfgit closed this in 2fe0a1a on Apr 9, 2015