
[SPARK-5654] Integrate SparkR #5096

Closed
wants to merge 940 commits into from

Conversation

shivaram
Contributor

This pull request integrates SparkR, an R frontend for Spark. The SparkR package contains both RDD and DataFrame APIs in R and is integrated with Spark's submission scripts to work with different cluster managers.

Some integration points that would be great to get feedback on:

  1. Build procedure: SparkR requires R to be installed on the build machine. Right now we have a new Maven profile, -PsparkR, that can be used to enable SparkR builds.
  2. YARN cluster mode: The R package that is built needs to be present on the driver and all the worker nodes during execution. The R package location is currently set using SPARK_HOME, but this might not work in YARN cluster mode.
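The build-procedure point above can be sketched as a small shell helper. Only the `-PsparkR` profile name comes from this PR; the function wrapper, the fallback command, and the use of `Rscript` as the R availability probe are illustrative assumptions, not Spark's actual build logic:

```shell
# Hypothetical helper: choose Maven flags depending on whether R is installed
# on the build machine. Checks for Rscript, since SparkR needs R to build.
sparkr_build_cmd() {
  if command -v Rscript >/dev/null 2>&1; then
    # R is available: enable the SparkR build via the new Maven profile
    echo "mvn -PsparkR -DskipTests clean package"
  else
    # R is missing: fall back to a build without SparkR
    echo "mvn -DskipTests clean package"
  fi
}

sparkr_build_cmd
```

In a real build one would run the printed command directly rather than echoing it; the echo here just makes the profile selection visible.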

The SparkR package represents the work of many contributors; below is a list of people along with the areas they worked on:

edwardt (@edwart) - Documentation improvements
Felix Cheung (@felixcheung) - Documentation improvements
Hossein Falaki (@falaki) - Documentation improvements
Chris Freeman (@cafreeman) - DataFrame API, Programming Guide
Todd Gao (@7c00) - R worker Internals
Ryan Hafen (@hafen) - SparkR Internals
Qian Huang (@hqzizania) - RDD API
Hao Lin (@hlin09) - RDD API, Closure cleaner
Evert Lammerts (@evertlammerts) - DataFrame API
Davies Liu (@davies) - DataFrame API, R worker internals, Merging with Spark
Yi Lu (@lythesia) - RDD API, Worker internals
Matt Massie (@massie) - Jenkins build
Harihar Nahak (@hnahak87) - SparkR examples
Oscar Olmedo (@oscaroboto) - Spark configuration
Antonio Piccolboni (@piccolbo) - SparkR examples, Namespace bug fixes
Dan Putler (@dputler) - DataFrame API, SparkR Install Guide
Ashutosh Raina (@ashutoshraina) - Build improvements
Josh Rosen (@JoshRosen) - Travis CI build
Sun Rui (@sun-rui) - RDD API, JVM Backend, Shuffle improvements
Shivaram Venkataraman (@shivaram) - RDD API, JVM Backend, Worker Internals
Zongheng Yang (@concretevitamin) - RDD API, Pipelined RDDs, Examples and EC2 guide

davies and others added 30 commits March 2, 2015 13:47
define generic for 'first' in RDD API
[SPARKR-189] [SPARKR-190] Column and expression
… group

Conflicts:
	pkg/NAMESPACE
	pkg/R/DataFrame.R
	pkg/R/utils.R
	pkg/inst/tests/test_sparkSQL.R
Updated column to use `functions` instead of `Dsl` in accordance with the new API changes.

Also created separate classes for `asc` and `desc`.
New 1.3 repo and updates to `column.R`
New DataFrame methods:

- `join`
- `sort`
- `orderBy`
- `filter`
- `where`
[SparkR-209] `join`, `sort`, `filter` methods for DataFrame
@shivaram
Contributor Author

shivaram commented Apr 8, 2015

Jenkins, retest this please (is the fourth time lucky?)

@SparkQA

SparkQA commented Apr 8, 2015

Test build #29870 has started for PR 5096 at commit 59266d1.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29870/
Test FAILed.

@brennonyork

@shivaram a few things after looking at the build code some more...

  1. The timeout value comes from the line here in dev/run-tests-jenkins. It's currently set at 120 minutes, and it doesn't include the time it takes for PRs to be tested against the master branch (i.e. for dependencies). We could certainly up that value, but since I'm assuming the dev/run-tests script on this PR runs all the new SparkR tests (plus any additional tests for core Spark you've added), I'd ask that you run dev/run-tests locally and update the timeout in dev/run-tests-jenkins for this PR with whatever additional time is needed. The impetus for running locally first is that I'd much rather get a baseline for what it takes for all the new tests to run, and then add 15-ish minutes of headroom, rather than throw a number into the wind.
  2. Completely agree we should get some timing metrics for the various PR tests (thanks for the idea!). I'll generate a JIRA for that and take a look soon. That said, just to reiterate, those tests are not holding up the actual Spark test suite from finishing, unless Jenkins has some deeper timing hooks than I know about. I assume it's merely a factor of the large corpus of tests that were likely added in this PR.
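The "time it locally, then pad" suggestion in point 1 can be sketched as a tiny helper. The function name and the rounding scheme are illustrative; only the ~15-minute headroom figure comes from the comment above:

```shell
# Hypothetical helper: derive a padded Jenkins timeout (in minutes) from a
# measured dev/run-tests wall-clock time in seconds, adding 15 minutes of
# headroom as suggested above.
padded_timeout_min() {
  secs=$1
  # round the measured time up to whole minutes, then add the headroom
  echo $(( (secs + 59) / 60 + 15 ))
}

# e.g. a 117-minute (7020 s) local run would suggest a 132-minute timeout
padded_timeout_min 7020
```

In practice the seconds value would come from timing the suite, e.g. `start=$(date +%s); ./dev/run-tests; padded_timeout_min $(( $(date +%s) - start ))`.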


def readBoolean(in: DataInputStream): Boolean = {
  val intVal = in.readInt()
  if (intVal == 0) false else true
}
Contributor
can be simplified to `intVal != 0`

@andrewor14
Contributor

SparkSubmit parts LGTM. We should merge this soon so people can start testing this well in advance of the release window.

@shivaram
Contributor Author

shivaram commented Apr 8, 2015

@brennonyork The overall Jenkins build runner has a timeout of 130 minutes right now (cc @shaneknapp). So all the RAT tests, Mima checks, style checks, new dependencies plus all the unit tests have to run within 130 minutes and this PR seems to be failing that.

@shaneknapp can we increase the 130 min timeout to, say, 140 minutes?

@shaneknapp
Contributor

i'll up it to 180, just so we have some headroom.


@shivaram
Contributor Author

shivaram commented Apr 8, 2015

Thanks @shaneknapp! Could you re-trigger this build once it's upped?

@shaneknapp
Contributor

jenkins, test this please

@shaneknapp
Contributor

also, i can't believe how long this build is... sad panda etc.

@SparkQA

SparkQA commented Apr 8, 2015

Test build #29894 has started for PR 5096 at commit 59266d1.

@JoshRosen
Contributor

also, i can't believe how long this build is... sad panda etc.

Test parallelization is going to be a lot of work, but I think we could see huge speedups for the pull request builders if we didn't run all tests for every PR. Most PRs touch the higher-level libraries and not core, so it should be safe to skip most of the tests if core hasn't been modified.
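The module-skipping idea above can be sketched roughly as follows. The function, the `core/` prefix check, and the "one suite per top-level directory" mapping are all illustrative assumptions, not Spark's actual test-runner logic:

```shell
# Hypothetical sketch: decide which test suites to run from a list of
# changed file paths. If core/ was touched, run everything; otherwise run
# only the suites for the top-level modules that actually changed.
suites_for_changes() {
  changed="$1"                       # newline-separated changed paths
  if printf '%s\n' "$changed" | grep -q '^core/'; then
    echo "all"                       # core changed: run the full suite
  else
    # map each top-level directory to its own suite, de-duplicated
    printf '%s\n' "$changed" | cut -d/ -f1 | sort -u
  fi
}

suites_for_changes "sql/a.scala
mllib/b.scala"
```

In a PR builder the changed-path list would come from something like `git diff --name-only master...HEAD` before invoking the helper.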

@SparkQA

SparkQA commented Apr 8, 2015

Test build #29894 has finished for PR 5096 at commit 59266d1.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29894/
Test FAILed.

@SparkQA

SparkQA commented Apr 9, 2015

Test build #29908 has started for PR 5096 at commit bac3a6b.

@pwendell
Contributor

pwendell commented Apr 9, 2015

@shivaram - hey one thing I forgot to ask, how much time do the SparkR tests add to the overall Spark tests?

@shivaram
Contributor Author

shivaram commented Apr 9, 2015

@pwendell It's around 2 minutes on my laptop. Here is the output on my machine:

time ./run-tests.sh
....
....
./run-tests.sh  1:56.96 total

@SparkQA

SparkQA commented Apr 9, 2015

Test build #29908 has finished for PR 5096 at commit bac3a6b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29908/
Test FAILed.

@SparkQA

SparkQA commented Apr 9, 2015

Test build #29919 has started for PR 5096 at commit da64742.

@SparkQA

SparkQA commented Apr 9, 2015

Test build #29919 has finished for PR 5096 at commit da64742.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29919/
Test PASSed.

@shivaram
Contributor Author

shivaram commented Apr 9, 2015

Thanks @andrewor14 @pwendell for the reviews. Now that Jenkins is happy I am going to merge this in, and I'll file follow-up issues for things like YARN cluster mode which we didn't get to in this PR.

@concretevitamin
Contributor

👍

@asfgit closed this in 2fe0a1a on Apr 9, 2015