[SPARK-5654] Integrate SparkR #5096
Conversation
define generic for 'first' in RDD API
[SPARKR-189] [SPARKR-190] Column and expression
… api Conflicts: pkg/R/RDD.R
… group Conflicts: pkg/NAMESPACE pkg/R/DataFrame.R pkg/R/utils.R pkg/inst/tests/test_sparkSQL.R
Conflicts: pkg/NAMESPACE pkg/R/DataFrame.R
Updated column to use `functions` instead of `Dsl` in accordance with the new API changes. Also created separate classes for `asc` and `desc`.
New 1.3 repo and updates to `column.R`
New DataFrame methods: `join`, `sort`, `orderBy`, `filter`, `where`
… api Conflicts: pkg/src/build.sbt
[SparkR-209] `join`, `sort`, `filter` methods for DataFrame
Jenkins, retest this please (is the fourth time lucky?)
Test build #29870 has started for PR 5096 at commit
Test FAILed.
@shivaram a few things after looking at the build code some more...
```scala
def readBoolean(in: DataInputStream): Boolean = {
  val intVal = in.readInt()
  if (intVal == 0) false else true
}
```
This can be simplified to `intVal != 0`.
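A minimal sketch of the simplification being suggested, written as a Java analogue of the Scala helper (the class name and test values here are illustrative, not from the PR): comparing the deserialized int to zero yields the boolean directly, so the explicit `if/else` is unnecessary.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class ReadBooleanSketch {
    // Reads a 4-byte big-endian int and treats any non-zero value as true,
    // mirroring the suggested `intVal != 0` form of the Scala helper.
    static boolean readBoolean(DataInputStream in) throws IOException {
        return in.readInt() != 0;
    }

    public static void main(String[] args) throws IOException {
        // 0x00000000 deserializes to 0 -> false
        System.out.println(readBoolean(new DataInputStream(
                new ByteArrayInputStream(new byte[]{0, 0, 0, 0}))));
        // 0x00000001 deserializes to 1 -> true
        System.out.println(readBoolean(new DataInputStream(
                new ByteArrayInputStream(new byte[]{0, 0, 0, 1}))));
    }
}
```

The behavior is identical to the original; the comparison expression just avoids branching on a value that is already the answer.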
SparkSubmit parts LGTM. We should merge this soon so people can start testing this well in advance of the release window.
@brennonyork The overall Jenkins build runner has a timeout of 130 minutes right now (cc @shaneknapp). So all the RAT tests, Mima checks, style checks, new dependencies plus all the unit tests have to run within 130 minutes, and this PR seems to be failing that. @shaneknapp can we increase the 130 min timeout to, say, 140 minutes?
i'll up it to 180, just so we have some headroom.
Thanks @shaneknapp! Could you re-trigger this build once it's upped?
jenkins, test this please
also, i can't believe how long this build is... sad panda etc.
Test build #29894 has started for PR 5096 at commit
Test parallelization is going to be a lot of work, but I think we could see huge speedups for the pull request builders if we didn't run all tests for every PR. Most PRs touch the higher-level libraries and not core, so it should be safe to skip most of the tests if core hasn't been modified.
Test build #29894 has finished for PR 5096 at commit
Test FAILed.
Test build #29908 has started for PR 5096 at commit
@shivaram - hey, one thing I forgot to ask: how much time do the SparkR tests add to the overall Spark tests?
@pwendell It's around 2 minutes on my laptop. Here is the output on my machine
Test build #29908 has finished for PR 5096 at commit
Test FAILed.
Test build #29919 has started for PR 5096 at commit
Test build #29919 has finished for PR 5096 at commit
Test PASSed.
Thanks @andrewor14 @pwendell for the reviews. Now that Jenkins is happy I am going to merge this in, and I'll file follow-up issues for things like YARN cluster mode which we didn't get to in this PR.
👍
This pull request integrates SparkR, an R frontend for Spark. The SparkR package contains both RDD and DataFrame APIs in R and is integrated with Spark's submission scripts to work on different cluster managers.
Some integration points that would be great to get feedback on:
- `-PsparkR` that can be used to enable SparkR builds

The SparkR package represents the work of many contributors, and attached below is a list of people along with the areas they worked on:
edwardt (@edwart) - Documentation improvements
Felix Cheung (@felixcheung) - Documentation improvements
Hossein Falaki (@falaki) - Documentation improvements
Chris Freeman (@cafreeman) - DataFrame API, Programming Guide
Todd Gao (@7c00) - R worker Internals
Ryan Hafen (@hafen) - SparkR Internals
Qian Huang (@hqzizania) - RDD API
Hao Lin (@hlin09) - RDD API, Closure cleaner
Evert Lammerts (@evertlammerts) - DataFrame API
Davies Liu (@davies) - DataFrame API, R worker internals, Merging with Spark
Yi Lu (@lythesia) - RDD API, Worker internals
Matt Massie (@massie) - Jenkins build
Harihar Nahak (@hnahak87) - SparkR examples
Oscar Olmedo (@oscaroboto) - Spark configuration
Antonio Piccolboni (@piccolbo) - SparkR examples, Namespace bug fixes
Dan Putler (@dputler) - DataFrame API, SparkR Install Guide
Ashutosh Raina (@ashutoshraina) - Build improvements
Josh Rosen (@JoshRosen) - Travis CI build
Sun Rui (@sun-rui) - RDD API, JVM Backend, Shuffle improvements
Shivaram Venkataraman (@shivaram) - RDD API, JVM Backend, Worker Internals
Zongheng Yang (@concretevitamin) - RDD API, Pipelined RDDs, Examples and EC2 guide