
[SPARK-7231] [SPARKR] Changes to make SparkR DataFrame dplyr friendly. #6005

Closed
wants to merge 3 commits

Conversation

shivaram (Contributor) commented May 8, 2015

Changes include:

  1. Rename sortDF to arrange
  2. Add new aliases `group_by`, `sample_frac`, and `summarize`
  3. Add more user-friendly column addition (`mutate`) and rename
  4. Support `mean` as an alias for `avg` in Scala, and also support `n_distinct` and `n` as in dplyr

With these changes we can run pretty much all of the examples described in http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html with the same syntax.

The only thing missing in SparkR is auto-resolving column names when they are used in an expression, i.e. making something like `select(flights, delay)` work as it does in dplyr; right now we need `select(flights, flights$delay)` or `select(flights, "delay")`. But that is a complicated change, so I'll file a new issue for it.

cc @sun-rui @rxin
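The alias-based approach in points 1 and 4 (a new dplyr-style name delegating to an existing method) can be sketched in plain Python. This is an illustrative stand-in, not SparkR code: the `DataFrame` class and `avg` function here are hypothetical toys.

```python
# Minimal sketch of the dplyr-style aliasing in this PR, using plain
# Python stand-ins (no Spark involved): each new name is simply bound
# to the existing implementation.

class DataFrame:
    def __init__(self, rows):
        self.rows = rows  # list of dicts, one per row

    def sortDF(self, col):
        return DataFrame(sorted(self.rows, key=lambda r: r[col]))

    # dplyr-style alias: arrange is sortDF under another name
    arrange = sortDF

def avg(df, col):
    vals = [r[col] for r in df.rows]
    return sum(vals) / len(vals)

# dplyr-style alias: mean delegates to avg, like the Scala change here
mean = avg

flights = DataFrame([{"delay": 10}, {"delay": 2}, {"delay": 6}])
print([r["delay"] for r in flights.arrange("delay").rows])  # [2, 6, 10]
print(mean(flights, "delay"))  # 6.0
```

Because the alias is just another binding to the same function object, both names stay in sync by construction, which is the appeal of this approach over copy-pasted wrappers.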

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

SparkQA commented May 8, 2015

Test build #32216 has started for PR 6005 at commit 0521149.

#' @alias sampleDF
setMethod("sample_frac",
          signature(x = "DataFrame", withReplacement = "logical",
                    fraction = "numeric"),
Contributor:

why wrap?

Contributor Author:

Mostly because at one point we were using an 80-character line limit for SparkR. The style hasn't been fully changed to 100 characters yet. We can do a full cleanup as part of https://issues.apache.org/jira/browse/SPARK-6813

SparkQA commented May 8, 2015

Test build #32216 timed out for PR 6005 at commit 0521149 after a configured wait of 150m.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32216/
Test FAILed.

shivaram (Contributor Author) commented May 8, 2015

Jenkins, retest this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

SparkQA commented May 8, 2015

Test build #32250 has started for PR 6005 at commit 0521149.

*
* @group agg_funcs
*/
def mean(columnName: String): Column = avg(columnName)
Contributor:

can you add it to Python as well?

Contributor Author:

So this is weird -- it already seems to exist in Python:

'mean': 'Aggregate function: returns the average of the values in a group.',

Not sure if it was just broken before.
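The quoted line is an entry from a name-to-docstring mapping, a pattern where each aggregate function is generated from the dict rather than written out by hand. Below is a hedged plain-Python sketch of that style of code generation; `_backend` and `_make_fn` are hypothetical stand-ins, not PySpark's actual internals.

```python
# Sketch of generating top-level functions from a name -> docstring
# mapping, similar in spirit to the quoted snippet. _backend is a
# hypothetical stand-in for the real evaluation engine.

def _backend(name, values):
    if name in ("mean", "avg"):  # mean and avg share one implementation
        return sum(values) / len(values)
    raise ValueError("unknown function: " + name)

_functions = {
    "avg": "Aggregate function: returns the average of the values in a group.",
    "mean": "Aggregate function: returns the average of the values in a group.",
}

def _make_fn(name, doc):
    def fn(values):
        return _backend(name, values)
    fn.__name__ = name
    fn.__doc__ = doc
    return fn

# Install one top-level function per dict entry.
globals().update({n: _make_fn(n, d) for n, d in _functions.items()})

print(mean([1, 2, 3]))  # 2.0
print(avg([2, 4]))      # 3.0
```

With this pattern, adding an alias is a one-line dict edit, which may explain why `mean` "already seemed to exist" on the Python side.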

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32255/
Test FAILed.

shivaram (Contributor Author) commented May 8, 2015

Jenkins, retest this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

SparkQA commented May 8, 2015

Test build #32259 has started for PR 6005 at commit 5e0716a.

SparkQA commented May 8, 2015

Test build #32250 has finished for PR 6005 at commit 0521149.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32250/
Test FAILed.

shivaram (Contributor Author) commented May 8, 2015

Unrelated Python test failure -- @rxin, I've seen test_count_by_value_and_window fail a couple of times before.

Jenkins, retest this please

SparkQA commented May 8, 2015

Test build #32259 has finished for PR 6005 at commit 5e0716a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class EnumUtil
    • class Binarizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class IDF(JavaEstimator, HasInputCol, HasOutputCol):
    • class IDFModel(JavaModel):
    • class Normalizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class OneHotEncoder(JavaTransformer, HasInputCol, HasOutputCol):
    • class PolynomialExpansion(JavaTransformer, HasInputCol, HasOutputCol):
    • class RegexTokenizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class StandardScaler(JavaEstimator, HasInputCol, HasOutputCol):
    • class StandardScalerModel(JavaModel):
    • class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol):
    • class StringIndexerModel(JavaModel):
    • class Tokenizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol):
    • class Word2Vec(JavaEstimator, HasStepSize, HasMaxIter, HasSeed, HasInputCol, HasOutputCol):
    • class Word2VecModel(JavaModel):
    • class HasSeed(Params):
    • class HasTol(Params):
    • class HasStepSize(Params):

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32259/
Test PASSed.

shivaram (Contributor Author) commented May 8, 2015

@rxin - I finally made Jenkins happy :) Any other comments?

also cc @cafreeman who added some of the original DF functions

@cafreeman

This looks pretty cool! And +1 on figuring out how to specify columns without quotes or `$` -- I've been thinking about that quite a bit myself as I've been using the API.

shivaram (Contributor Author) commented May 9, 2015

Thanks @cafreeman for taking a look. I opened https://issues.apache.org/jira/browse/SPARK-7499 for the column-specification issue. I investigated it a bit yesterday and found one way to do it, but I think it involves some code reorg. Feel free to take a shot at it if you get time -- we can target it for Spark 1.5.
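The column-specification gap can be illustrated in plain Python: today's two supported forms pass either a string name or a column object taken off the frame itself, so `select` never has to resolve a bare name against the frame's scope. A toy sketch (all class and function names here are hypothetical illustrations, not SparkR or Spark APIs):

```python
# Toy illustration of the two column forms supported today:
# a string name, or a column object pulled off the frame.

class Column:
    def __init__(self, name):
        self.name = name

class DataFrame:
    def __init__(self, rows):
        self.rows = rows  # list of dicts

    def __getattr__(self, name):
        # flights.delay -> Column("delay"), analogous to flights$delay in R.
        # Only called when normal attribute lookup fails, so .rows is safe.
        return Column(name)

def select(df, col):
    # Accepts select(df, "delay") or select(df, df.delay). A bare
    # select(df, delay) would need `delay` resolved against the frame
    # at call time -- the harder change deferred to a follow-up issue.
    name = col.name if isinstance(col, Column) else col
    return [row[name] for row in df.rows]

flights = DataFrame([{"delay": 4}, {"delay": 9}])
print(select(flights, "delay"))        # [4, 9]
print(select(flights, flights.delay))  # [4, 9]
```

In R, resolving the bare-name form typically means capturing the unevaluated expression and evaluating it with the frame's columns in scope, which is why it needs more invasive changes than these two forms.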

rxin (Contributor) commented May 9, 2015

lgtm

shivaram (Contributor Author) commented May 9, 2015

Merging this

asfgit pushed a commit that referenced this pull request May 9, 2015
Changes include
1. Rename sortDF to arrange
2. Add new aliases `group_by` and `sample_frac`, `summarize`
3. Add more user friendly column addition (mutate), rename
4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr

Using these changes we can pretty much run the examples as described in http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html with the same syntax

The only thing missing in SparkR is auto-resolving column names when they are used in an expression, i.e. making something like `select(flights, delay)` work as it does in dplyr; right now we need `select(flights, flights$delay)` or `select(flights, "delay")`. But that is a complicated change, so I'll file a new issue for it.

cc sun-rui rxin

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6005 from shivaram/sparkr-df-api and squashes the following commits:

5e0716a [Shivaram Venkataraman] Fix some roxygen bugs
1254953 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into sparkr-df-api
0521149 [Shivaram Venkataraman] Changes to make SparkR DataFrame dplyr friendly. Changes include 1. Rename sortDF to arrange 2. Add new aliases `group_by` and `sample_frac`, `summarize` 3. Add more user friendly column addition (mutate), rename 4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr

(cherry picked from commit 0a901dd)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
@asfgit asfgit closed this in 0a901dd May 9, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015