
[SPARK-7231] [SPARKR] Changes to make SparkR DataFrame dplyr friendly. #6005

Closed
wants to merge 3 commits

Conversation

shivaram (Contributor) commented May 8, 2015

Changes include:

  1. Rename sortDF to arrange
  2. Add new aliases `group_by`, `sample_frac`, and `summarize`
  3. Add more user-friendly column addition (`mutate`) and rename
  4. Support `mean` as an alias for `avg` in Scala, and also support `n_distinct` and `n` as in dplyr

With these changes we can run pretty much all of the examples described in http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html with the same syntax.

The only thing missing in SparkR is auto-resolving column names when they are used in an expression, i.e. making something like `select(flights, delay)` work as it does in dplyr; right now we need `select(flights, flights$delay)` or `select(flights, "delay")`. But that is a complicated change, so I'll file a new issue for it.

cc @sun-rui @rxin
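The alias-based approach in points 1 and 4 (a new dplyr-style name delegating to an existing method) can be sketched in plain Python. This is an illustrative stand-in, not SparkR code: the `DataFrame` class and `avg` function here are hypothetical toys.

```python
# Minimal sketch of the dplyr-style aliasing in this PR, using plain
# Python stand-ins (no Spark involved): each new name is simply bound
# to the existing implementation.

class DataFrame:
    def __init__(self, rows):
        self.rows = rows  # list of dicts, one per row

    def sortDF(self, col):
        return DataFrame(sorted(self.rows, key=lambda r: r[col]))

    # dplyr-style alias: arrange is sortDF under another name
    arrange = sortDF

def avg(df, col):
    vals = [r[col] for r in df.rows]
    return sum(vals) / len(vals)

# dplyr-style alias: mean delegates to avg, like the Scala change here
mean = avg

flights = DataFrame([{"delay": 10}, {"delay": 2}, {"delay": 6}])
print([r["delay"] for r in flights.arrange("delay").rows])  # [2, 6, 10]
print(mean(flights, "delay"))  # 6.0
```

Because the alias is just another binding to the same function object, both names stay in sync by construction, which is the appeal of this approach over copy-pasted wrappers.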

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

SparkQA commented May 8, 2015

Test build #32216 has started for PR 6005 at commit 0521149.

#' @alias sampleDF
setMethod("sample_frac",
          signature(x = "DataFrame", withReplacement = "logical",
                    fraction = "numeric"),
Contributor:

why wrap?

Contributor Author:

Mostly because at one point we were using an 80-character line limit for SparkR. The style hasn't been fully changed to 100 characters yet. We can do a full cleanup as part of https://issues.apache.org/jira/browse/SPARK-6813

SparkQA commented May 8, 2015

Test build #32216 timed out for PR 6005 at commit 0521149 after a configured wait of 150m.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32216/
Test FAILed.

shivaram (Contributor Author) commented May 8, 2015

Jenkins, retest this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

SparkQA commented May 8, 2015

Test build #32250 has started for PR 6005 at commit 0521149.

*
* @group agg_funcs
*/
def mean(columnName: String): Column = avg(columnName)
Contributor:

can you add it to Python as well?

Contributor Author:

So this is weird -- it already seems to exist in Python:

'mean': 'Aggregate function: returns the average of the values in a group.',

Not sure if it was just broken before.
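The quoted line is an entry from a name-to-docstring mapping, a pattern where each aggregate function is generated from the dict rather than written out by hand. Below is a hedged plain-Python sketch of that style of code generation; `_backend` and `_make_fn` are hypothetical stand-ins, not PySpark's actual internals.

```python
# Sketch of generating top-level functions from a name -> docstring
# mapping, similar in spirit to the quoted snippet. _backend is a
# hypothetical stand-in for the real evaluation engine.

def _backend(name, values):
    if name in ("mean", "avg"):  # mean and avg share one implementation
        return sum(values) / len(values)
    raise ValueError("unknown function: " + name)

_functions = {
    "avg": "Aggregate function: returns the average of the values in a group.",
    "mean": "Aggregate function: returns the average of the values in a group.",
}

def _make_fn(name, doc):
    def fn(values):
        return _backend(name, values)
    fn.__name__ = name
    fn.__doc__ = doc
    return fn

# Install one top-level function per dict entry.
globals().update({n: _make_fn(n, d) for n, d in _functions.items()})

print(mean([1, 2, 3]))  # 2.0
print(avg([2, 4]))      # 3.0
```

With this pattern, adding an alias is a one-line dict edit, which may explain why `mean` "already seemed to exist" on the Python side.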

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32255/
Test FAILed.

shivaram (Contributor Author) commented May 8, 2015

Jenkins, retest this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

SparkQA commented May 8, 2015

Test build #32259 has started for PR 6005 at commit 5e0716a.

SparkQA commented May 8, 2015

Test build #32250 has finished for PR 6005 at commit 0521149.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32250/
Test FAILed.

shivaram (Contributor Author) commented May 8, 2015

Unrelated Python test failure -- @rxin, I've seen test_count_by_value_and_window fail a couple of times before.

Jenkins, retest this please

SparkQA commented May 8, 2015

Test build #32259 has finished for PR 6005 at commit 5e0716a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class EnumUtil
    • class Binarizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class IDF(JavaEstimator, HasInputCol, HasOutputCol):
    • class IDFModel(JavaModel):
    • class Normalizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class OneHotEncoder(JavaTransformer, HasInputCol, HasOutputCol):
    • class PolynomialExpansion(JavaTransformer, HasInputCol, HasOutputCol):
    • class RegexTokenizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class StandardScaler(JavaEstimator, HasInputCol, HasOutputCol):
    • class StandardScalerModel(JavaModel):
    • class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol):
    • class StringIndexerModel(JavaModel):
    • class Tokenizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol):
    • class Word2Vec(JavaEstimator, HasStepSize, HasMaxIter, HasSeed, HasInputCol, HasOutputCol):
    • class Word2VecModel(JavaModel):
    • class HasSeed(Params):
    • class HasTol(Params):
    • class HasStepSize(Params):

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32259/
Test PASSed.

shivaram (Contributor Author) commented May 8, 2015

@rxin - I finally made Jenkins happy :) Any other comments?

also cc @cafreeman who added some of the original DF functions

@cafreeman

This looks pretty cool! And +1 on figuring out how to specify columns without quotes or `$` -- I've been thinking about that quite a bit myself as I've been using the API.

shivaram (Contributor Author) commented May 9, 2015

Thanks @cafreeman for taking a look. I opened https://issues.apache.org/jira/browse/SPARK-7499 for the column-specification issue. I investigated it a bit yesterday and found one way to do it, but I think it involves some code reorg. Feel free to take a shot at it if you get time -- we can target it for Spark 1.5.
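The column-specification gap can be illustrated in plain Python: today's two supported forms pass either a string name or a column object taken off the frame itself, so `select` never has to resolve a bare name against the frame's scope. A toy sketch (all class and function names here are hypothetical illustrations, not SparkR or Spark APIs):

```python
# Toy illustration of the two column forms supported today:
# a string name, or a column object pulled off the frame.

class Column:
    def __init__(self, name):
        self.name = name

class DataFrame:
    def __init__(self, rows):
        self.rows = rows  # list of dicts

    def __getattr__(self, name):
        # flights.delay -> Column("delay"), analogous to flights$delay in R.
        # Only called when normal attribute lookup fails, so .rows is safe.
        return Column(name)

def select(df, col):
    # Accepts select(df, "delay") or select(df, df.delay). A bare
    # select(df, delay) would need `delay` resolved against the frame
    # at call time -- the harder change deferred to a follow-up issue.
    name = col.name if isinstance(col, Column) else col
    return [row[name] for row in df.rows]

flights = DataFrame([{"delay": 4}, {"delay": 9}])
print(select(flights, "delay"))        # [4, 9]
print(select(flights, flights.delay))  # [4, 9]
```

In R, resolving the bare-name form typically means capturing the unevaluated expression and evaluating it with the frame's columns in scope, which is why it needs more invasive changes than these two forms.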

rxin (Contributor) commented May 9, 2015

lgtm

shivaram (Contributor Author) commented May 9, 2015

Merging this

asfgit pushed a commit that referenced this pull request May 9, 2015
Changes include
1. Rename sortDF to arrange
2. Add new aliases `group_by` and `sample_frac`, `summarize`
3. Add more user friendly column addition (mutate), rename
4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr

Using these changes we can pretty much run the examples as described in http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html with the same syntax

The only thing missing in SparkR is auto-resolving column names when they are used in an expression, i.e. making something like `select(flights, delay)` work as it does in dplyr; right now we need `select(flights, flights$delay)` or `select(flights, "delay")`. But that is a complicated change, so I'll file a new issue for it.

cc sun-rui rxin

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6005 from shivaram/sparkr-df-api and squashes the following commits:

5e0716a [Shivaram Venkataraman] Fix some roxygen bugs
1254953 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into sparkr-df-api
0521149 [Shivaram Venkataraman] Changes to make SparkR DataFrame dplyr friendly. Changes include 1. Rename sortDF to arrange 2. Add new aliases `group_by` and `sample_frac`, `summarize` 3. Add more user friendly column addition (mutate), rename 4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr

(cherry picked from commit 0a901dd)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
@asfgit asfgit closed this in 0a901dd May 9, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015