[SPARK-20550][SPARKR] R wrapper for Dataset.alias #17825

zero323 · 2017-05-01T22:54:57Z

What changes were proposed in this pull request?

Add SparkR wrapper for Dataset.alias.
Adjust roxygen annotations for functions.alias (including example usage).

How was this patch tested?

Unit tests, check_cran.sh.

zero323 · 2017-05-01T22:57:19Z

This may require some discussion. Right now we get generic docs like this:

Does it make sense to put both in the same file? If not where should Dataset.alias go?

Names are not optimal, but I guess we'll keep it as is to avoid issues with stats::alias.

SparkQA · 2017-05-01T23:30:41Z

Test build #76365 has finished for PR 17825 at commit b7d079b.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-02T02:37:55Z

Test build #76369 has finished for PR 17825 at commit 7abe6ca.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-05-02T04:17:05Z

R/pkg/inst/tests/testthat/test_sparkSQL.R

@@ -2253,6 +2253,15 @@ test_that("mutate(), transform(), rename() and names()", {
  detach(airquality)
 })

+test_that("alias on SparkDataFrame", {
+  df <- alias(read.df(jsonPath, "json"), "table")


instead of adding a new test, could you add this to one that already deals with naming, to reuse an existing df?

because I'm trying to put together a set of tests that makes sense for CRAN
#17817

felixcheung · 2017-05-02T04:17:30Z

R/pkg/R/DataFrame.R

+#' head(select(df, column("mtcars.mpg")))
+#' head(join(df, avg_mpg, column("mtcars.cyl") == column("avg_mpg.cyl")))
+#' }
+#' @note alias since 2.3.0


then we put type in the note for each overload
https://github.com/apache/spark/blob/master/R/pkg/R/mllib_classification.R#L121

felixcheung · 2017-05-02T04:18:35Z

R/pkg/R/column.R

 #' @param data new name to use
 #'
 #' @rdname alias
 #' @name alias
 #' @aliases alias,Column-method
 #' @family colum_func
 #' @export
+#' @examples \dontrun{


think generally we put \dontrun on the next line

felixcheung

we have a few similar cases like this, and there was some discussion about them at the time.

there really isn't a single good place for it, so we elected to put the doc on the generic, and both "overloads" share the same Rd page, like many other R methods do.

i think that's ok, but we should probably be clearer in the documentation - the existing description is quite vague. we should really be clear that a new "thing" is being returned with the new name, and that the existing one is not renamed

felixcheung · 2017-05-02T04:18:51Z

R/pkg/R/DataFrame.R

+#' @aliases alias,SparkDataFrame-method
+#' @rdname alias
+#' @name alias
+#' @examples \dontrun{


zero323 · 2017-05-02T05:17:36Z

That makes sense I guess. It would be great to have more control over the layout though. One can dream, right? :)

Thank you so much for all the reviews and information.

SparkQA · 2017-05-02T05:40:39Z

Test build #76373 has finished for PR 17825 at commit d892f30.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-05-02T16:45:33Z

R/pkg/R/column.R

@@ -132,17 +132,24 @@ createMethods()

 #' alias
 #'
-#' Set a new name for a column
+#' Set a new name for an object. Equivalent to SQL "AS" keyword.


right, this is the Scala doc:
Column.alias: Gives the column an alias (which is not very concise)
Dataset.alias: Returns a new Dataset with an alias set.

I think we need to say "Set a new name to return as a new object" or similar. Actually I think we should say "Column or SparkDataFrame" in place of "object" - what do you think?

I think the SQL "AS" part is fine, but perhaps it would be clearer if we lead with "for Column, ..."?

Also, I think this doc block (description, param list specifically) should be moved to DataFrame.R or generics.R as mentioned before.

Moving to generics.R sounds good. "Column or SparkDataFrame" in place of "object" as well.

Regarding "AS"... In SQL it can be used with both expressions and tables, so I deliberately didn't qualify this with Column.

I am not sure if we really need to state that it returns a new object. Maybe something like this?

Return a new Column or SparkDataFrame with an alias. Equivalent to SQL "AS" keyword.

But it doesn't sound great.

SparkQA · 2017-05-03T01:00:41Z

Test build #76400 has finished for PR 17825 at commit 32fd836.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-05-03T16:18:48Z

R/pkg/R/DataFrame.R

+#' @aliases alias,SparkDataFrame-method
+#' @rdname alias
+#' @name alias
+#' @examples


add @family SparkDataFrame functions
I think we should probably review all these @family at one point...

In general it would be nice to sweep all the files to make them more consistent: capitalization, punctuation, examples, return and such.

zero323 · 2017-05-04T01:15:19Z

I wonder if it would make more sense to make alias generic for both object and data:

 signature(object = "SparkDataFrame", data = "character")

and skip the type checks.
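As a rough illustration of what dispatching on both arguments buys, here is a minimal sketch in plain R. The class `Frame` and the generic `withAlias` are hypothetical stand-ins, not SparkR names; they are used so the snippet runs without Spark and without touching `alias` itself:

```r
library(methods)

# Hypothetical stand-in for SparkDataFrame; it only carries a name.
setClass("Frame", representation(name = "character"))

# Hypothetical generic; both arguments take part in dispatch.
setGeneric("withAlias", function(object, data) standardGeneric("withAlias"))

# The signature itself requires `data` to be a character,
# so the method body needs no explicit type check.
setMethod("withAlias",
          signature(object = "Frame", data = "character"),
          function(object, data) {
            object@name <- data
            object
          })

renamed <- withAlias(new("Frame", name = "old"), "new")

# A non-character name matches no method, so dispatch fails
# before any method body runs.
outcome <- tryCatch(
  withAlias(new("Frame", name = "old"), 42),
  error = function(e) "no applicable method"
)
```

The trade-off is that a wrong argument type surfaces as a generic dispatch error rather than a custom message.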

SparkQA · 2017-05-04T01:51:52Z

Test build #76437 has finished for PR 17825 at commit e06544f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-05-04T04:59:59Z

you know - it would definitely be a better experience for the R user, so we should try that - it might break with the generic in stats::alias though

and speaking of, we should probably add a test for stats::alias to see that it is still callable without the stats:: prefix

zero323 · 2017-05-04T06:52:48Z

Oh right stats::alias is S3. Scratch that. Added test.

SparkQA · 2017-05-04T07:00:59Z

Test build #76444 has finished for PR 17825 at commit 60fcd8a.

This patch fails R style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-04T07:40:42Z

Test build #76445 has finished for PR 17825 at commit 875921b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-04T10:01:44Z

Test build #76451 has finished for PR 17825 at commit 09f9cca.

This patch fails R style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-04T10:58:04Z

Test build #76452 has finished for PR 17825 at commit f1c74f3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-05-05T02:27:35Z

could you close/reopen to trigger appveyor again

felixcheung · 2017-05-05T02:28:37Z

R/pkg/R/generics.R

@@ -387,6 +387,16 @@ setGeneric("value", function(bcast) { standardGeneric("value") })
 #' @export
 setGeneric("agg", function (x, ...) { standardGeneric("agg") })

+#' alias
+#'
+#' Set a new name for a Column or a SparkDataFrame. Equivalent to SQL "AS" keyword.


right - I think again we should emphasize that it returns a new SparkDataFrame

How about?

#' Return a new Column or a SparkDataFrame with a name set. Equivalent to SQL "AS" keyword.

Is the Column new?

I guess we don't say return a new Column but more generally return a Column
and in other cases we say return a new SparkDataFrame

so I guess it's a difference in wording.
I think what you propose is fine, though do you think it's confusing to say Equivalent to SQL "AS" keyword. because that makes sense only for Column and not the whole dataframe?

I still believe that AS is applicable to both. Essentially what we do is:

SELECT old_column AS new_column FROM table

and

(SELECT * FROM old_table) AS new_table --or SELECT * FROM old_table AS new_table
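In SparkR terms, the two SQL forms above correspond roughly to the following. This is only a sketch: it assumes a SparkR build that includes this patch (the @note in the diff earlier in this thread says 2.3.0) and a working local Spark installation. The data set (`mtcars`) and the alias names are illustrative:

```r
library(SparkR)
sparkR.session()

# SparkDataFrame alias, roughly: SELECT * FROM mtcars AS mtcars
df <- alias(createDataFrame(mtcars), "mtcars")
avg_mpg <- alias(agg(groupBy(df, df$cyl), avg(df$mpg)), "avg_mpg")

# Columns can then be referenced through the table alias.
head(select(df, column("mtcars.mpg")))
head(join(df, avg_mpg, column("mtcars.cyl") == column("avg_mpg.cyl")))

# Column alias, roughly: SELECT mpg AS miles_per_gallon FROM mtcars
head(select(df, alias(df$mpg, "miles_per_gallon")))

sparkR.session.stop()
```

In both cases the original `df` and `df$mpg` are left untouched; `alias` hands back a new object carrying the new name.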

felixcheung · 2017-05-05T03:57:46Z

R/pkg/R/generics.R

+#' @name alias
+#' @rdname alias
+#' @param object x a Column or a SparkDataFrame
+#' @param data new name to use


shouldn't we have a @return here? perhaps to say

Returns a new SparkDataFrame or Column with an alias set. For Column, equivalent to SQL "AS" keyword.
@return a new SparkDataFrame or Column

Wouldn't it be better to annotate the actual implementations? To get something like this:

that we did, at one point. I think the feedback was that we could have one line for the parameter (object), while the return value could have more than one line, but then which line matches which input parameter type?

To be honest I find both equally confusing, so if you think that a single annotation is better, I am happy to oblige.

that's true actually.
if you think it's useful we could always have them in separate rd.
I'm pretty sure @rdname needs to match @aliases to fix multiple link bug https://issues.apache.org/jira/browse/SPARK-18825; which means we can't have multiple functions in the same rd - each has to have its own.

On the bright side it looks like matching @rdname and @aliases like:

#' alias
#'
#' @aliases alias,SparkDataFrame-method
#' @family SparkDataFrame functions
#' @rdname alias,SparkDataFrame-method
#' @name alias
...

and

#' alias
#'
#' @aliases alias,SparkDataFrame-method
#' @family SparkDataFrame functions
#' @rdname alias,SparkDataFrame-method
#' @name alias
...

(I hope this is what you mean) indeed solves SPARK-18825. But it doesn't generate any docs for these two and makes CRAN checker unhappy:

Undocumented S4 methods:
  generic 'alias' and siglist 'Column'
  generic 'alias' and siglist 'SparkDataFrame'

Docs for generic are created but it doesn't help us here. Even if we bring @examples there we still have to deal with CRAN.

There is also my favorite, "\name must exist and be unique in Rd files", which doesn't give us much room here, does it?

I am open to suggestions, but personally I am out of ideas. I've been digging through the roxygen docs, but between CRAN, S4 requirements, roxygen limitations and our own rules there is not much room left.

sigh, sadly I think you have captured all the constraints we are working with here.

let's get the 3 lines in the same order

#' Returns a new SparkDataFrame or Column with an alias set. Equivalent to SQL "AS" keyword.
#' @param object x a Column or a SparkDataFrame
#' @return a Column or a SparkDataFrame

to

#' Returns a new SparkDataFrame or Column with an alias set. Equivalent to SQL "AS" keyword.
#' @param object x a SparkDataFrame or Column
#' @return a SparkDataFrame or a Column

SparkQA · 2017-05-05T21:34:25Z

Test build #76503 has finished for PR 17825 at commit 505561a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-05-06T05:11:08Z

R/pkg/R/DataFrame.R

+#' @family SparkDataFrame functions
+#' @rdname alias
+#' @name alias
+#' @examples


add @export

Done, but do we actually need this? We don't use roxygen to maintain NAMESPACE, and (I believe I mentioned this before) we @export objects which are not really exported. Just saying...

true, it's more for tracking it manually

SparkQA · 2017-05-06T06:17:57Z

Test build #76514 has finished for PR 17825 at commit 1f1e72b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-05-07T20:17:11Z

could you keep the description and return type in the same order in this #17825 (comment)

it's not great, but it's the best we can do

SparkQA · 2017-05-07T22:21:52Z

Test build #76552 has finished for PR 17825 at commit 2b8f288.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-05-07T23:25:23Z

merged to master.
thank you for working on this and hopefully we could really improve a lot of the things we have discussed. 👍

zero323 · 2017-05-09T11:40:19Z

Thanks @felixcheung

## What changes were proposed in this pull request?

- Add SparkR wrapper for `Dataset.alias`.
- Adjust roxygen annotations for `functions.alias` (including example usage).

## How was this patch tested?

Unit tests, `check_cran.sh`.

Author: zero323 <zero323@users.noreply.github.com>

Closes apache#17825 from zero323/SPARK-20550.

felixcheung reviewed May 2, 2017

felixcheung reviewed May 3, 2017

zero323 force-pushed the SPARK-20550 branch from e06544f to 60fcd8a Compare May 4, 2017 06:53

zero323 added 11 commits May 4, 2017 10:49

Initial implementation

944a3ec

Adjust argument annotations

5e9f8da

- Remove param annotations from dataframe.alias - Use generic annotations for column.alias

Add usage examples to column.alias

73133f9

Remove return type annotation

848eeef

Fix typo

05c0781

Move dontruns to their own lines

22d7cf6

Extend param description

22e1292

Add type annotations to since notes

6bb3d91

Attach alias test to select-with-column test case

b3c1a41

Extend description

40fedcb

Move alias documentation to generics

1e1ad44

zero323 added 3 commits May 4, 2017 10:50

Add family annotation

2d5ace2

Check that stats::alias is not masked

5fe5495

Fix style

09f9cca

zero323 force-pushed the SPARK-20550 branch from 875921b to 09f9cca Compare May 4, 2017 09:54

vim

f1c74f3

felixcheung reviewed May 5, 2017

zero323 closed this May 5, 2017

zero323 reopened this May 5, 2017

felixcheung reviewed May 5, 2017

zero323 added 2 commits May 5, 2017 20:29

Emphasize that alias returns new DataFrame

43c02bc

Add return to generic alias

505561a

felixcheung reviewed May 6, 2017

Add export

1f1e72b

Reorder annotations

2b8f288

felixcheung approved these changes May 7, 2017

asfgit closed this in 1f73d35 May 7, 2017

zero323 deleted the SPARK-20550 branch February 2, 2020 17:49
