-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-20550][SPARKR] R wrapper for Dataset.alias #17825
Conversation
Test build #76365 has finished for PR 17825 at commit
|
Test build #76369 has finished for PR 17825 at commit
|
@@ -2253,6 +2253,15 @@ test_that("mutate(), transform(), rename() and names()", { | |||
detach(airquality) | |||
}) | |||
|
|||
test_that("alias on SparkDataFrame", { | |||
df <- alias(read.df(jsonPath, "json"), "table") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of adding a new test, add to one already naming things to reuse an existing df?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because trying to make a set of tests that makes sense for CRAN
#17817
R/pkg/R/DataFrame.R
Outdated
#' head(select(df, column("mtcars.mpg"))) | ||
#' head(join(df, avg_mpg, column("mtcars.cyl") == column("avg_mpg.cyl"))) | ||
#' } | ||
#' @note alias since 2.3.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then we put type in the note for each overload
https://github.com/apache/spark/blob/master/R/pkg/R/mllib_classification.R#L121
R/pkg/R/column.R
Outdated
#' @param data new name to use | ||
#' | ||
#' @rdname alias | ||
#' @name alias | ||
#' @aliases alias,Column-method | ||
#' @family colum_func | ||
#' @export | ||
#' @examples \dontrun{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think generally we put \dontrun on the next line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have a few similar cases like this, say this with some discussions at the time.
there really isn't a single good place for it so we elect to put the doc in generic and both "overload" share the same rd page, like many other R methods does.
i think that's ok, but we should probably be more clear on the documentation - the existing description is quite vague, we should really be clear that a new "thing" is being returned with the new name, and not renaming the existing one
R/pkg/R/DataFrame.R
Outdated
#' @aliases alias,SparkDataFrame-method | ||
#' @rdname alias | ||
#' @name alias | ||
#' @examples \dontrun{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
That makes sense I guess. It would be great to have more control over the layout though. One can dream, right? :) Thank you so much for all the reviews and information. |
Test build #76373 has finished for PR 17825 at commit
|
R/pkg/R/column.R
Outdated
@@ -132,17 +132,24 @@ createMethods() | |||
|
|||
#' alias | |||
#' | |||
#' Set a new name for a column | |||
#' Set a new name for an object. Equivalent to SQL "AS" keyword. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, this is Scala doc for Column.alias Gives the column an alias
(which is not very concise)
Dataset.alias Returns a new Dataset with an alias set.
I think we need to say Set a new name to return as a new object
or similar. Actually I think we should say "Column or SparkDataFrame" in place of "object" - what do you think?
I think the SQL "AS"
part but perhaps it will be more clear if lead with "for Column, ..."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think this doc block (description, param list specifically) should be move to DataFrame.R or generic.R as mentioned before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving to generics.R
sounds good. "Column or SparkDataFrame" in place of "object" as well.
Regarding "AS"... In SQL it can be used with both expressions and tables so I deliberately didn't quantify this with Column
.
I am not sure if we really need to state that it returns a new object. Maybe Return a new Column or SparkDataFrame with an alias. Equivalent to SQL "AS" keyword.? But it doesn't sound great.
Test build #76400 has finished for PR 17825 at commit
|
R/pkg/R/DataFrame.R
Outdated
#' @aliases alias,SparkDataFrame-method | ||
#' @rdname alias | ||
#' @name alias | ||
#' @examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add @family SparkDataFrame functions
I think we should probably review all these @family
at one point...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I general it would nice to sweep all the files to make it more consistent. Capitalization, punctuation, examples. return and such.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes!
I wonder if it would make more sense to make
and skip the type checks. |
Test build #76437 has finished for PR 17825 at commit
|
you know - it would definitely be a better experience for the R user, so we should try that - it might break with the generic in and speaking of, we should probably add a test for |
Oh right |
Test build #76444 has finished for PR 17825 at commit
|
Test build #76445 has finished for PR 17825 at commit
|
- Remove param annotations from dataframe.alias - Use generic annotations for column.alias
Test build #76451 has finished for PR 17825 at commit
|
Test build #76452 has finished for PR 17825 at commit
|
could you close/reopen to trigger appveyor again |
R/pkg/R/generics.R
Outdated
@@ -387,6 +387,16 @@ setGeneric("value", function(bcast) { standardGeneric("value") }) | |||
#' @export | |||
setGeneric("agg", function (x, ...) { standardGeneric("agg") }) | |||
|
|||
#' alias | |||
#' | |||
#' Set a new name for a Column or a SparkDataFrame. Equivalent to SQL "AS" keyword. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right - I think again we should emphasize on returning a new SparkDataFrame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about?
#' Return a new Column or a SparkDataFrame with a name set. Equivalent to SQL "AS" keyword.
Is the Column
new?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we don't say return a new Column
but more generally return a Column
and in other cases we say return a new SparkDataFrame
so I guess it's a difference in wording.
I think what you propose is fine, though do you think it's confusing to say Equivalent to SQL "AS" keyword.
because that makes sense only for Column and not the whole dataframe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still believe that AS
is applicable to both. Essentially what we do is:
SELECT old_column AS new_column FROM table
and
(SELECT * FROM old_table) AS new_table
--or
SELECT * FROM old_table AS new_table
R/pkg/R/generics.R
Outdated
#' @name alias | ||
#' @rdname alias | ||
#' @param object x a Column or a SparkDataFrame | ||
#' @param data new name to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we have a @return
here? perhaps to say
Returns a new SparkDataFrame or Column with an alias set.
For Column, equivalent to SQL "AS" keyword.
@return a new SparkDataFrame or Column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that we did, at one point. I think the feedback is we could have one line for parameter (object
) and return value could be more than one but which line matches which input parameter type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest I find both equally confusing, so if you think that a single annotation is better, I am happy to oblige.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's true actually.
if you think it's useful we could always have them in separate rd.
I'm pretty sure @rdname
needs to match @aliases
to fix multiple link bug https://issues.apache.org/jira/browse/SPARK-18825; which means we can't have multiple functions in the same rd - each has to have its own.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the bright side it looks like matching @rdname
and @aliases
like:
#' alias
#'
#' @aliases alias,SparkDataFrame-method
#' @family SparkDataFrame functions
#' @rdname alias,SparkDataFrame-method
#' @name alias
...
and
#' alias
#'
#' @aliases alias,SparkDataFrame-method
#' @family SparkDataFrame functions
#' @rdname alias,SparkDataFrame-method
#' @name alias
...
(I hope this is what you mean) indeed solves SPARK-18825. But it doesn't generate any docs for these two and makes CRAN checker unhappy:
Undocumented S4 methods:
generic 'alias' and siglist 'Column'
generic 'alias' and siglist 'SparkDataFrame'
Docs for generic are created but it doesn't help us here. Even if we bring @examples
there we still have to deal with CRAN.
Theres is also my favorite \name must exist and be unique in Rd files
which doesn't gives us much room here, does it?
I opened to suggestions, but personally I am out ideas. I've been digging trough roxygen
docs, but between CRAN, S4 requirements, roxygen
limitation and our own rules there is not much room left.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sigh, sadly I think you have captured all the constraints we are working with here.
let's get the 3 lines in the same order
#' Returns a new SparkDataFrame or Column with an alias set. Equivalent to SQL "AS" keyword.
#' @param object x a Column or a SparkDataFrame
#' @return a Column or a SparkDataFrame
to
#' Returns a new SparkDataFrame or Column with an alias set. Equivalent to SQL "AS" keyword.
#' @param object x a SparkDataFrame or Column
#' @return a SparkDataFrame or a Column
Test build #76503 has finished for PR 17825 at commit
|
R/pkg/R/DataFrame.R
Outdated
#' @family SparkDataFrame functions | ||
#' @rdname alias | ||
#' @name alias | ||
#' @examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add @export
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but do we actually need this? We don't use roxygen to maintain NAMESPACE
, and (I believe i mentioned this before) we @export
objects which are not really exported. Just saying...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true, it's more for tracking it manually
Test build #76514 has finished for PR 17825 at commit
|
could you keep the description and return type in the same order in this #17825 (comment) it's not great, but it's the best we can do |
Test build #76552 has finished for PR 17825 at commit
|
merged to master. |
Thanks @felixcheung |
## What changes were proposed in this pull request? - Add SparkR wrapper for `Dataset.alias`. - Adjust roxygen annotations for `functions.alias` (including example usage). ## How was this patch tested? Unit tests, `check_cran.sh`. Author: zero323 <zero323@users.noreply.github.com> Closes apache#17825 from zero323/SPARK-20550.
What changes were proposed in this pull request?
Dataset.alias
.functions.alias
(including example usage).How was this patch tested?
Unit tests,
check_cran.sh
.