Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add verb uncount() #558

Merged
merged 15 commits into from
Jan 19, 2021
Merged

add verb uncount() #558

merged 15 commits into from
Jan 19, 2021

Conversation

mgirlich
Copy link
Collaborator

@mgirlich mgirlich commented Dec 7, 2020

As uncount() currently isn't a generic (also see tidyverse/tidyr#1071) I named the function with dbplyr_uncount().

I see two strategies for translating:

  • use a recursive common table expression (see this stackoverflow discussion)
  • or use an unequal join with a temporary table (the strategy I chose)

The recursive CTE would require much more code and also more testing across backends. Therefore, I simply chose the unequal join. I'm not sure how big the performance difference is.

There are also two variants to create the temporary table

  • use the generate_series() function. Unfortunately, this is only available as an SQLite extension which is not included in RSQLite.
  • get the maximum value of the weights column and create a local tibble to upload.

The functionality is basically the same as the tidyr version with the small difference that the weights are not checked as strictly (i.e. no error if some weights are not non-negative integers).

Overview of the strategy:

  1. get the max weight, n_max
  2. create a temporary db table with ids from 1 to n_max
  3. duplicate data by inner joining with this temporary table under the condition that data$weight <= tmp$id

The generated SQL looks like this

SELECT `x`,
       `test`
FROM
  (SELECT `x`,
          `n`,
          `test`
   FROM `dbplyr_366` AS `LHS`
   INNER JOIN `dbplyr_367` AS `RHS` ON (`RHS`.`test` <= `LHS`.`n`))

One could simplify the SQL a little

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your summary of the approach in the PR body was super helpful; but I think it also needs to be included in the code as comments so that when looking at the code in the future it's easy to see the basic approach.

R/verb-uncount.R Outdated
@@ -0,0 +1,73 @@
#' "Uncount" a database table
#'
#' This is a method for the tidyr `uncount()` generic.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to include a comment that it will uses a temporary table

R/verb-uncount.R Outdated Show resolved Hide resolved
tests/testthat/test-verb-uncount.R Show resolved Hide resolved
@mgirlich
Copy link
Collaborator Author

The SQL snapshots don't work nicely together with the different backends. Depending on the active backend more or less temporary tables names are created, i.e. dbplyr_table_name has a different value. This is an issue for this snapshot test. For now I simply set the option manually as a workaround but it's not exactly the nicest workaround.
One could also force the backend tests to run later by changing their file name but that's not very nice either to not follow the normal conventions. Do you have a better idea?

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just let me know how you want to merge these. If we wait and merge them all after they're all complete, I'm happy to handle the merge conflicts in the NEWS file on my end.

@mgirlich
Copy link
Collaborator Author

I think prepared the other PRs to be ready to be merged. I guess it is easier and faster if you handle the merge conflicts on your side. Thanks!
By the way: great work with the new testthat. It is really nice and easy to work with the snapshots!

@hadley hadley merged commit f5dd4b2 into tidyverse:master Jan 19, 2021
@hadley hadley mentioned this pull request Jan 21, 2021
@mgirlich mgirlich deleted the add-uncount branch March 12, 2021 07:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants