Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rep_sample_n() fix (#279) #325

Merged
merged 14 commits into from
Jul 25, 2020
Merged

rep_sample_n() fix (#279) #325

merged 14 commits into from
Jul 25, 2020

Conversation

simonpcouch
Copy link
Collaborator

@simonpcouch simonpcouch commented Jul 25, 2020

Alright, here’s a go at a fix for 279! This approach borrows heavily from @unoe’s implementation, and relies on the slice_sample() implementation from dplyr for most argument checking and errors.

All previous unit tests pass, and I've extended the testing a bit so that we'd hopefully catch an issue like 279 earlier on if one came up again. There's also some documentation clarification. From what I can tell, rep_sample_n() never checked that the supplied probs summed to one, so I changed the docs to refer to prob as sampling weights.

This also adds a rep_slice_sample() function, as @mine-cetinkaya-rundel recommended, that is a light wrapper around rep_sample_n() that more closely resembles the interface to dplyr::slice_sample(). I only changed argument names and ordering here.🦋

library(infer)
  
df <- tibble::tibble(
  ball_id = 1:5,
  color = factor(c(rep("a", 3), rep("b", 2)))
)

set.seed(1)
rep_sample_n(df, size = 2, reps = 3)
#> # A tibble: 6 x 3
#> # Groups:   replicate [3]
#>   replicate ball_id color
#>       <int>   <int> <fct>
#> 1         1       1 a    
#> 2         1       4 b    
#> 3         2       1 a    
#> 4         2       2 a    
#> 5         3       5 b    
#> 6         3       3 a

set.seed(1)
rep_sample_n(df, size = 2, reps = 3, prob = c(.9, rep(.025, 4)))
#> # A tibble: 6 x 3
#> # Groups:   replicate [3]
#>   replicate ball_id color
#>       <int>   <int> <fct>
#> 1         1       1 a    
#> 2         1       4 b    
#> 3         2       1 a    
#> 4         2       2 a    
#> 5         3       1 a    
#> 6         3       2 a

set.seed(1)
rep_sample_n(df, size = 2, reps = 3, prob = c(.9, rep(.025, 4)))
#> # A tibble: 6 x 3
#> # Groups:   replicate [3]
#>   replicate ball_id color
#>       <int>   <int> <fct>
#> 1         1       1 a    
#> 2         1       4 b    
#> 3         2       1 a    
#> 4         2       2 a    
#> 5         3       1 a    
#> 6         3       2 a

set.seed(1)
x <- rep_sample_n(tbl = df, size = 2, reps = 3)

set.seed(1)
x2 <- rep_slice_sample(.data = df, n = 2, reps = 3)

all.equal(x, x2)
#> [1] TRUE

Created on 2020-07-24 by the reprex package (v0.3.0.9001)

It also looks like codecov is picking up on not having any testing for the noLD check (not from this PR)—not sure what unit testing for that feature would look like.

also adds a new rep_slice_sample wrapper that has a more similar interface to dplyr::slice_sample().

still need to extend unit testing and rewrite examples.
@codecov-commenter
Copy link

codecov-commenter commented Jul 25, 2020

Codecov Report

Merging #325 into develop will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #325      +/-   ##
===========================================
- Coverage    99.91%   99.91%   -0.01%     
===========================================
  Files           15       15              
  Lines         1227     1222       -5     
===========================================
- Hits          1226     1221       -5     
  Misses           1        1              
Impacted Files Coverage Δ
R/rep_sample_n.R 100.00% <100.00%> (ø)

@simonpcouch
Copy link
Collaborator Author

By the way, the setup-R step on macOS-latest (devel) has been failing ~universally on GHA the past 3-4 days—something to do with availability of the daily builds.

R/rep_sample_n.R Outdated Show resolved Hide resolved
@ismayc
Copy link
Collaborator

ismayc commented Jul 25, 2020

With regard to getting this back to 100% coverage, don't you just need to set capabilities("long.double") to get this to check? Or is that what you are having trouble setting?

document argument aliases in the same param rather than duplicating descriptions
Copy link
Collaborator

@echasnovski echasnovski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New implementation uses the most recent features of {dplyr} 1.0.0: slice_sample() and .before in mutate(). This is good in every day usage, but not necessarily so in package development, as {infer} will not work with version of {dplyr} less than 1.0.0. I believe this is not so big of a problem, as most of {infer} users will have latest {dplyr} version, but it still seems to be a good idea to not rely on very recent {dplyr} features. Nevertheless, thank you @unoe for your help! It is much appreciated.

It seems that removing these lines, as per @rudeboybert's original doubts, should have been enough. But still the work of redocumenting and adding tests is very useful. Thanks, @simonpcouch!

I'll add some changes and request re-review.

Update: @simonpcouch, could you, please, take a look at my changes?

@echasnovski echasnovski requested a review from ismayc July 25, 2020 11:04
@unoe
Copy link

unoe commented Jul 25, 2020

I completely understand your points and agree, @echasnovski, and thanks for all your work and help @simonpcouch.
At the beginning I considered simply using sample_n() rather than slice_sample(), which would remove the dependency on the newest {dplyr} version and make the implementation accessible to users of many {dplyr} versions.
However, this would have meant that rep_sample_n() would have needed to be updated again at some point since sample_n() is being retired in {dplyr}.

@simonpcouch
Copy link
Collaborator Author

Looks great! Thanks for the changes, @echasnovski.

Since @ismayc has given the thumbs up on both this PR and #326, I'm going to go ahead and merge #326 here and then this PR to develop. :-)

@simonpcouch simonpcouch merged commit acdc676 into develop Jul 25, 2020
@simonpcouch simonpcouch deleted the rep-sample-n branch July 25, 2020 17:25
@github-actions
Copy link

github-actions bot commented Mar 8, 2021

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants