Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect first n rows of a record? #182

Closed
hgriesbauer opened this issue Apr 8, 2020 · 19 comments · Fixed by #186
Closed

Collect first n rows of a record? #182

hgriesbauer opened this issue Apr 8, 2020 · 19 comments · Fixed by #186

Comments

@hgriesbauer
Copy link
Contributor

Hi - when using bcdc_query_geodata %>% collect(), is it possible to somehow collect just the first n rows of the record? This would be extremely helpful to take a quick look at the data, see how its structured, see what type of values are in there, etc... The current code returns the top 6 lines in the browser, which is really useful, however, collecting the first ~500 records would be really handy!

Something like:
bcdc_query_geodata("results-forest-cover-silviculture") %>%
collect(top_n(500))

@boshek
Copy link
Collaborator

boshek commented Apr 9, 2020

👋 @hgriesbauer

We did actually talk about this here #41. dplyr 1.0.0 is actually going to be deprecating top_n and also it isn't a generic so we should really be making a method for it. Rather dplyr is moving to slice_head and slice_tail functions: tidyverse/dplyr#4687

@ateucher One thing we could do is use slice_head and slice_tail for this. The real downside to this is that it would force a hard dependency on dplyr 1.0.0.

@ateucher
Copy link
Collaborator

ateucher commented Apr 9, 2020

I say we do a top_n method for now (will pass a LIMIT N or equivalent to the CQL I assume). Depracated means it will be around for a while yet... once dplyr 1.0 is out for a while we can switch to slice_head (will be the same as top_n right?) and add slice_tail?

@boshek
Copy link
Collaborator

boshek commented Apr 9, 2020

Sorry @ateucher - I wrote top_n is a generic when I meant it isn't a generic

@ateucher
Copy link
Collaborator

ateucher commented Apr 9, 2020

Oh well that is annoying. What about just a head method then? It is a generic and isn't going to change with dplyr. We can still add slice_*() later.

@boshek
Copy link
Collaborator

boshek commented Apr 9, 2020

Sure. And slice_head() for our purposes would just be slice_head <- head I think since we can't group_by before collect

@ateucher
Copy link
Collaborator

ateucher commented Apr 9, 2020

Yup, and when I said in the CQL it would be LIMIT n it would actually be COUNT = n right?

@boshek
Copy link
Collaborator

boshek commented Apr 16, 2020

What would the syntax be here? I see three options:

air <- bcdc_get_data('bc-airports', resource = '4d0377d9-e8a1-429b-824f-0ce8f363512c')

air %>% 
  head()

Issue with ☝️ is that collects without collect.

air %>% 
  head() %>% 
  collect()

☝️ is just weird

air %>% 
  collect(n = 10)

But what about this approach? Can you add arguments to a method? Not set on n as the argument name but maybe we could specify in the final collect call. Feels more natural than the other approaches.

@stephhazlitt
Copy link
Member

air %>% collect(head()) Can collect take a function? 🤷

@ateucher
Copy link
Collaborator

ateucher commented Apr 17, 2020

You can add arguments to a method (as long as the generic has ..., which the collect generic does).

air %>% collect(head()) would be the same as air %>% head() %>% collect()...

which is what dbplyr's head method does - it doesn't collect, just returns a tbl_lazy. So basically your second example (the weird one):

library(tidyverse)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, rownames_to_column(mtcars), "mtcars")

head_mtcars_db <- head(mtcars_db)

head_mtcars_db
#> # Source:   lazy query [?? x 12]
#> # Database: sqlite 3.30.1 [:memory:]
#>   rowname        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> 5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
#> 6 Valiant       18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
class(head_mtcars_db)
#> [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
#> [4] "tbl_lazy"             "tbl"
  
collect(head_mtcars_db)
#> # A tibble: 6 x 12
#>   rowname        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> 5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
#> 6 Valiant       18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

Created on 2020-04-17 by the reprex package (v0.3.0)

@stephhazlitt
Copy link
Member

I am not sure I think

air %>% 
  head() %>% 
  collect()

is actually weird— feels like it is consistent with the design wrt to collect()?

@ateucher
Copy link
Collaborator

I agree, I think it's the right way

@boshek
Copy link
Collaborator

boshek commented Apr 20, 2020

I do see how this is how dbplyr implements this. I, do, however think that there could be some discussion on whether this is right. By calling collect as the last step, you are actually calling head solely for side effects which is what feels weird to me. I acknowledge that it does at least return an object of the same class but I think having to call collect after head is the least intuitive of the options. I also worry about superceding a base R method with a package method. This idea itself is implemented unevenly in dbplyr:

R> mtcars_db %>% 
   ncol() 
[1] 12

@boshek
Copy link
Collaborator

boshek commented Apr 21, 2020

So we've basically landed on supporting both this:

air <- bcdc_get_data('bc-airports', resource = '4d0377d9-e8a1-429b-824f-0ce8f363512c')

air %>% 
  head()

and this:
air %>%
head() %>%
collect()


Am I right there?

1 similar comment
@boshek
Copy link
Collaborator

boshek commented Apr 21, 2020

So we've basically landed on supporting both this:

air <- bcdc_get_data('bc-airports', resource = '4d0377d9-e8a1-429b-824f-0ce8f363512c')

air %>% 
  head()

and this:
air %>%
head() %>%
collect()


Am I right there?

@stephhazlitt
Copy link
Member

So head() calls collect() in the background? Will this be the case for other methods/functions in the future?

@boshek
Copy link
Collaborator

boshek commented Apr 21, 2020

So we've basically landed on supporting both this:

air <- bcdc_get_data('bc-airports', resource = '4d0377d9-e8a1-429b-824f-0ce8f363512c')

air %>% 
  head()

and this:

air %>% 
  head() %>% 
  collect()

Am I right there?

@ateucher
Copy link
Collaborator

That is where we landed but I wonder if we're just fence-sitting? It still seems strange to me that head would be the only method that does this...

@ateucher
Copy link
Collaborator

OK, final decision: Mimic dbplyr behaviour and require calling collect() after head()

@boshek
Copy link
Collaborator

boshek commented Apr 21, 2020

And for posterity the reason for this is because head() returns a table (data.frame) just like filter, select so should require the additional step of collect. If we implemented other methods like dim or str we wouldn't require a collect because they don't return a data.frame. collect is inextricably tied to the data.frames.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants