ARROW-3760: [R] Support Arrow CSV reader #2949
Tested this with default options against readr::read_csv and data.table::fread.
I get this:
With:
library(lobstr)
library(tibble)
n <- 5e7
tib <- tibble(x = rnorm(n), y = rnorm(n), z = 1:n + 1L, a = sample(letters, n, replace = TRUE))
lobstr::obj_size(tib)
readr::write_csv(tib, "data.csv")
system.time(
arrow::csv_read("data.csv")
)
system.time(
readr::read_csv("data.csv")
)
system.time(
data.table::fread("data.csv")
)
system.time(
read.csv("data.csv")
)
cool! I left some comments.
@pitrou may also want to have a look for the API
I'm so happy to see work moving forward on this. Thank you Romain and Wes for all the time and effort you are giving the community. Silly, possibly a bikeshed, comment / addressed elsewhere. Feel free to ignore. I can imagine this function name being one of those that drives me a bit nuts.
It's unfortunate that
cc @hadley for any thoughts on this
I think I'd have a mild preference for
Sounds fine to me
There are alternatives to loading the package:
Flagging this as WIP now because I need to rename a bunch of functions to align with the changes made in #3043
FWIW in my experience so far, https://github.com/smbache/import works like a dream. I use it mostly when writing scripts whose execution is non-interactive / expected to keep working with minimal maintenance. In package development and interactive work, it usually has felt more like a hassle than a win.
Renamed the main function to
+1. @pitrou or others may be interested to do some profiling to compare the Arrow CSV reader with the data.table and readr R libraries
The main entry point is the `csv_read()` function. All it does is create a `csv::TableReader` with the `csv_table_reader()` generic and then `$Read()` from it.

As in apache#2947 for the feather format, `csv_table_reader` is generic with the methods:

- arrow::io::InputStream: calls the TableReader actor with the other options
- character and fs_path: depending on the `mmap` option (TRUE by default) it opens the file with `mmap_open()` or `file_open()` and then calls the other method.

``` r
library(arrow)
tf <- tempfile()
readr::write_csv(iris, tf)

tab1 <- csv_read(tf)
tab1
#> arrow::Table
as_tibble(tab1)
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <chr>
#>  1          5.1         3.5          1.4         0.2 setosa
#>  2          4.9         3            1.4         0.2 setosa
#>  3          4.7         3.2          1.3         0.2 setosa
#>  4          4.6         3.1          1.5         0.2 setosa
#>  5          5           3.6          1.4         0.2 setosa
#>  6          5.4         3.9          1.7         0.4 setosa
#>  7          4.6         3.4          1.4         0.3 setosa
#>  8          5           3.4          1.5         0.2 setosa
#>  9          4.4         2.9          1.4         0.2 setosa
#> 10          4.9         3.1          1.5         0.1 setosa
#> # … with 140 more rows
```

<sup>Created on 2018-11-13 by the [reprex package](https://reprex.tidyverse.org) (v0.2.1.9000)</sup>

Author: Romain Francois <romain@purrple.cat>

Closes apache#2949 from romainfrancois/ARROW-3760/csv_reader and squashes the following commits:

951e9f5 <Romain Francois> s/csv_read/read_csv_arrow/
7770ec5 <Romain Francois> not using readr:: at this point
bb13a76 <Romain Francois> rebase
83b5162 <Romain Francois> s/file_open/ReadableFile/
959020c <Romain Francois> No need to special use mmap for file path method
6e74003 <Romain Francois> going through CharacterVector makes sure this is a character vector
2585501 <Romain Francois> line breaks for readability
0ab8397 <Romain Francois> linting
09187e6 <Romain Francois> Expose arrow::csv::TableReader, functions csv_table_reader() + csv_read()
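For readers unfamiliar with the dispatch pattern described in the commit message, here is a minimal base-R sketch of the same idea. It is not the arrow package's code: `open_source()` and `read_sketch()` are hypothetical names, a base R connection stands in for `arrow::io::InputStream`, and `utils::read.csv()` stands in for the `TableReader`. The path method only opens the file and then delegates to the stream method, which is the shape the description gives for `csv_table_reader()`.

``` r
# Hypothetical generic, standing in for csv_table_reader().
open_source <- function(x, ...) UseMethod("open_source")

# Path method: open the file, then hand the open stream to the other method.
open_source.character <- function(x, ...) {
  con <- file(x, open = "r")
  on.exit(close(con))
  open_source(con, ...)
}

# Stream method: does the actual reading (read.csv is only a stand-in here).
open_source.connection <- function(x, ...) {
  utils::read.csv(x, ...)
}

# Hypothetical entry point, standing in for csv_read(): it just calls the generic.
read_sketch <- function(x, ...) open_source(x, ...)

tf <- tempfile(fileext = ".csv")
utils::write.csv(iris, tf, row.names = FALSE)
df <- read_sketch(tf)
dim(df)
```

Keeping the file-opening logic in the path method means any additional source type only needs a method that produces a stream; the reading code itself is written once.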