ARROW-3760: [R] Support Arrow CSV reader #2949

Closed

romainfrancois
Contributor

The main entry point is the csv_read() function. All it does is create a csv::TableReader with the csv_table_reader() generic and then call $Read() on it.

As in #2947 for the Feather format, csv_table_reader() is a generic with the following methods:

  • arrow::io::InputStream: calls the TableReader constructor with the other options
  • character and fs_path: depending on the mmap option (TRUE by default), it opens the file with mmap_open() or file_open() and then calls the other method.
library(arrow)
tf <- tempfile()
readr::write_csv(iris, tf)

tab1 <- csv_read(tf)
tab1
#> arrow::Table
as_tibble(tab1)
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows

Created on 2018-11-13 by the reprex package (v0.2.1.9000)
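
The two-method dispatch described above can be sketched with a plain S3 generic in base R. This is only an illustration of the pattern: toy_table_reader(), open_source() and the toy_input_stream / toy_reader classes are hypothetical stand-ins, not the arrow package's API, and no file is actually opened.

```r
# Hypothetical stand-in for mmap_open() / file_open(): it only records
# the path and whether memory mapping was requested.
open_source <- function(path, mmap = TRUE) {
  structure(list(path = path, mmap = mmap), class = "toy_input_stream")
}

# Hypothetical generic playing the role of csv_table_reader().
toy_table_reader <- function(file, ...) UseMethod("toy_table_reader")

# Stream method: the place where a real reader would be constructed.
toy_table_reader.toy_input_stream <- function(file, ...) {
  structure(list(stream = file, options = list(...)), class = "toy_reader")
}

# Path method: open the source, then delegate to the stream method.
toy_table_reader.character <- function(file, mmap = TRUE, ...) {
  toy_table_reader(open_source(file, mmap = mmap), ...)
}

reader <- toy_table_reader("data.csv", mmap = FALSE)
```

The path method does no reading of its own; it only decides how to open the source and hands the resulting stream to the other method, which keeps the reader construction in one place.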

@romainfrancois
Contributor Author

romainfrancois commented Nov 13, 2018

Tested this with default options against readr::read_csv and data.table::fread.

  • readr::read_csv and data.table::fread read into a data frame
  • arrow::csv_read reads into an arrow::Table

I get this:

romain@purrplex /tmp $ Rscript csv-write.R
1,400,002,600 B

romain@purrplex /tmp $ Rscript csv-read-readr.R
Parsed with column specification:
cols(
  x = col_double(),
  y = col_double(),
  z = col_integer(),
  a = col_character()
)
   user  system elapsed
 30.440   1.688  32.150

romain@purrplex /tmp $ Rscript csv-read-datatable.R
   user  system elapsed
 11.794   1.251   2.028

romain@purrplex /tmp $ Rscript csv-read-arrow.R
   user  system elapsed
 30.018   5.392   5.383

romain@purrplex /tmp $ Rscript csv-read-base.R
    user   system  elapsed
 886.716 1355.421 2795.491

With:

  • csv-write.R:
library(lobstr)
library(tibble)

n <- 5e7
tib <- tibble(x = rnorm(n), y = rnorm(n), z = 1:n + 1L, a = sample(letters, n, replace = TRUE))
lobstr::obj_size(tib)

readr::write_csv(tib, "data.csv")
  • csv-read-arrow.R:
system.time(
  arrow::csv_read("data.csv")
)
  • csv-read-readr.R:
system.time(
  readr::read_csv("data.csv")
)
  • csv-read-datatable.R:
system.time(
  data.table::fread("data.csv")
)
  • csv-read-base.R:
system.time(
  read.csv("data.csv")
)
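
To make the comparison easier to read, the elapsed times printed above can be expressed relative to the arrow run. This is just arithmetic on the numbers copied from the output above; the variable names are illustrative, and the readers return different objects (data frame vs. arrow::Table), so the ratios are only a rough guide.

```r
# Elapsed seconds copied from the system.time() output above.
elapsed <- c(
  readr      = 32.150,
  data.table = 2.028,
  arrow      = 5.383,
  base       = 2795.491
)

# Ratio of each reader's elapsed time to the arrow run
# (values above 1 mean slower than arrow, below 1 mean faster).
ratios <- round(elapsed / elapsed[["arrow"]], 2)
print(ratios)
```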

Member

@wesm wesm left a comment

cool! I left some comments.

@pitrou may also want to have a look for the API

@romainfrancois romainfrancois force-pushed the ARROW-3760/csv_reader branch 5 times, most recently from e8d6cc0 to c70052e Compare November 19, 2018 11:18
@russellpierce

I'm so happy to see work moving forward on this. Thank you Romain and Wes for all the time and effort you are giving the community.

A silly, possibly bikeshedding comment that may already be addressed elsewhere. Feel free to ignore.

I can imagine this function name being one of those that drives me a bit nuts. csv_read means almost the same thing as read.csv and read_csv. I'm sensitive to masking issues, but do we need another confusable function name? People can fully qualify as arrow::read_csv() or if we want to give in to the feeling that most folks do global imports in R we could do something like arrow_read_csv()?

@wesm
Member

wesm commented Nov 28, 2018

It's unfortunate that library(X) in R is the moral equivalent of from X import * in Python (which is generally discouraged). Probably arrow::read_csv is best for consistency, and we will have to live with the readr name conflict

@wesm
Member

wesm commented Nov 28, 2018

cc @hadley for any thoughts on this

@hadley
Contributor

hadley commented Nov 28, 2018

I think I'd have a mild preference for read_csv_arrow() (I think a suffix makes more sense because then it will appear in autocomplete after typing read_csv()).

@wesm
Member

wesm commented Nov 28, 2018

Sounds fine to me

@romainfrancois romainfrancois force-pushed the ARROW-3760/csv_reader branch 2 times, most recently from d2f2b70 to b8c2b5e Compare December 4, 2018 20:25
@romainfrancois romainfrancois added the WIP PR is work in progress label Dec 4, 2018
@romainfrancois
Contributor Author

There are alternatives to loading the package:

  • when you are creating a package, you can selectively import a subset of functions.
  • the import package (https://github.com/smbache/import): I haven't tried it myself, but IIUC it gives something close to the typical Python workflow.
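
As a rough illustration of using functions without attaching a whole package, here is a minimal base R sketch. It uses utils::read.csv() purely as a stand-in so that it runs without extra packages; read_one is a hypothetical local name. The commented lines show the package-level directive and the import package call for reference and are not executed here.

```r
# Write a tiny CSV to a temporary file using base R only.
tf <- tempfile(fileext = ".csv")
utils::write.csv(head(iris), tf, row.names = FALSE)

# Option 1: fully qualify the call; nothing is attached to the search path.
df1 <- utils::read.csv(tf)

# Option 2: bind a single function to a local name and call that.
read_one <- utils::read.csv
df2 <- read_one(tf)

# In a package, the selective import goes in the NAMESPACE file, e.g.
#   importFrom(utils, read.csv)
# With the import package, the equivalent in a script would be
#   import::from(utils, read.csv)
```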

@romainfrancois
Contributor Author

Flagging this as WIP now because I need to rename a bunch of functions to align with the changes made in #3043

@russellpierce

FWIW in my experience so far, https://github.com/smbache/import works like a dream. I use it mostly when writing scripts whose execution is non-interactive / expected to keep working with minimal maintenance. In package development and interactive work, it usually has felt more like a hassle than a win.

@romainfrancois romainfrancois force-pushed the ARROW-3760/csv_reader branch 2 times, most recently from 480210b to d41723c Compare December 10, 2018 18:12
@romainfrancois romainfrancois added ready-for-review and removed WIP PR is work in progress labels Jan 2, 2019
@romainfrancois
Contributor Author

Renamed the main function to read_csv_arrow()

Member

@wesm wesm left a comment

+1. @pitrou or others may be interested in doing some profiling to compare the Arrow CSV reader with the data.table and readr R libraries.

@wesm wesm closed this in fba4f32 Jan 4, 2019
@romainfrancois romainfrancois deleted the ARROW-3760/csv_reader branch January 7, 2019 15:06
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Jan 10, 2019

Author: Romain Francois <romain@purrple.cat>

Closes apache#2949 from romainfrancois/ARROW-3760/csv_reader and squashes the following commits:

951e9f5 <Romain Francois> s/csv_read/read_csv_arrow/
7770ec5 <Romain Francois> not using readr:: at this point
bb13a76 <Romain Francois> rebase
83b5162 <Romain Francois> s/file_open/ReadableFile/
959020c <Romain Francois> No need to special use mmap for file path method
6e74003 <Romain Francois> going through CharacterVector makes sure this is a character vector
2585501 <Romain Francois> line breaks for readability
0ab8397 <Romain Francois> linting
09187e6 <Romain Francois> Expose arrow::csv::TableReader, functions csv_table_reader() + csv_read()