-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-3439: [R] R language bindings for Feather format #2947
Conversation
Very nice. How's the performance compare for a ~1GB dataset with the current |
BTW I frequently cannot read or interpret the emojis in your commit messages, would it be ok to use normal text instead? :) |
No idea about the performance. Most of the work is done by the C++ library. I'll check that. |
This is so exciting! @romainfrancois for context on @wesm 's question, see this: wesm/feather#294 current |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK. I would expect the performance to be worse than library(feather)
because memory mapping is not being used
FYI, when I switched the Feather stuff to run off the Arrow codebase, things got a lot faster http://wesmckinney.com/blog/feather-arrow-future/ Might try to replicate these benchmarks in R to see. We're missing some optimizations in R, though, like multithreaded conversions (there's JIRAs up for this) |
library(arrow, warn.conflicts = FALSE)
library(tibble)
tib <- tibble(x = 1:10, y = rnorm(10), z = letters[1:10])
tf <- tempfile()
write_feather(tib, tf)
tab <- read_feather(tf, mmap = TRUE)
tab
#> arrow::Table
as_tibble(tab)
#> # A tibble: 10 x 3
#> x y z
#> <int> <dbl> <chr>
#> 1 1 0.253 a
#> 2 2 0.988 b
#> 3 3 1.54 c
#> 4 4 -0.684 d
#> 5 5 -0.666 e
#> 6 6 -0.763 f
#> 7 7 0.833 g
#> 8 8 2.11 h
#> 9 9 0.887 i
#> 10 10 1.41 j Created on 2018-11-12 by the reprex package (v0.2.1.9000) |
Cool, I will review again when I can, I'll be interested to see the perf numbers; it might be useful to start a directory of benchmarks where we can periodically check the performance of things |
library(lobstr)
library(tibble)
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
#> The following objects are masked from 'package:base':
#>
#> array, table
library(feather)
#>
#> Attaching package: 'feather'
#> The following objects are masked from 'package:arrow':
#>
#> read_feather, write_feather
n <- 5e7
tib <- tibble(x = rnorm(n), y = rnorm(n), z = 1:n + 1L, a = sample(letters, n, replace = TRUE))
lobstr::obj_size(tib)
#> 1,400,002,664 B
write_feather(tib, "data.feather")
fs::file_info("data.feather")
#> # A tibble: 1 x 18
#> path type size permissions modification_time user group
#> <fs::path> <fct> <fs:> <fs::perms> <dttm> <chr> <chr>
#> 1 data.feather file 1.16G rw-r--r-- 2018-11-13 15:19:50 roma… staff
#> # … with 11 more variables: device_id <dbl>, hard_links <dbl>,
#> # special_device_id <dbl>, inode <dbl>, block_size <dbl>, blocks <dbl>,
#> # flags <int>, generation <dbl>, access_time <dttm>, change_time <dttm>,
#> # birth_time <dttm>
bench::mark(check = FALSE,
feather::read_feather("data.feather"),
arrow::read_feather("data.feather"),
as_tibble(arrow::read_feather("data.feather"))
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc n_itr
#> <chr> <bch:> <bch> <bch:> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 "feather:… 5.34s 5.34s 5.34s 5.34s 0.187 1.3GB 2 1
#> 2 "arrow::r… 1.38ms 7.3ms 1.6ms 426.94ms 137. 8.72MB 3 104
#> 3 "as_tibbl… 2.53s 2.53s 2.53s 2.53s 0.395 1.3GB 0 1
#> # … with 1 more variable: total_time <bch:tm> Created on 2018-11-13 by the reprex package (v0.2.1.9000) |
so if we add data copy we get the 3rd line, still twice as fast as feather, but not that impressive. However, this is not using altrep at all yet, which it really should so that the memory of the columns of the |
Well, being twice as fast is pretty good, as in theory the libraries are doing the same thing. Can you run a benchmark where about 50% of the values are NA? |
library(lobstr)
library(tibble)
library(arrow)
library(feather)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
n <- 5e7
tib <- tibble(x = rnorm(n), y = rnorm(n), z = 1:n + 1L, a = sample(letters, n, replace = TRUE))
tib <- tib %>%
mutate_all(~{ .[sample(length(.), length(.)/2L)] <- NA ; .})
lobstr::obj_size(tib)
#> 1,400,002,720 B
summarise_all(tib, ~sum(is.na(.)) / n())
#> # A tibble: 1 x 4
#> x y z a
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.5 0.5 0.5 0.5
arrow::write_feather(tib, "data-na50.feather")
fs::file_info("data-na50.feather")
#> # A tibble: 1 x 18
#> path type size permissions modification_time user group
#> <fs::path> <fct> <fs:> <fs::perms> <dttm> <chr> <chr>
#> 1 data-na50.feather file 1.16G rw-r--r-- 2018-11-13 15:46:20 roma… staff
#> # … with 11 more variables: device_id <dbl>, hard_links <dbl>,
#> # special_device_id <dbl>, inode <dbl>, block_size <dbl>, blocks <dbl>,
#> # flags <int>, generation <dbl>, access_time <dttm>, change_time <dttm>,
#> # birth_time <dttm>
bench::mark(check = FALSE,
feather = feather::read_feather("data-na50.feather"),
# arrow::read_feather("data-na50.feather"),
arrow = as_tibble(arrow::read_feather("data-na50.feather"))
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc n_itr
#> <chr> <bch> <bch> <bch:> <bch> <dbl> <bch:byt> <dbl> <int>
#> 1 feather 4.15s 4.15s 4.15s 4.15s 0.241 1.3GB 1 1
#> 2 arrow 3.51s 3.51s 3.51s 3.51s 0.285 1.31GB 1 1
#> # … with 1 more variable: total_time <bch:tm> But at the moment all this: happens in one thread, e.g. // first copy all the data
std::copy_n(p_values, n, data.begin() + start);
if (null_count) {
// then set the sentinel NA
arrow::internal::BitmapReader bitmap_reader(array->null_bitmap()->data(),
array->offset(), n);
for (size_t i = 0; i < n; i++, bitmap_reader.Next()) {
if (bitmap_reader.IsNotSet()) {
data[i + start] = default_value<RTYPE>();
}
}
} There's room for improvement here, even with copies (no altrep). But this is for later. We focus on the features first I suppose. |
Multithreading (https://issues.apache.org/jira/browse/ARROW-3315) will help a lot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thank you!
SetDescription = function(description) ipc___feather___TableWriter__SetDescription(self, description), | ||
SetNumRows = function(num_rows) ipc___feather___TableWriter__SetNumRows(self, num_rows), | ||
Append = function(name, values) ipc___feather___TableWriter__Append(self, name, values), | ||
Finalize = function() ipc___feather___TableWriter__Finalize(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a rationale for using 3 underscores in these methods like this (instead of 1 or 2)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of:
- 3 underscores is for namespaces, so
ipc::
becomesipc___
- 2 underscores is for methods of a class, so
TableWriter.Finalize
becomesTableWriter__Finalize
- 1 underscore is sometimes used in the functions themselves
As of now, this is all generated manually, but maybe some day I'll sit down and have a way to generate these things (the R6 class, and the C++ functions) from a simple format. Something like what RcppR6 does, although IIRC we would need additional tooling for things like shared_ptr or unique_ptr, ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, is this an internal convention or an Rcpp thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not. There are no convention, and also no way (IIRC) to export a function that lives in a namespace. That's just something semi recognizable and easier to generate automatically if at some point we go that way (à la RcppR6)
#' @return an arrow::Table | ||
#' | ||
#' @export | ||
read_feather <- function(file, ...){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It occurs to me that the semantics of this function are different from feather::read_feather
(it returns an arrow::Table
instead of data.frame
), if we care
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO it would be a nice incentive for people to abandon the feather
package in favor of arrow
if we could say arrow::read_feather()
/ arrow::write_feathe()r
are drop-in replacements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the goals of arrow
is to make Array structures palpable in R, which feather didn't do. But yeah I guess we need both, i.e.
- a function that returns an
arrow::Table
, not sure how to call this one. - a function, perhaps
read_feather
if this need to look likefeather::read_feather
that returns a tibble by calling as_tibble on the result of the first one.
Right now, it makes sense to go straight to tibble, but soon enough this (or other) packages will have more tooling to manipulate and compute directly on arrow memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could add an as_tibble = TRUE
argument to read_feather
so this can be toggled on and off
Hmm, CI is failing, I just restarted the job. Where do we stand on #2770? It would be nice to be able to easily run the unit tests locally on a machine with Docker |
It's usually very benign, e.g. here it's Rd consistency https://travis-ci.org/apache/arrow/jobs/454386405#L2169 I think the rate limit problem is fixed now (@jimhester can you confirm?) so perhaps we can revert #2846 ? |
No idea what that new error is, but it's not R related. |
… heavier Exporter class
…open instead of file_open if mmap = TRUE
e92552f
to
8d708e2
Compare
Merging. CI failed this time due to rate limit
|
If you delete the cache for the R builds it should fix the rate limiting, it looks like it is still using an older version of the remotes package, which is what devtools uses under the hood. |
@romainfrancois Looks like this PR makes the R arrow bindings depend on an unreleased-version of |
Well, it wouldn't be appropriate for the R library to release without the C++ libraries also releasing (unless R starts releasing separately... I would advise against for now). So for the time being we should be developing master-to-master until the next Arrow release |
The main entry point is the `csv_read()` function, all it does is create a `csv::TableReader` with the `csv_table_reader()` generic and then `$Read()` from it. as in the #2947 for feather format, `csv_table_reader` is generic with the methods: - arrow::io::InputStream: calls the TableReader actor with the other options - character and fs_path: depending on the `mmap` option (TRUE by default) it opens the file with `mmap_open()` of `file_open()` and then calls the other method. ``` r library(arrow) tf <- tempfile() readr::write_csv(iris, tf) tab1 <- csv_read(tf) tab1 #> arrow::Table as_tibble(tab1) #> # A tibble: 150 x 5 #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> <dbl> <dbl> <dbl> <dbl> <chr> #> 1 5.1 3.5 1.4 0.2 setosa #> 2 4.9 3 1.4 0.2 setosa #> 3 4.7 3.2 1.3 0.2 setosa #> 4 4.6 3.1 1.5 0.2 setosa #> 5 5 3.6 1.4 0.2 setosa #> 6 5.4 3.9 1.7 0.4 setosa #> 7 4.6 3.4 1.4 0.3 setosa #> 8 5 3.4 1.5 0.2 setosa #> 9 4.4 2.9 1.4 0.2 setosa #> 10 4.9 3.1 1.5 0.1 setosa #> # … with 140 more rows ``` <sup>Created on 2018-11-13 by the [reprex package](https://reprex.tidyverse.org) (v0.2.1.9000)</sup> Author: Romain Francois <romain@purrple.cat> Closes #2949 from romainfrancois/ARROW-3760/csv_reader and squashes the following commits: 951e9f5 <Romain Francois> s/csv_read/read_csv_arrow/ 7770ec5 <Romain Francois> not using readr:: at this point bb13a76 <Romain Francois> rebase 83b5162 <Romain Francois> s/file_open/ReadableFile/ 959020c <Romain Francois> No need to special use mmap for file path method 6e74003 <Romain Francois> going through CharacterVector makes sure this is a character vector 2585501 <Romain Francois> line breaks for readability 0ab8397 <Romain Francois> linting 09187e6 <Romain Francois> Expose arrow::csv::TableReader, functions csv_table_reader() + csv_read()
The main entry point is the `csv_read()` function, all it does is create a `csv::TableReader` with the `csv_table_reader()` generic and then `$Read()` from it. as in the apache#2947 for feather format, `csv_table_reader` is generic with the methods: - arrow::io::InputStream: calls the TableReader actor with the other options - character and fs_path: depending on the `mmap` option (TRUE by default) it opens the file with `mmap_open()` of `file_open()` and then calls the other method. ``` r library(arrow) tf <- tempfile() readr::write_csv(iris, tf) tab1 <- csv_read(tf) tab1 #> arrow::Table as_tibble(tab1) #> # A tibble: 150 x 5 #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> <dbl> <dbl> <dbl> <dbl> <chr> #> 1 5.1 3.5 1.4 0.2 setosa #> 2 4.9 3 1.4 0.2 setosa #> 3 4.7 3.2 1.3 0.2 setosa #> 4 4.6 3.1 1.5 0.2 setosa #> 5 5 3.6 1.4 0.2 setosa #> 6 5.4 3.9 1.7 0.4 setosa #> 7 4.6 3.4 1.4 0.3 setosa #> 8 5 3.4 1.5 0.2 setosa #> 9 4.4 2.9 1.4 0.2 setosa #> 10 4.9 3.1 1.5 0.1 setosa #> # … with 140 more rows ``` <sup>Created on 2018-11-13 by the [reprex package](https://reprex.tidyverse.org) (v0.2.1.9000)</sup> Author: Romain Francois <romain@purrple.cat> Closes apache#2949 from romainfrancois/ARROW-3760/csv_reader and squashes the following commits: 951e9f5 <Romain Francois> s/csv_read/read_csv_arrow/ 7770ec5 <Romain Francois> not using readr:: at this point bb13a76 <Romain Francois> rebase 83b5162 <Romain Francois> s/file_open/ReadableFile/ 959020c <Romain Francois> No need to special use mmap for file path method 6e74003 <Romain Francois> going through CharacterVector makes sure this is a character vector 2585501 <Romain Francois> line breaks for readability 0ab8397 <Romain Francois> linting 09187e6 <Romain Francois> Expose arrow::csv::TableReader, functions csv_table_reader() + csv_read()
Created on 2018-11-12 by the reprex package (v0.2.1.9000)