read_* extremely slow when using `guess_max` parameter in v2.0 #1267

JoshuaSturm · 2021-08-06T14:59:01Z

Hi, readr team.

Apologies if this is the wrong place to report this bug, since it's likely a vroom issue.
Reading large files with readr 2.0.0 is extremely slow when the `guess_max` parameter is set. Omitting it from the call eliminates the performance degradation, but guessing from too few rows can cause parsing issues for large files, so I tend to keep it.
Below is a reprex with benchmarks.

I'm almost certain I used vroom in the last month or two (prior to readr 2.0) with no issues, so I think it's a recent problem.

edited to fix reprex

library(readr)
library(reprex)
library(bench)

options(
  readr.show_col_types = FALSE
)

f <- file.path(tempdir(), "tempdf.csv")

sampleData <- do.call(data.frame, replicate(100L, rep(paste0(sample(c(LETTERS, 0L:9L), size = 9L, replace = TRUE), collapse = ""), 250000L), simplify = FALSE)) |>
  write_csv(file = f)

mark(
  old  = with_edition(1, read_csv(f)),
  old2 = with_edition(1, read_csv(f, guess_max = 250000L))
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old           2.04s    2.04s     0.490     216MB    0.980
#> 2 old2          3.79s    3.79s     0.264     402MB    0.264

mark(
  new  = read_csv(f),
  new2 = read_csv(f, guess_max = 250000L)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new           1.56s    1.56s   0.640      3.68MB    0.640
#> 2 new2          5.13m    5.13m   0.00325  191.03MB    0.980

Created on 2021-08-06 by the reprex package (v2.0.1)

Session info

sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.1.0 (2021-05-18)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       America/New_York            
#>  date     2021-08-06                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version    date       lib source                       
#>  backports     1.2.1      2020-12-09 [1] CRAN (R 4.1.0)               
#>  bench       * 1.1.1      2020-01-13 [1] CRAN (R 4.1.0)               
#>  bit           4.0.4      2020-08-04 [1] CRAN (R 4.1.0)               
#>  bit64         4.0.5      2020-08-30 [1] CRAN (R 4.1.0)               
#>  cli           3.0.1      2021-07-17 [1] CRAN (R 4.1.0)               
#>  crayon        1.4.1      2021-02-08 [1] CRAN (R 4.1.0)               
#>  digest        0.6.27     2020-10-24 [1] CRAN (R 4.1.0)               
#>  ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.1.0)               
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.1.0)               
#>  fansi         0.5.0      2021-05-25 [1] CRAN (R 4.1.0)               
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.1.0)               
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.1.0)               
#>  highr         0.9        2021-04-16 [1] CRAN (R 4.1.0)               
#>  hms           1.1.0      2021-05-17 [1] CRAN (R 4.1.0)               
#>  htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 4.1.0)               
#>  knitr         1.33       2021-04-24 [1] CRAN (R 4.1.0)               
#>  lifecycle     1.0.0      2021-02-15 [1] CRAN (R 4.1.0)               
#>  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.1.0)               
#>  pillar        1.6.2      2021-07-29 [1] CRAN (R 4.1.0)               
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.1.0)               
#>  profmem       0.6.0      2020-12-13 [1] CRAN (R 4.1.0)               
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.1.0)               
#>  R6            2.5.0      2020-10-28 [1] CRAN (R 4.1.0)               
#>  readr       * 2.0.0      2021-07-20 [1] CRAN (R 4.1.0)               
#>  reprex      * 2.0.1      2021-08-05 [1] CRAN (R 4.1.0)               
#>  rlang         0.4.11     2021-04-30 [1] CRAN (R 4.1.0)               
#>  rmarkdown     2.10       2021-08-06 [1] CRAN (R 4.1.0)               
#>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.1.0)               
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.1.0)               
#>  stringi       1.7.3      2021-07-16 [1] CRAN (R 4.1.0)               
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.1.0)               
#>  styler        1.5.1.9000 2021-08-03 [1] Github (r-lib/styler@a8ec068)
#>  tibble        3.1.3      2021-07-23 [1] CRAN (R 4.1.0)               
#>  tidyselect    1.1.1      2021-04-30 [1] CRAN (R 4.1.0)               
#>  tzdb          0.1.2      2021-07-20 [1] CRAN (R 4.1.0)               
#>  utf8          1.2.2      2021-07-24 [1] CRAN (R 4.1.0)               
#>  vctrs         0.3.8      2021-04-29 [1] CRAN (R 4.1.0)               
#>  vroom         1.5.4      2021-08-05 [1] CRAN (R 4.1.0)               
#>  withr         2.4.2      2021-04-18 [1] CRAN (R 4.1.0)               
#>  xfun          0.25       2021-08-06 [1] CRAN (R 4.1.0)               
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.1.0)               
#>


jimhester · 2021-08-06T15:34:44Z

I am not sure that guessing from the entire file is a great strategy overall, since it means parsing the whole file at least twice. That said, this was clearly a major performance regression.

Thank you for opening the issue and for supplying a reproducible example; it was a big help and made tracking down the cause much more straightforward!

f <- file.path(tempdir(), "tempdf.csv")

sampleData <- do.call(data.frame, replicate(100L, rep(paste0(sample(c(LETTERS, 0L:9L), size = 9L, replace = TRUE), collapse = ""), 250000L), simplify = FALSE)) |>
  vroom::vroom_write(file = f)

bench::mark(
  new  = vroom::vroom(f),
  new2 = vroom::vroom(f, guess_max = 250000L)
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new        224.88ms 224.97ms     3.99     6.84MB    1.33 
#> 2 new2          3.78s    3.78s     0.265  191.05MB    0.265

Created on 2021-08-06 by the reprex package (v2.0.0)

The performance for this use case is still not ideal, but it should be greatly improved from the current release.
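
One way to avoid guessing from the whole file is to guess from a bounded preview and reuse the result for the full read. The following is only a minimal sketch: the file, its columns, and the 1,000-row preview size are made up for illustration, and it assumes the first rows are representative of the rest of the file.

```r
library(readr)

# Hypothetical small file standing in for the large CSV from the reprex.
path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(id = sprintf("ID%05d", 1:5000), value = seq_len(5000) / 10),
  path
)

# Guess column types from the first 1,000 rows only.
preview <- read_csv(path, n_max = 1000, show_col_types = FALSE)
guessed <- spec(preview)

# Read the whole file with that specification, so the full read
# does not need a separate guessing pass over every row.
full <- read_csv(path, col_types = guessed)
```

If the preview is not representative (for example, a column that only becomes non-numeric late in the file), the reused specification will be wrong for those rows, so it is worth checking `problems(full)` afterwards.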

JoshuaSturm · 2021-08-06T17:13:01Z

Great point - I will start to explicitly define column types when possible.
Thanks for the quick resolution!
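
For readers landing here, explicitly defining column types might look like the sketch below. The file and column names are hypothetical and chosen only for illustration; the point is that supplying `col_types` means no rows are used for type guessing.

```r
library(readr)

# Hypothetical example file; column names are illustrative only.
path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(code = c("A1B2C3", "000123", "ZZ9"), n = c(1L, 2L, 3L)),
  path
)

# Declare every column up front instead of relying on guess_max.
types <- cols(
  code = col_character(),
  n    = col_integer()
)

df <- read_csv(path, col_types = types)
```

For a file like the one in the reprex, where every column is a string, `col_types = cols(.default = col_character())` should serve the same purpose without listing all 100 columns.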

# vroom 1.5.7

* Jenny Bryan is now the official maintainer.
* Fix uninitialized bool detected by CRAN's UBSAN check (tidyverse/vroom#386)
* Fix buffer overflow when trying to parse an integer field that is over 64 characters long (tidyverse/readr#1326)
* Fix subset indexing when indexes span a file boundary multiple times (#383)

# vroom 1.5.6

* `vroom(col_select=)` now works if `col_names = FALSE` as intended (#381)
* `vroom(n_max=)` now correctly handles cases when reading from a connection and the file does _not_ end with a newline (tidyverse/readr#1321)
* `vroom()` no longer issues a spurious warning when the parsing needs to be restarted due to the presence of embedded newlines (tidyverse/readr#1313)
* Fix performance issue when materializing subsetted vectors (#378)
* `vroom_format()` now uses the same internal multi-threaded code as `vroom_write()`, improving its performance in most cases (#377)
* `vroom_fwf()` no longer omits the last line if it does _not_ end with a newline (tidyverse/readr#1293)
* Empty files or files with only a header line and no data no longer cause a crash if read with multiple files (tidyverse/readr#1297)
* Files with a header but no contents, or an empty file if `col_names = FALSE` no longer cause a hang when `progress = TRUE` (tidyverse/readr#1297)
* Commented lines with comments at the end of lines no longer hang R (tidyverse/readr#1309)
* Comment lines containing unpaired quotes are no longer treated as unterminated quotations (tidyverse/readr#1307)
* Values with only an `Inf` or `NaN` prefix but additional data afterwards, like `Inform` or no longer inappropriately guessed as doubles (tidyverse/readr#1319)
* Time types now support `%h` format to denote hour durations greater than 24, like readr (tidyverse/readr#1312)

# vroom 1.5.5

* `vroom()` now supports files with only carriage return newlines (`\r`). (#360, tidyverse/readr#1236)
* `vroom()` now parses single digit datetimes more consistently as readr has done (tidyverse/readr#1276)
* `vroom()` now parses `Inf` values as doubles (tidyverse/readr#1283)
* `vroom()` now parses `NaN` values as doubles (tidyverse/readr#1277)
* `VROOM_CONNECTION_SIZE` is now parsed as a double, which supports scientific notation (#364)
* `vroom()` now works around specifying a `\n` as the delimiter (#365, tidyverse/dplyr#5977)
* `vroom()` no longer crashes if given a `col_name` and `col_type` both less than the number of columns (tidyverse/readr#1271)
* `vroom()` no longer hangs if given an empty value for `locale(grouping_mark=)` (tidyverse/readr#1241)
* Fix performance regression when guessing with large numbers of rows (tidyverse/readr#1267)

jimhester closed this as completed in tidyverse/vroom@f6e930a Aug 6, 2021
