R session crashing when reading a bunch of csv files when one of the files has no rows #1297

Closed
gorkang opened this issue Sep 7, 2021 · 10 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@gorkang

gorkang commented Sep 7, 2021

This has been hard to reduce to a shareable reprex, but this is as simple as I could get it.

I have 3 csv files. Two of them have 1 or more rows of data, and one has no rows of data (only has column names).

If trying to read all of them with read_csv(files) (possible since readr 2.0.0?), the R session crashes. If I read them with map_df(files, read_csv), all is well.

These are the files used in the example: CSV4.zip

  library(readr)
  library(purrr)
  suppressPackageStartupMessages(library(dplyr))
  suppressPackageStartupMessages(library(here))
  
  
  files_giftcards = list.files(here::here("dev/BUG/CSV4/"), full.names = TRUE)
  
  DF12 = read_csv(files_giftcards[1:2], 
                 col_types = 
                   cols(
                     .default = col_character()
                   ))
  
  DF12
#> # A tibble: 1 × 1
#>   id          
#>   <chr>       
#> 1 rwsf7qgy2hsv
  
  DF2 = read_csv(files_giftcards[2], 
                 col_types = 
                   cols(
                     .default = col_character()
                   ))
  
  DF2 
#> # A tibble: 0 × 1
#> # … with 1 variable: id <chr>
  
  DF3 = read_csv(files_giftcards[3], 
                 col_types = 
                   cols(
                     .default = col_character()
                   ))
  
  DF3 
#> # A tibble: 1 × 1
#>   id          
#>   <chr>       
#> 1 rwsf7qgy2hsv
  

# FAILS -------------------------------------------------------------------
  
  DF13 = read_csv(files_giftcards[1:3], 
                 col_types = 
                   cols(
                     .default = col_character()
                   ))
  
  # DF13 # CRASHES R
  
  
  DF13 = read_csv(files_giftcards[1:3])
#> Rows: 2 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): id
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  
  
# Using map_df makes this work ----------------------------------------------
  
  DF13_2 = map_df(files_giftcards[1:3], read_csv, 
                  col_types = 
                    cols(
                      .default = col_character()
                    ))
  
  DF13_2 # WORKS!
#> # A tibble: 2 × 1
#>   id          
#>   <chr>       
#> 1 rwsf7qgy2hsv
#> 2 rwsf7qgy2hsv

Created on 2021-09-07 by the reprex package (v2.0.1)
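For anyone without the attached zip, here is a minimal sketch that recreates a similar set of files in a temporary directory and reads them one at a time, which is the approach that worked above. The file names and contents are assumptions based on the printed output (a single `id` column, one file with a header only); they are not the original CSV4 files. The multi-file call is left commented out because it can crash the session on affected versions.

```r
library(readr)
library(dplyr)

# Hypothetical stand-ins for the attached files: two files with one data row,
# one file with only the header line.
csv_dir <- file.path(tempdir(), "csv4_sketch")
dir.create(csv_dir, showWarnings = FALSE)
writeLines(c("id", "rwsf7qgy2hsv"), file.path(csv_dir, "a.csv"))
writeLines("id", file.path(csv_dir, "b.csv"))
writeLines(c("id", "rwsf7qgy2hsv"), file.path(csv_dir, "c.csv"))

files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file separately, then stack the results. This avoids the
# multi-file code path that is reported to crash.
per_file <- lapply(
  files,
  read_csv,
  col_types = cols(.default = col_character())
)
combined <- bind_rows(per_file)
combined

# Reported to crash R with readr 2.0.1, so it is not run here:
# read_csv(files, col_types = cols(.default = col_character()))
```

The per-file loop is slower than a single multi-file call on large inputs, but it keeps each read independent of the others.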

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.1.1 (2021-08-10)
#>  os       Ubuntu 20.04.3 LTS          
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Atlantic/Canary             
#>  date     2021-09-07                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  archive       1.1.0   2021-08-05 [1] CRAN (R 4.1.0)
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
#>  backports     1.2.1   2020-12-09 [1] CRAN (R 4.1.0)
#>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.1.0)
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.1.0)
#>  cli           3.0.1   2021-07-17 [1] CRAN (R 4.1.0)
#>  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.1.0)
#>  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
#>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.1.0)
#>  dplyr       * 1.0.7   2021-06-18 [1] CRAN (R 4.1.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
#>  fansi         0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
#>  generics      0.1.0   2020-10-31 [1] CRAN (R 4.1.0)
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
#>  here        * 1.0.1   2020-12-13 [1] CRAN (R 4.1.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
#>  hms           1.1.0   2021-05-17 [1] CRAN (R 4.1.0)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.1)
#>  knitr         1.33    2021-04-24 [1] CRAN (R 4.1.0)
#>  lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.1.0)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
#>  pillar        1.6.2   2021-07-29 [1] CRAN (R 4.1.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
#>  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.1)
#>  readr       * 2.0.1   2021-08-10 [1] CRAN (R 4.1.1)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.0)
#>  rlang         0.4.11  2021-04-30 [1] CRAN (R 4.1.0)
#>  rmarkdown     2.10    2021-08-06 [1] CRAN (R 4.1.1)
#>  rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.1.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
#>  stringi       1.7.4   2021-08-25 [1] CRAN (R 4.1.1)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
#>  styler        1.5.1   2021-07-13 [1] CRAN (R 4.1.0)
#>  tibble        3.1.4   2021-08-25 [1] CRAN (R 4.1.1)
#>  tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
#>  tzdb          0.1.2   2021-07-20 [1] CRAN (R 4.1.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
#>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
#>  vroom         1.5.4   2021-08-05 [1] CRAN (R 4.1.1)
#>  withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
#>  xfun          0.25    2021-08-06 [1] CRAN (R 4.1.1)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
#> 
#> [1] /home/emrys/R/x86_64-pc-linux-gnu-library/4.1
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
@thinkelman-ESA

I just encountered this same issue. You can download the file that caused the problem for me from here. The problem code was simply readr::read_csv("Example.csv"). I encountered the problem on Windows 10, R 4.1.0, readr 2.0.1.

@jimhester jimhester added the bug an unexpected problem or unintended behavior label Sep 20, 2021
@ShinyFabio

Hi gorkang, I had a similar problem with empty files where R freezes; see #1305. I solved it by setting a number (even a very big one like 10e6) in the n_max parameter.

@gorkang
Author

gorkang commented Sep 23, 2021

Thanks ShinyFabio for the suggestion. Sadly, using the n_max parameter with 10e6 does not solve the problem.

Using the files attached in the first message, the R session crashes when doing:

  library(readr)
  suppressPackageStartupMessages(library(dplyr))
  suppressPackageStartupMessages(library(here))
  
  
  files_giftcards = list.files(here::here("dev/BUG/CSV4/"), full.names = TRUE)

  DF13 = read_csv(files_giftcards[1:3], n_max = 10e6,
                 col_types = 
                   cols(
                     .default = col_character()
                   ))
  
  DF13 # CRASHES R

Update: Tried with the dev version 2.0.1.9000 and the problem is still there.

@Adafede

Adafede commented Sep 27, 2021

Having a similar issue here, with:

Error: C stack usage 7976356 is too close to the limit

@yogat3ch

yogat3ch commented Oct 5, 2021

A single-line CSV causes readr 2.0.2 (from CRAN) to soft hang the R session too. I replicated the file using write and read it with read_csv, and that worked fine, but reading the original file as is just hangs the session. Not sure if it has something to do with line endings or what?

@dominikbach

dominikbach commented Oct 15, 2021

I encounter the same issue: R freezes when reading a file with only one column name but no entries with read_csv("emtpytestfile.csv").

This happens on Windows 10 with tidyverse 1.3.1 and R 4.0.5 but, using the same file and code, not on Mac OS 11.5.2 with tidyverse 1.3.1 and R 4.1.1.

It was solved by setting n_max to a finite number.
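A minimal sketch of that workaround, assuming a header-only file like the one described. The file name and the value `1e6` are placeholders. Per the comments above, this helped with the single-file hang but did not help with the multi-file crash.

```r
library(readr)

# Hypothetical header-only file: a column name and no data rows.
path <- file.path(tempdir(), "header_only.csv")
writeLines("id", path)

# Pass a finite n_max instead of relying on the default (Inf).
df <- read_csv(
  path,
  n_max = 1e6,
  col_types = cols(.default = col_character())
)
df
```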

@yogat3ch

Interesting, thanks for finding a workaround @dominikbach !

@eihwood

eihwood commented Oct 25, 2021

I am encountering the same issue:
OS 11.6 Big Sur
R Version 4.1.1
readr Version: 2.0.2

Reading in 18 csv files, one of which has column headers but no rows of data. R freezes and has to be restarted. This also happens with vroom version 1.5.5.
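One possible way to sidestep this when many files are involved is to drop header-only files before handing the vector to `read_csv()`. This is only a sketch: `has_data_rows()` is a hypothetical helper, not part of readr or vroom, and it assumes every file has exactly one header line. The example files are made up.

```r
library(readr)

# Hypothetical helper: TRUE if the file has at least one line after the header.
# Uses base readLines() so the check itself does not go through readr/vroom.
has_data_rows <- function(path) {
  length(readLines(path, n = 2L, warn = FALSE)) >= 2L
}

# Made-up example files: two with data, one with a header only.
csv_dir <- file.path(tempdir(), "filter_sketch")
dir.create(csv_dir, showWarnings = FALSE)
writeLines(c("id", "a1"), file.path(csv_dir, "one.csv"))
writeLines("id", file.path(csv_dir, "empty.csv"))
writeLines(c("id", "b2"), file.path(csv_dir, "two.csv"))

files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Keep only the files that contribute rows, then read them in one call.
files_with_data <- Filter(has_data_rows, files)
df <- read_csv(files_with_data, col_types = cols(.default = col_character()))
df
```

Dropping the header-only files does not change the rows in the result, but if every file is header-only the filtered vector is empty and that case needs handling separately.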

@jimhester
Collaborator

There were two separate issues here.

The first was an issue with files that use Windows line endings and contain only one line: these interacted with the vroom progress bar in a way that caused a hang in the R process.

The second issue was a crash due to invalid indexing when reading multiple files and one of the input files was empty or had only a header line.

Both issues should now be fixed in the next released version of vroom.
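Until the fixed vroom is installed everywhere, a script could check the installed version and fall back to reading files one by one. The sketch below assumes the fix is in vroom 1.5.6, which is the release whose changelog entry references this issue; `read_csv_many()` is a hypothetical wrapper, not a readr function, and the example files are made up.

```r
library(readr)
library(dplyr)

# Assumption: the multi-file crash is fixed from vroom 1.5.6 onwards.
vroom_has_fix <- utils::packageVersion("vroom") >= "1.5.6"

# Hypothetical wrapper: use the multi-file reader only when the fix is present,
# otherwise read each file separately and stack the results.
read_csv_many <- function(files, ...) {
  if (vroom_has_fix) {
    read_csv(files, ...)
  } else {
    bind_rows(lapply(files, read_csv, ...))
  }
}

# Made-up example files, including one with a header only.
csv_dir <- file.path(tempdir(), "version_guard_sketch")
dir.create(csv_dir, showWarnings = FALSE)
writeLines(c("id", "x1"), file.path(csv_dir, "a.csv"))
writeLines("id", file.path(csv_dir, "b.csv"))
writeLines(c("id", "x2"), file.path(csv_dir, "c.csv"))
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

df <- read_csv_many(files, col_types = cols(.default = col_character()))
df
```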

@yogat3ch
Copy link

Thanks for solving this @jimhester!

dfv-ms added a commit to dfv-ms/piwikproR that referenced this issue Nov 12, 2021
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue May 1, 2022
# vroom 1.5.7

* Jenny Bryan is now the official maintainer.

* Fix uninitialized bool detected by CRAN's UBSAN check
  (tidyverse/vroom#386)

* Fix buffer overflow when trying to parse an integer field that is
  over 64 characters long
  (tidyverse/readr#1326)

* Fix subset indexing when indexes span a file boundary multiple times
  (#383)

# vroom 1.5.6

* `vroom(col_select=)` now works if `col_names = FALSE` as intended (#381)

* `vroom(n_max=)` now correctly handles cases when reading from a
  connection and the file does _not_ end with a newline
  (tidyverse/readr#1321)

* `vroom()` no longer issues a spurious warning when the parsing needs
  to be restarted due to the presence of embedded newlines
  (tidyverse/readr#1313)

* `vroom_format()` now uses the same internal multi-threaded code as
  `vroom_write()`, improving its performance in most cases (#377)

* `vroom_fwf()` no longer omits the last line if it does _not_ end
  with a newline (tidyverse/readr#1293)

* Empty files or files with only a header line and no data no longer
  cause a crash if read with multiple files
  (tidyverse/readr#1297)

* Files with a header but no contents, or an empty file if
  `col_names = FALSE`, no longer cause a hang when `progress = TRUE`
  (tidyverse/readr#1297)

* Commented lines with comments at the end of lines no longer hang R
  (tidyverse/readr#1309)

* Comment lines containing unpaired quotes are no longer treated as
  unterminated quotations
  (tidyverse/readr#1307)

* Values with only an `Inf` or `NaN` prefix but additional data
  afterwards, like `Inform`, are no longer inappropriately guessed as
  doubles (tidyverse/readr#1319)

* Time types now support `%h` format to denote hour durations greater
  than 24, like readr (tidyverse/readr#1312)

* Fix performance issue when materializing subsetted vectors (#378)


# vroom 1.5.5

* `vroom()` now supports files with only carriage return newlines
  (`\r`). (#360, tidyverse/readr#1236)

* `vroom()` now parses single digit datetimes more consistently as
  readr has done (tidyverse/readr#1276)

* `vroom()` now parses `Inf` values as doubles
  (tidyverse/readr#1283)

* `vroom()` now parses `NaN` values as doubles
  (tidyverse/readr#1277)

* `VROOM_CONNECTION_SIZE` is now parsed as a double, which supports
  scientific notation (#364)

* `vroom()` now works around specifying a `\n` as the delimiter (#365,
  tidyverse/dplyr#5977)

* `vroom()` no longer crashes if given a `col_name` and `col_type`
  both less than the number of columns
  (tidyverse/readr#1271)

* `vroom()` no longer hangs if given an empty value for
  `locale(grouping_mark=)`
  (tidyverse/readr#1241)

* Fix performance regression when guessing with large numbers of rows
  (tidyverse/readr#1267)