read_csv() incorrectly reads character vectors if all strings begin with "Inf" (e.g. "Inform") #1319

Closed
sdevine188 opened this issue Oct 22, 2021 · 1 comment
Labels
bug an unexpected problem or unintended behavior

Comments

@sdevine188

Thank you for the excellent readr package! I think I'm seeing unexpected behavior from read_csv(), though. When reading a CSV file with a character column whose strings all begin with "Inf" (e.g. "Inform", "Information"), read_csv() reads the column as numeric Inf instead of as strings. Base read.csv() reads it correctly as strings. If the column contains at least one string that does not begin with "Inf" (e.g. "Indigo"), read_csv() reads it correctly as a character vector. read_csv() also reads it correctly if the col_types argument specifies the column as character, but that requires manual checks/edits.

It seems problematic to have to check every character column first and then manually specify col_types whenever all the strings happen to begin with "Inf". Would it be possible to update read_csv() to handle these kinds of vectors the same way read.csv() does?

Thanks very much, and apologies if I'm just missing something. (also posted on Stack Overflow: https://stackoverflow.com/questions/69680431/r-readrread-csv-incorrectly-reads-character-vectors-if-all-strings-begin-with)

suppressPackageStartupMessages(library(tidyverse))

#################################################

# save tibble with character vector containing only strings that begin with "Inf"
test_1 <- tibble(x = c("Inform", "Information"))
test_1 %>% glimpse()
#> Rows: 2
#> Columns: 1
#> $ x <chr> "Inform", "Information"
test_1 %>% write_csv(file = "test_1.csv")

# read_csv() seems to convert the strings into numeric Inf because they all begin with "Inf"
# however, if col_types is manually specified as col_character, then read_csv() correctly reads the vector as a string
read_csv(file = "test_1.csv")
#> Rows: 2 Columns: 1
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> dbl (1): x
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 x 1
#>       x
#>   <dbl>
#> 1   Inf
#> 2   Inf
read_csv(file = "test_1.csv", lazy = FALSE)
#> Rows: 2 Columns: 1
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> dbl (1): x
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 x 1
#>       x
#>   <dbl>
#> 1   Inf
#> 2   Inf
read_csv(file = "test_1.csv", col_types = cols(x = col_character()))
#> # A tibble: 2 x 1
#>   x          
#>   <chr>      
#> 1 Inform     
#> 2 Information

# read.csv() correctly reads the vector as a string
read.csv(file = "test_1.csv") %>% glimpse()
#> Rows: 2
#> Columns: 1
#> $ x <chr> "Inform", "Information"

# read_csv() correctly reads similar character vectors if they contain at least one string that does not begin with "Inf"
test_2 <- tibble(x = c("Inform", "Indigo", "Information")) %>% write_csv(file = "test_2.csv")
read_csv(file = "test_2.csv") %>% glimpse()
#> Rows: 3 Columns: 1
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr (1): x
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 3
#> Columns: 1
#> $ x <chr> "Inform", "Indigo", "Information"

#################################################

# get version info
packageVersion("tidyverse")
#> [1] '1.3.1'
version
#>                _                           
#> platform       x86_64-w64-mingw32          
#> arch           x86_64                      
#> os             mingw32                     
#> system         x86_64, mingw32             
#> status                                     
#> major          4                           
#> minor          1.1                         
#> year           2021                        
#> month          08                          
#> day            10                          
#> svn rev        80725                       
#> language       R                           
#> version.string R version 4.1.1 (2021-08-10)
#> nickname       Kick Things

Created on 2021-10-22 by the reprex package (v2.0.1)

@jimhester jimhester added the bug an unexpected problem or unintended behavior label Nov 5, 2021
@jimhester
Collaborator

Thank you for opening the issue and for supplying a reproducible example; it is a big help!

This should be fixed in the next release of vroom.
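
Until a fixed vroom is installed, the col_types workaround from the report can be generalized so that no column is guessed at all. The following is only a sketch, not something the maintainers prescribed: the temporary file and the `res` name are illustrative, and the version number in the last line is taken from the changelog quoted later in this thread, which lists the fix under vroom 1.5.6.

```r
library(readr)

# Recreate the file from the report: every value starts with "Inf".
path <- tempfile(fileext = ".csv")
writeLines(c("x", "Inform", "Information"), path)

# Skip type guessing entirely by defaulting every column to character.
res <- read_csv(path, col_types = cols(.default = col_character()))
res$x

# Check whether the installed vroom is at least the release that lists the fix.
packageVersion("vroom") >= "1.5.6"
```

Defaulting to character trades automatic type detection for predictability, so numeric columns would then need to be converted explicitly.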

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue May 1, 2022
# vroom 1.5.7

* Jenny Bryan is now the official maintainer.

* Fix uninitialized bool detected by CRAN's UBSAN check
  (tidyverse/vroom#386)

* Fix buffer overflow when trying to parse an integer field that is
  over 64 characters long
  (tidyverse/readr#1326)

* Fix subset indexing when indexes span a file boundary multiple times
  (#383)

# vroom 1.5.6

* `vroom(col_select=)` now works if `col_names = FALSE` as intended (#381)

* `vroom(n_max=)` now correctly handles cases when reading from a
  connection and the file does _not_ end with a newline
  (tidyverse/readr#1321)

* `vroom()` no longer issues a spurious warning when the parsing needs
  to be restarted due to the presence of embedded newlines
  (tidyverse/readr#1313)

* `vroom_format()` now uses the same internal multi-threaded code as
  `vroom_write()`, improving its performance in most cases (#377)

* `vroom_fwf()` no longer omits the last line if it does _not_ end
  with a newline (tidyverse/readr#1293)

* Empty files or files with only a header line and no data no longer
  cause a crash if read with multiple files
  (tidyverse/readr#1297)

* Files with a header but no contents, or an empty file if `col_names =
  FALSE` no longer cause a hang when `progress = TRUE`
  (tidyverse/readr#1297)

* Commented lines with comments at the end of lines no longer hang R
  (tidyverse/readr#1309)

* Comment lines containing unpaired quotes are no longer treated as
  unterminated quotations
  (tidyverse/readr#1307)

* Values with only an `Inf` or `NaN` prefix but additional data
  afterwards, like `Inform`, are no longer inappropriately guessed as
  doubles (tidyverse/readr#1319)

* Time types now support `%h` format to denote hour durations greater
  than 24, like readr (tidyverse/readr#1312)

* Fix performance issue when materializing subsetted vectors (#378)
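
The `Inf`/`NaN` entry above comes down to the difference between matching a prefix and matching the whole field. A minimal base R sketch of that distinction follows; `prefix_guess()` and `full_guess()` are hypothetical helpers written for illustration and are not vroom's actual implementation.

```r
# Hypothetical helpers for illustration only; not vroom's actual code.
special <- c("Inf", "-Inf", "NaN")

# Too permissive: accepts any value that merely starts with "Inf" or "NaN".
prefix_guess <- function(x) all(grepl("^-?(Inf|NaN)", x))

# Stricter: the whole field must be exactly one of the special values.
full_guess <- function(x) all(x %in% special)

vals <- c("Inform", "Information")
prefix_guess(vals)           # TRUE: the column would be guessed as double
full_guess(vals)             # FALSE: the column stays character
full_guess(c("Inf", "NaN"))  # TRUE: genuine special values still qualify
```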


# vroom 1.5.5

* `vroom()` now supports files with only carriage return newlines
  (`\r`). (#360, tidyverse/readr#1236)

* `vroom()` now parses single digit datetimes more consistently as
  readr has done (tidyverse/readr#1276)

* `vroom()` now parses `Inf` values as doubles
  (tidyverse/readr#1283)

* `vroom()` now parses `NaN` values as doubles
  (tidyverse/readr#1277)

* `VROOM_CONNECTION_SIZE` is now parsed as a double, which supports
  scientific notation (#364)

* `vroom()` now works around specifying a `\n` as the delimiter (#365,
  tidyverse/dplyr#5977)

* `vroom()` no longer crashes if given a `col_name` and `col_type`
  both less than the number of columns
  (tidyverse/readr#1271)

* `vroom()` no longer hangs if given an empty value for
  `locale(grouping_mark=)`
  (tidyverse/readr#1241)

* Fix performance regression when guessing with large numbers of rows
  (tidyverse/readr#1267)