read_csv() incorrectly reads character vectors if all strings begin with "Inf" (e.g. "Inform") #1319

sdevine188 · 2021-10-22T17:32:27Z

Thank you for the excellent readr package! I think I'm seeing unexpected behavior from read_csv(), though. When reading a CSV whose character column contains only strings that begin with "Inf" (e.g. "Inform", "Information"), read_csv() guesses the column as double and returns numeric Inf instead of the original strings. Base read.csv() reads the same column as character. If the column contains at least one string that does not begin with "Inf" (e.g. "Indigo"), read_csv() also reads it as character. read_csv() likewise reads the column correctly when the col_types argument specifies it as character, but that requires manual checks and edits.

It seems problematic to have to check every character column first and then manually specify col_types whenever all of its strings happen to begin with "Inf". Would it be possible to update read_csv() to handle these kinds of columns the same way read.csv() does?

Thanks very much, and apologies if I'm just missing something. (also posted on Stack Overflow: https://stackoverflow.com/questions/69680431/r-readrread-csv-incorrectly-reads-character-vectors-if-all-strings-begin-with)

suppressPackageStartupMessages(library(tidyverse))

#################################################

# save tibble with character vector containing only strings that begin with "Inf"
test_1 <- tibble(x = c("Inform", "Information"))
test_1 %>% glimpse()
#> Rows: 2
#> Columns: 1
#> $ x <chr> "Inform", "Information"
test_1 %>% write_csv(file = "test_1.csv")

# read_csv() seems to convert the strings into numeric Inf because they all begin with "Inf"
# however, if col_types is manually specified as col_character, then read_csv() correctly reads the vector as a string
read_csv(file = "test_1.csv")
#> Rows: 2 Columns: 1
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> dbl (1): x
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 x 1
#>       x
#>   <dbl>
#> 1   Inf
#> 2   Inf
read_csv(file = "test_1.csv", lazy = FALSE)
#> Rows: 2 Columns: 1
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> dbl (1): x
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 x 1
#>       x
#>   <dbl>
#> 1   Inf
#> 2   Inf
read_csv(file = "test_1.csv", col_types = cols(x = col_character()))
#> # A tibble: 2 x 1
#>   x          
#>   <chr>      
#> 1 Inform     
#> 2 Information

# read.csv() correctly reads the vector as a string
read.csv(file = "test_1.csv") %>% glimpse()
#> Rows: 2
#> Columns: 1
#> $ x <chr> "Inform", "Information"

# read_csv() correctly reads similar character vectors if they contain at least one string that does not begin with "Inf"
test_2 <- tibble(x = c("Inform", "Indigo", "Information")) %>% write_csv(file = "test_2.csv")
read_csv(file = "test_2.csv") %>% glimpse()
#> Rows: 3 Columns: 1
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr (1): x
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 3
#> Columns: 1
#> $ x <chr> "Inform", "Indigo", "Information"

#################################################

# get version info
packageVersion("tidyverse")
#> [1] '1.3.1'
version
#>                _                           
#> platform       x86_64-w64-mingw32          
#> arch           x86_64                      
#> os             mingw32                     
#> system         x86_64, mingw32             
#> status                                     
#> major          4                           
#> minor          1.1                         
#> year           2021                        
#> month          08                          
#> day            10                          
#> svn rev        80725                       
#> language       R                           
#> version.string R version 4.1.1 (2021-08-10)
#> nickname       Kick Things

Created on 2021-10-22 by the reprex package (v2.0.1)
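For anyone who wants to avoid per-column checks, one possible workaround is to turn guessing off entirely and read every column as character, then convert the columns known to be numeric. This is only a minimal sketch: the inline data and the `y` column are made up for illustration, and passing literal data through `I()` assumes readr 2.0.0 or later.

```r
library(readr)

# Hypothetical inline data: x holds "Inf"-prefixed strings, y holds numbers.
csv_text <- I("x,y\nInform,1\nInformation,2\n")

# Read every column as character so no type guessing takes place.
all_chr <- read_csv(csv_text, col_types = cols(.default = col_character()))

# Convert only the columns that are known to be numeric.
all_chr$y <- parse_double(all_chr$y)
```

The trade-off is that every non-character column then has to be converted by hand, so this mostly suits files whose schema is already known.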

jimhester · 2021-11-05T14:50:34Z

Thank you for opening the issue and for supplying a reproducible example, it is a big help!

This should be fixed in the next release of vroom.
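The vroom changelog in this thread lists the fix under vroom 1.5.6, so a quick check of the installed version can tell whether the guessing fix should be present. The helper below is a hypothetical convenience, not part of readr or vroom; it only compares version numbers.

```r
# Version that the changelog associates with the readr#1319 fix.
fixed_in <- package_version("1.5.6")

# Hypothetical helper: TRUE when the given vroom version is at least 1.5.6.
# The default argument is evaluated lazily, so vroom only needs to be
# installed when the function is called without an explicit version.
vroom_has_fix <- function(installed = utils::packageVersion("vroom")) {
  installed >= fixed_in
}
```

Calling `vroom_has_fix()` with no arguments checks the locally installed vroom; if it returns `FALSE`, updating vroom should pick up the change.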

# vroom 1.5.7

* Jenny Bryan is now the official maintainer.
* Fix uninitialized bool detected by CRAN's UBSAN check (tidyverse/vroom#386)
* Fix buffer overflow when trying to parse an integer field that is over 64 characters long (tidyverse/readr#1326)
* Fix subset indexing when indexes span a file boundary multiple times (#383)

# vroom 1.5.6

* `vroom(col_select=)` now works if `col_names = FALSE` as intended (#381)
* `vroom(n_max=)` now correctly handles cases when reading from a connection and the file does _not_ end with a newline (tidyverse/readr#1321)
* `vroom()` no longer issues a spurious warning when the parsing needs to be restarted due to the presence of embedded newlines (tidyverse/readr#1313)
* Fix performance issue when materializing subsetted vectors (#378)
* `vroom_format()` now uses the same internal multi-threaded code as `vroom_write()`, improving its performance in most cases (#377)
* `vroom_fwf()` no longer omits the last line if it does _not_ end with a newline (tidyverse/readr#1293)
* Empty files or files with only a header line and no data no longer cause a crash if read with multiple files (tidyverse/readr#1297)
* Files with a header but no contents, or an empty file if `col_names = FALSE`, no longer cause a hang when `progress = TRUE` (tidyverse/readr#1297)
* Commented lines with comments at the end of lines no longer hang R (tidyverse/readr#1309)
* Comment lines containing unpaired quotes are no longer treated as unterminated quotations (tidyverse/readr#1307)
* Values with only an `Inf` or `NaN` prefix but additional data afterwards, like `Inform`, are no longer inappropriately guessed as doubles (tidyverse/readr#1319)
* Time types now support `%h` format to denote hour durations greater than 24, like readr (tidyverse/readr#1312)

# vroom 1.5.5

* `vroom()` now supports files with only carriage return newlines (`\r`). (#360, tidyverse/readr#1236)
* `vroom()` now parses single digit datetimes more consistently as readr has done (tidyverse/readr#1276)
* `vroom()` now parses `Inf` values as doubles (tidyverse/readr#1283)
* `vroom()` now parses `NaN` values as doubles (tidyverse/readr#1277)
* `VROOM_CONNECTION_SIZE` is now parsed as a double, which supports scientific notation (#364)
* `vroom()` now works around specifying a `\n` as the delimiter (#365, tidyverse/dplyr#5977)
* `vroom()` no longer crashes if given a `col_name` and `col_type` both less than the number of columns (tidyverse/readr#1271)
* `vroom()` no longer hangs if given an empty value for `locale(grouping_mark=)` (tidyverse/readr#1241)
* Fix performance regression when guessing with large numbers of rows (tidyverse/readr#1267)
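The difference between a prefix match and a whole-field match is what separates the buggy behavior from the fixed one. The sketch below illustrates that rule in base R; it is not vroom's actual implementation, both helper names are hypothetical, and the exact set of tokens vroom accepts as special doubles is an assumption here.

```r
# Hypothetical helper: TRUE only when the entire field is a special double
# token, rather than merely starting with one. The accepted spellings are
# an assumption made for this illustration.
is_special_double <- function(x) {
  grepl("^[+-]?(inf|infinity|nan)$", trimws(x), ignore.case = TRUE)
}

# Hypothetical column-level guess: a column is treated as double only when
# every field matches in full; anything else falls back to character.
guess_special_column <- function(x) {
  if (all(is_special_double(x))) "double" else "character"
}

guess_special_column(c("Inform", "Information"))
guess_special_column(c("Inf", "-Inf", "NaN"))
```

Anchoring the pattern with `^` and `$` is what keeps "Inform" and "Information" from being mistaken for `Inf`.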

jimhester added the bug (an unexpected problem or unintended behavior) label Nov 5, 2021

jimhester closed this as completed in tidyverse/vroom@7fc1a1d Nov 5, 2021
