skip bug in read_lines and read_fwf if unpaired quotes in the data #991

juangomezduaso · 2019-04-15T11:26:17Z

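Functions read_lines() and read_fwf() do not handle the skip parameter correctly when the data contain unpaired double quotes ("): a newline that falls between two quotes is not counted as a line break while skipping, so more lines are skipped than requested.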

library(readr)
#> Warning: package 'readr' was built under R version 3.5.3
data <-
"a\"b
cde
f\"g
hij"
read_fwf(data, fwf_widths(1:2, LETTERS[1:2]), skip=1)
#> # A tibble: 1 x 2
#>   A     B    
#>   <chr> <chr>
#> 1 h     ij
read_lines(data, skip=1)
#> [1] "hij"

Created on 2019-04-15 by the reprex package (v0.2.1)
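
For comparison, here is a minimal base R sketch of the result one would expect from skip = 1 on the same string. It assumes every newline is a line separator and quotes are ordinary characters; the name `expected` is only for illustration.

```r
# Same literal data as in the reprex above, written with explicit newlines.
data <- "a\"b\ncde\nf\"g\nhij"

# Split on every newline, ignoring quotes, then drop the first line (skip = 1).
all_lines <- strsplit(data, "\n", fixed = TRUE)[[1]]
expected <- all_lines[-1]

print(expected)
```

Under that assumption three lines remain after skipping one, rather than the single line "hij" returned above.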

juangomezduaso · 2019-04-15T11:44:43Z

This might be the cause of issue #986

dan-reznik · 2019-04-15T12:08:09Z

ouch, that makes sense. we do agree that for consistency read_lines() should not care about any characters (quotes or otherwise) present in the input stream, except CR-LF, correct? note: those "quote" field separators are relevant for field-oriented readers (read_csv, read_delim), via the "quote" parameter, but shouldn't be for read_lines() or read_file(). the authors probably inadvertently reused some code, hopefully they will be kind enough to review this in time, as read_lines() is a major workhorse for most people trying to investigate structural problems with files.

juangomezduaso · 2019-04-15T12:14:07Z

Yes. And I think the authors will agree as well, because this only happens with skip; when reading all lines there is no problem, and these "enquoted" newlines are treated as line separators.

dan-reznik · 2019-04-15T12:21:49Z

i guess one workaround is to simply substitute all quotes for some other unusual char in a file prior to pushing it into read_lines():

library(readr)
library(stringr)

# replace every double quote before read_lines() sees it
read_file("file.txt") %>%
   str_replace_all(fixed('"'),"^") %>%
   read_lines(skip=xxx, n_max=yyy) %>%
   ...

however this defeats the main benefit of read_lines() w/ a skip and an n_max, namely that the entire file (potentially large) does not have to be brought into memory. if loading the entire file were acceptable, read_lines() could simply become this:

library(readr)
library(stringr)
library(dplyr)   # for first()

read_file("file.txt") %>%
   str_split("\\r?\\n") %>%   # split on LF as well as CR-LF
   first %>%
   tail(-skip) %>%
   head(n_max) %>%
   ...
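
If the goal is only to skip lines without loading the whole file, one option that avoids readr entirely is an open base R connection, since readLines() does no quote handling. This is a minimal sketch: the helper name `read_lines_raw_skip` is hypothetical and not part of readr, and the temporary file just mirrors the data from the original report.

```r
# Hypothetical helper (not part of readr): skip `skip` lines, then read up to
# `n_max` lines, treating quotes as ordinary characters.
read_lines_raw_skip <- function(path, skip = 0L, n_max = -1L) {
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)
  if (skip > 0L) {
    # Read and discard the skipped lines; the open connection keeps its position.
    invisible(readLines(con, n = skip, warn = FALSE))
  }
  # n = -1 reads to the end of the file.
  readLines(con, n = n_max, warn = FALSE)
}

# Small demonstration file with unpaired quotes on lines 1 and 3.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\"b", "cde", "f\"g", "hij"), tmp)

result <- read_lines_raw_skip(tmp, skip = 1, n_max = 2)
print(result)
```

Only the skipped lines and the requested lines are ever held in memory. For a very large skip, the discard step could be done in chunks inside a loop instead of one readLines() call.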

cheers

juangomezduaso · 2019-04-15T12:30:19Z

Regards

joachim-gassen · 2020-04-20T07:58:59Z

Hi there. Came here to file a new issue as this has bit me quite badly when working on a huge data import project using either vroom() or read_*. The issue is also present when using, e.g., read_tsv() with quote = "". See below.

library(readr)

# I am on 1.3.1 - same behavior with current Github version and on Mac as well as Ubuntu 18.04

dta <- c(
  "a\tb\tc",
  "row 1 a\trow 1 b\trow 1 c",
  "row 2 a\trow 2 b\trow 2 c\"",
  "row 3 a\trow 3 b\trow 3 c\"",
  "row 4 a\trow 4 b\trow 4 c"
)

tmp_file <- tempfile("repex", fileext = ".tsv")
writeLines(dta, tmp_file)

# all good
read_tsv(tmp_file, quote = "")

# should start reading in data row 2 and it does
read_tsv(tmp_file, col_names = c("a", "b", "c"), skip = 2, quote = "")

# should start reading in data row 3 but it reads row 4 instead
read_tsv(tmp_file, col_names = c("a", "b", "c"), skip = 3, quote = "")

# Same for read_lines
read_lines(tmp_file, skip = 3)

# ...and for vroom
library(vroom)
vroom(tmp_file, col_names = c("a", "b", "c"), delim = "\t", skip = 3, quote = "")

# Good old base works....
read.delim(tmp_file, quote = "", header = FALSE, col.names = c("a", "b", "c"), skip = 3)

This causes all sorts of issues in user land as in many cases skip will silently skip an inconsistent number of data lines because of this bug. Any pointers on how to fix this? I am back to using base now but I would rather use {vroom} for speed...

juangomezduaso · 2021-01-04T13:16:04Z

@jimhester:
I think that e1f86e5 doesn't solve this issue for read_fwf().
It seems to me that it would be easy to fix by just appending "skip_quote = FALSE" to the two calls to datasource() that read_fwf() makes, but I am not confident enough to open a pull request, sorry.

We only want to try to find embedded newlines when skipping lines if our format uses quoting. Otherwise we don't want to check for quoted newlines when skipping.

Fixes tidyverse/readr#991 (comment)

juangomezduaso mentioned this issue Apr 15, 2019

read_lines() w/ non-zero skip gets out of sync for longer files. #986

Closed

jimhester added the bug (an unexpected problem or unintended behavior) label May 3, 2019

jimhester closed this as completed in e1f86e5 Sep 14, 2020

estroger34 mentioned this issue Feb 22, 2021

skip bug in read_table2 if unpaired quotes in data #1180

Closed
