read_csv coerces all column values to NA if 100 first observations are missing #128

vincentarelbundock · 2015-04-13T15:33:23Z

If the first 100 observations of a variable are missing, read_csv overwrites all column values with NAs (unless the column is boolean). Ideally, column type should be determined using the first 100 non-missing values of each column.

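library(readr)
# 200 rows x 5 numeric columns
x = data.frame(matrix(rnorm(1000), ncol=5))
# Blank out the first 100 values of X1
x$X1[1:100] = NA
write.csv(x, file='test.csv', row.names=FALSE)
y = read_csv('test.csv')
y$X1         # every value is NA, including rows 101-200
problems(y)  # parsing problems are recorded on the result of read_csv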


vincentarelbundock · 2015-04-13T15:41:25Z

FWIW, this actually happens a lot with the type of data I tend to work with. It is also not practical to specify column types manually in my particular application, since hundreds of columns raise a warning.

hadley · 2015-04-13T18:31:42Z

It's not possible to look for the first 100 non-missing values, because that doesn't take a bounded amount of time to run - it might have to scan the whole file.

vincentarelbundock · 2015-04-13T18:34:54Z

I had not thought of that. I guess I'll just keep using read.csv for now. Thanks for your work.

hadley · 2015-04-13T20:04:57Z

Oh hmmmm, I think the reason that this is so painful is that I have a bug in my logic somewhere - if the first 100 values are all missing, it should guess that the column is character, since that ensures you don't lose info
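The fallback described here can be sketched in base R. This is only an illustration of the idea, not readr's actual implementation: `guess_type` and its `n` argument are hypothetical names, and the set of strings treated as logical is an assumption.

```r
# Hypothetical helper: guess a column type from the first n raw values.
# Not readr's code; a minimal sketch of the "fall back to character" rule.
guess_type <- function(values, n = 100) {
  head_vals <- head(values, n)
  present <- head_vals[!is.na(head_vals) & head_vals != ""]
  # Nothing to go on: keep the raw text so no information is lost.
  if (length(present) == 0) return("character")
  if (all(present %in% c("TRUE", "FALSE", "T", "F"))) return("logical")
  if (!anyNA(suppressWarnings(as.numeric(present)))) return("double")
  "character"
}

raw <- c(rep(NA_character_, 100), "0.5", "-1.2")
guess_type(raw)           # "character": the first 100 values are all missing
guess_type(raw, n = 102)  # "double": the wider window reaches real values
```

The scan stays bounded at `n` values, and a column with no evidence in that window is kept as text rather than being overwritten with NAs.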

vincentarelbundock · 2015-04-14T12:36:06Z

Makes sense. Also, you probably don't want to see a proliferation of arguments, but since the 100 number is arbitrary, it might be useful to allow users to specify how many rows the function checks. For example, I'd be willing to waste a few cpu cycles to check 1000 lines and get good type inference.
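For readers arriving later: subsequent readr releases added a `guess_max` argument to `read_csv` that controls how many rows are inspected when guessing column types. It did not exist when this issue was filed. A minimal sketch on the data from the original report, writing to a temporary file (an assumption made here so the example does not touch the working directory):

```r
library(readr)

# Recreate the file from the report: X1 is missing for its first 100 rows.
x <- data.frame(matrix(rnorm(1000), ncol = 5))
x$X1[1:100] <- NA
path <- tempfile(fileext = ".csv")
write.csv(x, file = path, row.names = FALSE)

# guess_max sets how many rows are inspected when guessing column types.
y <- read_csv(path, guess_max = 1000)
sum(!is.na(y$X1))  # counts the values in X1 that survived parsing
```

A larger `guess_max` costs more time up front, which is the trade-off described above.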

artemklevtsov · 2015-04-15T12:08:13Z

Seems to be a duplicate of #124. Also, readxl has the same bug.

hadley · 2015-04-16T21:40:07Z

The major annoyingness of this behaviour should be fixed - now all contents will be loaded without errors into a character vector. I'll continue to explore better heuristics for guessing column type.

hadley closed this as completed in 1f352e5 Apr 16, 2015

lock bot locked and limited conversation to collaborators Sep 25, 2018
