read_csv coerces all column values to NA if 100 first observations are missing #128

Closed
vincentarelbundock opened this issue Apr 13, 2015 · 7 comments

Comments

@vincentarelbundock

If the first 100 observations of a variable are missing, read_csv overwrites all column values with NAs (unless the column is boolean). Ideally, column type should be determined using the first 100 non-missing values of each column.

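library(readr)
# 200 x 5 data frame of random normals
x = data.frame(matrix(rnorm(1000), ncol=5))
# set the first 100 rows of column X1 to missing
x$X1[1:100] = NA
write.csv(x, file='test.csv', row.names=FALSE)
y = read_csv('test.csv')
y$X1         # all NA, including rows 101-200
problems(y)  # parse problems are attached to the result of read_csv, not to x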
@vincentarelbundock
Author

FWIW, this actually happens a lot with the type of data I tend to work with. It is also not practical to manually specify column types in my particular application, since hundreds of columns raised a warning.

@hadley
Member

hadley commented Apr 13, 2015

It's not possible to look for the first 100 non-missing values, because that doesn't take a bounded amount of time to run - it might have to scan the whole file.
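To illustrate the point about unbounded work, here is a minimal base R sketch. `rows_needed` is a hypothetical helper written for this example, not part of readr. It reports how many rows would have to be read before `k` non-missing values have been seen.

```r
# Hypothetical helper (not a readr function): number of rows that must be
# scanned to collect the first k non-missing values of a column.
rows_needed <- function(values, k = 100) {
  pos <- which(!is.na(values))
  # Fewer than k non-missing values in total: the whole column is scanned.
  if (length(pos) < k) length(values) else pos[k]
}

dense  <- rnorm(1e4)                  # no missing values
sparse <- c(rep(NA, 9950), rnorm(50)) # only 50 non-missing values, at the end

rows_needed(dense)   # stops after the first 100 rows
rows_needed(sparse)  # has to walk all 10000 rows
```

With a fixed row window the cost is the same for every file; with a non-missing window the cost depends on the data and can reach the full file length.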

@vincentarelbundock
Author

I had not thought of that. I guess I'll just keep using read.csv for now. Thanks for your work.

@hadley
Member

hadley commented Apr 13, 2015

Oh hmmmm, I think the reason that this is so painful is that I have a bug in my logic somewhere - if the first 100 values are all missing, it should guess that the column is character, since that ensures you don't lose info
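The intended fallback can be sketched in base R as follows. This is only an illustration under stated assumptions, not readr's actual guessing code: `guess_type`, its arguments, and the set of recognised logical strings are all made up for the example, and the input is assumed to be the raw strings of one column.

```r
# Hypothetical sketch, not readr's implementation: guess a column type from
# the first n_guess raw string values of the column.
guess_type <- function(values, n_guess = 100, na = c("", "NA")) {
  head_vals <- head(values, n_guess)
  seen <- head_vals[!(head_vals %in% na)]
  # No evidence in the window: fall back to character so nothing is lost.
  if (length(seen) == 0) return("character")
  if (all(seen %in% c("TRUE", "FALSE", "T", "F"))) return("logical")
  # Numeric only if every observed value converts cleanly.
  if (!anyNA(suppressWarnings(as.numeric(seen)))) return("double")
  "character"
}

guess_type(c(rep("NA", 100), "1.5", "2.5"))  # window is all missing
guess_type(c("NA", "1.5", "2.5"))            # numbers visible in the window
```

Falling back to character keeps the original text of every cell, so the values can still be converted later once the real type is known.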

@vincentarelbundock
Author

Makes sense. Also, you probably don't want to see a proliferation of arguments, but since the 100 number is arbitrary, it might be useful to allow users to specify how many rows the function checks. For example, I'd be willing to waste a few cpu cycles to check 1000 lines and get good type inference.
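For readers arriving later: readr releases published after this thread added a `guess_max` argument to `read_csv()` that controls how many rows are used for type guessing. The sketch below assumes such a version is installed; the file path and the value 200 are arbitrary choices for the example.

```r
library(readr)

# Same shape as the reproduction above: 200 rows, first 100 of X1 missing.
x <- data.frame(matrix(rnorm(1000), ncol = 5))
x$X1[1:100] <- NA
path <- tempfile(fileext = ".csv")
write.csv(x, file = path, row.names = FALSE)

# guess_max is assumed to be available (it is not in the readr version
# discussed in this thread). Guessing from 200 rows sees the numeric values.
y <- read_csv(path, guess_max = 200)
class(y$X1)
```

A larger `guess_max` trades some read time for more reliable type inference, which is the trade-off proposed in the comment above.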

@artemklevtsov

This looks like a duplicate of #124. Also, readxl has the same bug.

@hadley hadley closed this as completed in 1f352e5 Apr 16, 2015
@hadley
Member

hadley commented Apr 16, 2015

The major annoyingness of this behaviour should be fixed - now all contents will be loaded without errors into a character vector. I'll continue to explore better heuristics for guessing column type.

@lock lock bot locked and limited conversation to collaborators Sep 25, 2018