read_csv coerces all column values to NA if the first 100 observations are missing #128

If the first 100 observations of a variable are missing, read_csv overwrites all column values with NAs (unless the column is boolean). Ideally, the column type should be determined from the first 100 non-missing values of each column.
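A minimal way to reproduce the behaviour described in this issue (an illustrative sketch added by the editor; the file contents and column names below are made up):

```r
library(readr)

# Build a CSV whose "value" column is empty for the first 150 rows and
# only contains numbers afterwards.
path <- tempfile(fileext = ".csv")
writeLines(
  c("id,value",
    paste0(seq_len(150), ","),        # first 150 rows: value is missing
    paste0(151:200, ",", 151:200)),   # remaining rows: value is numeric
  path
)

df <- read_csv(path)
# Under the behaviour reported here, every entry of df$value comes back NA,
# because the column type was guessed from the first 100 (all-missing) rows.
tail(df)
```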
Comments
FWIW, this actually happens a lot with the type of data I tend to work with. It is also not practical to manually specify column types in my particular application, since hundreds of columns raised a warning.
It's not possible to look for the first 100 non-missing values, because that doesn't take a bounded amount of time to run - it might have to scan the whole file.
I had not thought of that. I guess I'll just keep using
Oh hmmmm, I think the reason that this is so painful is that I have a bug in my logic somewhere - if the first 100 values are all missing, it should guess that the column is character, since that ensures you don't lose info
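A rough R sketch of the rule described in that comment (readr's real guessing code is in C++; the guess_column_type function below is a made-up illustration, not readr's API):

```r
# Guess a column type from its first n values; if they are all missing,
# fall back to character so later non-missing values are never lost.
guess_column_type <- function(x, n = 100) {
  sample_vals <- head(x, n)
  non_missing <- sample_vals[!is.na(sample_vals) & sample_vals != ""]
  if (length(non_missing) == 0) {
    return("character")
  }
  if (!anyNA(as.logical(non_missing))) {
    return("logical")
  }
  if (!anyNA(suppressWarnings(as.numeric(non_missing)))) {
    return("double")
  }
  "character"
}

guess_column_type(c(rep(NA, 100), "3.14"))  # "character", so no values are lost
```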
Makes sense. Also, you probably don't want to see a proliferation of arguments, but since the 100 number is arbitrary, it might be useful to allow users to specify how many rows the function checks. For example, I'd be willing to waste a few CPU cycles to check 1000 lines and get good type inference.
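For what it's worth, later readr releases do expose a knob along these lines; a brief sketch, assuming a reasonably recent readr version and a hypothetical file name:

```r
library(readr)

# guess_max controls how many rows are used for column-type guessing
# (recent readr versions already default to more than 100 rows).
df <- read_csv("wide_survey.csv", guess_max = 10000)

# Inspect what was guessed, then override only the problem columns if needed.
spec(df)
```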
Seems a duplicate of #124. Also
The major annoyingness of this behaviour should be fixed - now all contents will be loaded without errors into a character vector. I'll continue to explore better heuristics for guessing column type.
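Building on that fix: once everything comes in as character, the types can be re-guessed from the full columns rather than only the first rows. A short sketch assuming a current readr release (cols(.default = ...) and type_convert() are today's API; the file name is hypothetical):

```r
library(readr)

# Force every column to character on the way in so no values are dropped...
raw <- read_csv("sparse_columns.csv",
                col_types = cols(.default = col_character()))

# ...then re-guess column types using all of the data.
typed <- type_convert(raw)
```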