More consistent na_values handling in read_csv #1657
Comments
I agree that it's too confusing/inconsistent as it stands. I added custom NA handling for a previous issue but I should have revisited the whole NA specifying API for read_csv. The fix will be part of 0.8.2 coming out in a few weeks. Thanks for the bug report! |
I added a keep_default_na keyword |
Hey there - I believe this issue is still not completely fixed. Reading in a csv with countries and using both
|
@kayvonr what version of pandas? |
http://pandas.pydata.org/pandas-docs/dev/io.html#na-values
|
pandas 0.12.0 I originally tried it with a list argument first and that presents two different problems:
An argument with any non-empty string causes the "NA" to be read in as an empty string.
Double checked it wasn't just the space with
and an empty string results in the "NA" string being again turned into a NaN |
Issue also present with a dictionary argument to
|
This was fixed by #4374; the fix is in 0.13. |
ah ok, thanks |
Kinda funny, I just ran into a similar issue as @kayvonr with Namibia/NA, it just wasn't obvious to me that |
Probably changing the default behavior to be less magic is not an option anymore, but maybe there is a documentation improvement ticket in here somewhere. |
@makmanalp I mean the docs are pretty clear. If you want to add an example, we would take that. Of course people just don't read the docs...... |
I can't get a dictionary of
Two variants of my code - the one with the list does what I expect, but the dict version doesn't:
df = pd.read_csv("hack.csv", header=None, keep_default_na=False, na_values=[214.008, '', "blah"])
df.head() output:
but the dict version:
df = pd.read_csv("hack.csv", header=None, keep_default_na=False, na_values={2: "", 6: "214.008", 1: "blah", 0: '113125'})
df.head() is paying attention to the columns I specify, and then simply refusing to create any NaNs in those columns:
So... I'm stuck. Any suggestions? I really want to have column-specific NaN handling so I need the dict. |
The current handling of the na_values argument to read_csv is strangely different depending on what kind of value you pass to na_values. If you pass None, the default NA values are used. If you pass a dict mapping column names to values, then those values will be used for those columns, totally overriding the default NA values, while for columns not in the dict, the default values will be used. If you pass some other kind of iterable, it uses the union of the passed values and the default values as the NA values.
This behavior is confusing because sometimes the passed values override the defaults, but other times they just add to the defaults. It's also contrary to the documentation at http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files, which says: "If you pass an empty list or an empty list for a particular column, no values (including empty strings) will be considered NA." But passing an empty list doesn't result in no values being considered NA. In fact, passing an empty list does nothing, since the empty list is unioned with the default NA values, so the default NA values are just used anyway.
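The three cases described above can be modeled in a few lines of plain Python. This is only a sketch of the reported behavior, not the code in parsers.py: the helper name effective_na_values and the small DEFAULT_NA set are made up for illustration.

```python
# Illustrative stand-in for the module-level defaults; the real set is larger.
DEFAULT_NA = {"", "NA", "NaN", "nan"}


def effective_na_values(na_values, column):
    """Return the NA markers that apply to `column` under the behavior
    reported in this issue (hypothetical helper, not pandas code)."""
    if na_values is None:
        # Case 1: nothing passed, so the defaults apply everywhere.
        return set(DEFAULT_NA)
    if isinstance(na_values, dict):
        if column in na_values:
            # Case 2a: listed column, passed values replace the defaults.
            return set(na_values[column])
        # Case 2b: unlisted column, the defaults apply.
        return set(DEFAULT_NA)
    # Case 3: any other iterable is unioned with the defaults.
    return set(na_values) | DEFAULT_NA


# A dict overrides the defaults for its column...
print(effective_na_values({"a": ["x"]}, "a"))
# ...while an empty list changes nothing, which is the inconsistency reported.
print(effective_na_values([], "a") == DEFAULT_NA)
```

In this model an empty list and None are indistinguishable, which is why the documented "empty list means no NA values" promise cannot hold.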
Currently there is no easy way to pass a list of NA values which overrides the default for all columns. You can pass a dict, but then you have to specify the defaults per column. If you pass a list, you're not overriding the defaults, you're adding to them. This makes for confusing behavior when reading CSV files with string data in which strings like "na" and "nan" are valid data and should be read as their literal string values.
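For reference, in pandas versions that have the keep_default_na keyword mentioned in the comments, the all-column override can be written roughly as below. This is a minimal sketch, assuming a reasonably recent pandas; the inline CSV text and the column names are made up for illustration.

```python
import io

import pandas as pd

# Made-up data in which "na" and "nan" are legitimate string values.
csv_text = "word,count\nna,1\nnan,2\nmissing,3\n"

# Default behavior: "nan" is one of the default NA markers, so it is lost.
default_df = pd.read_csv(io.StringIO(csv_text))

# Override: ignore the defaults and treat only "missing" as NA.
override_df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=["missing"],
)

print(default_df["word"].tolist())
print(override_df["word"].tolist())
```

With keep_default_na=False, the list passed as na_values replaces the defaults instead of being added to them.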
There should be a way to pass an all-column set of NA values that overrides the defaults. One possibility would be to have two arguments, something like all_na_values and more_na_values, to specify overriding and additional values, respectively. Another possibility would be to expose the default (currently the module-level _NA_VALUES in parsers.py), and allow users to add to it if they want to add more NA values (e.g., read_csv(na_values=set(['newNA']) | pandas.default_nas)).
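The two-argument proposal could resolve to a single set of NA markers along these lines. This is a sketch only: all_na_values and more_na_values are the names suggested above, not existing read_csv parameters, and resolve_na_values, the DEFAULT_NA subset, and the choice to reject both arguments together are assumptions made for illustration.

```python
# Illustrative subset of the defaults; the real module-level set is larger.
DEFAULT_NA = {"", "NA", "NaN", "nan"}


def resolve_na_values(all_na_values=None, more_na_values=None):
    """Hypothetical resolution of the two proposed arguments."""
    if all_na_values is not None and more_na_values is not None:
        # Assumed design choice: the two arguments are mutually exclusive.
        raise ValueError("pass all_na_values or more_na_values, not both")
    if all_na_values is not None:
        # Override: only the passed values count as NA.
        return set(all_na_values)
    if more_na_values is not None:
        # Extend: the passed values are added to the defaults.
        return DEFAULT_NA | set(more_na_values)
    return set(DEFAULT_NA)


print(resolve_na_values(all_na_values=[]))            # nothing is NA
print(resolve_na_values(more_na_values=["newNA"]))    # defaults plus "newNA"
```

Keeping the override and extend cases in separate arguments means an empty list has an unambiguous meaning in each.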