read_csv's na_values dict format cannot parse float type #12224

Closed
cboettig opened this issue Feb 3, 2016 · 10 comments
Labels
Bug IO CSV read_csv, to_csv Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@cboettig

cboettig commented Feb 3, 2016

Minor issue regarding read_csv's na_values argument in dict format. I note that the list format works fine when the NA value is given as a float-type (which is often the intuitive choice), e.g.:

co2 = pd.read_csv("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt",
                  comment="#", delim_whitespace=True,
                  names=["year", "month", "decimal_date", "average", "interpolated", "trend", "days"],
                  na_values=[-99.99, -1])

However, the dict format is more appropriate for this classic data set, since different columns use different NA values. Unfortunately, this fails with an error about the float type:

co2 = pd.read_csv("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt",
                  comment="#", delim_whitespace=True,
                  names=["year", "month", "decimal_date", "average", "interpolated", "trend", "days"],
                  na_values={"decimal_date": -99.99, "days": -1})

Instead, the NA values must be given as strings, which feels all kinds of wrong here:

co2 = pd.read_csv("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt",
                  comment="#", delim_whitespace=True,
                  names=["year", "month", "decimal_date", "average", "interpolated", "trend", "days"],
                  na_values={"decimal_date": "-99.99", "days": "-1"})

Thanks for all the pandas awesomeness,

@BranYang
Contributor

BranYang commented Feb 4, 2016

Tried the failing call:

In [4]: co2 = pd.read_csv("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt",
   ...:                  comment = "#", delim_whitespace = True,
   ...:                 names = ["year", "month", "decimal_date", "average", "interpolated", "trend", "days"],
   ...:                 na_values = {"decimal_date" : -99.99, "days" : -1})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

....
truncated by bran
....

pandas\io\parsers.py in _clean_na_values(na_values, keep_default_na)
   2210         if keep_default_na:
   2211             for k, v in compat.iteritems(na_values):
-> 2212                 v = set(list(v)) | _NA_VALUES
   2213                 na_values[k] = v
   2214         na_fvalues = dict([

TypeError: 'int' object is not iterable

It seems that read_csv assumes each value in the na_values dict (something like {key: values}) must be an iterable.
To solve this, maybe we should either accept a single value in the dict as well, or make the requirement clear in the docstring.

Current docstring for na_values

na_values : str, list-like or dict, default None
    Additional strings to recognize as NA/NaN. If dict passed, specific
    per-column NA values

I'd be happy to submit a PR for this.
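
Until scalars are accepted inside the dict, the same effect can be approximated on the caller's side. The following is a minimal sketch, assuming a hypothetical helper named wrap_scalar_na_values (not part of pandas) that wraps every non-iterable value, and every bare string, in a list:

```python
def wrap_scalar_na_values(na_values):
    """Return a copy of a per-column dict with each scalar wrapped in a list.

    Hypothetical helper for illustration only; it is not part of pandas.
    """
    wrapped = {}
    for column, values in na_values.items():
        # A string is iterable, but here it stands for one NA marker,
        # so it is treated as a scalar as well.
        if isinstance(values, (str, bytes)) or not hasattr(values, "__iter__"):
            wrapped[column] = [values]
        else:
            wrapped[column] = list(values)
    return wrapped


# The dict from the failing call above, with scalars turned into lists.
na_spec = wrap_scalar_na_values({"decimal_date": -99.99, "days": -1})
print(na_spec)
```

The resulting dict holds only list values, so it should avoid the "not iterable" path in _clean_na_values shown in the traceback when passed as na_values.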

@jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate IO CSV read_csv, to_csv Bug API Design Difficulty Intermediate and removed API Design labels Feb 10, 2016
@jreback added this to the Next Major Release milestone Feb 10, 2016
@jreback
Contributor

jreback commented Feb 10, 2016

yeah this is an inconsistency.

Separately, I don't recall why we define (different!) _NA_VALUES in parser.pyx and io.parser.py. I think this is a mistake. It might have been added accidentally, as it gives the Python and C parsers different default NA values.

@gfyoung
Member

gfyoung commented Aug 21, 2016

Here's a nice reproducible example:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = '1,2\n2,1'
>>> read_csv(StringIO(data), names=['a', 'b'], na_values={'a': 2, 'b': 1})
...
TypeError: 'int' object is not iterable

Note that this works, however:

>>> read_csv(StringIO(data), names=['a', 'b'], na_values=1)
     a    b
0  NaN  2.0
1  2.0  NaN

@jreback
Contributor

jreback commented Aug 21, 2016

This works if you pass a dict of lists.
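
As a rough illustration of the dict-of-lists form, here is the reproducible example from above with each value wrapped in a list. It is a sketch rather than a definitive recipe: it assumes Python 3 and uses io.StringIO from the standard library in place of pandas.compat.StringIO.

```python
from io import StringIO

import pandas as pd

data = "1,2\n2,1"

# Each dict value is a list, so every value is iterable.
df = pd.read_csv(
    StringIO(data),
    names=["a", "b"],
    na_values={"a": [2], "b": [1]},
)
print(df)
```

Because the NA values are per column, only the 2 in column a and the 1 in column b are treated as missing; the same numbers in the other column are left alone.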

@gfyoung
Member

gfyoung commented Aug 21, 2016

@jreback : Agreed, but if we accept scalars, we should accept them in the dict for consistency.

gfyoung added a commit to forking-repos/pandas that referenced this issue Aug 21, 2016
Update documentation to state that scalars are accepted for
na_values. In addition, accept scalars for the values when a
dictionary is passed in for na_values.

Closes gh-12224.
jorisvandenbossche pushed a commit that referenced this issue Aug 21, 2016
Update documentation to state that scalars are accepted for
na_values. In addition, accept scalars for the values when a
dictionary is passed in for na_values.

Closes gh-12224.
@jorisvandenbossche modified the milestones: 0.19.0, Next Major Release Aug 21, 2016
@neilser

neilser commented Jan 7, 2018

I'm confused - I have 0.22.0 but I still get the "not iterable" error if I pass a dict for na_values with a numeric scalar value in it. What could be causing this?

@gfyoung
Member

gfyoung commented Jan 7, 2018

Can you provide an example to reproduce this?

@neilser

neilser commented Jan 8, 2018

Yes - see my other comment at the bottom of this issue: #1657
(I don't know how to link to the actual comment. Direct link to comment: #1657 (comment)
Btw, I'm such a github noob that I'm unsure if I should have just opened a new issue, or commented in previous issues addressing the same problem; I chose the latter...)

When I issue the dict version of the read_csv from that comment, but remove the quotes from one or more of the numeric values, such that this code

df = pd.read_csv("hack.csv", header=None, 
                 keep_default_na=False, na_values={2:"",6:"214.008",1:"blah",0:'113125'})

changes to (say) this code:

df = pd.read_csv("hack.csv", header=None, 
                 keep_default_na=False, na_values={2:"",6:214.008,1:"blah",0:'113125'})

then I immediately get the "not iterable" error.
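
A possible workaround, in line with the dict-of-lists suggestion earlier in the thread, is to wrap the numeric value in a list while keeping keep_default_na=False. The sketch below is simplified and hedged: the contents of hack.csv are not shown here, so the in-memory sample with three columns is a made-up stand-in, and the empty-string entry from the original call is left out.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for hack.csv, whose contents are not shown in this thread.
sample = "113125,blah,214.008\n7,ok,1.5\n"

df = pd.read_csv(
    StringIO(sample),
    header=None,
    keep_default_na=False,
    # Every value is a list, including the single float for column 2.
    na_values={0: ["113125"], 1: ["blah"], 2: [214.008]},
)
print(df.isna())
```

With header=None the columns are labelled by integer position, so the dict keys are integers as in the original call.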

@gfyoung
Member

gfyoung commented Jan 8, 2018

@neilser : How odd! This one indeed slipped through the cracks. Please open another issue with this example and your multiple attempts to read it. There is definitely something buggy.

@neilser

neilser commented Jan 8, 2018

Sure, will create a new issue when I get a moment (probably tomorrow). Thanks for the feedback :-)


6 participants