read_csv's na_values dict format cannot parse float type #12224

cboettig · 2016-02-03T17:21:54Z

Minor issue regarding read_csv's na_values argument in dict format. I note that the list format works fine when the NA value is given as a float-type (which is often the intuitive choice), e.g.:

co2 = pd.read_csv("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt", 
                 comment = "#", delim_whitespace = True,
                names = ["year", "month", "decimal_date", "average", "interpolated", "trend", "days"],
                na_values =[-99.99, -1])

However, the dict format is more appropriate for this classic data set, since different columns define different NA values. Unfortunately, this fails with an error about float type:

co2 = pd.read_csv("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt", 
                 comment = "#", delim_whitespace = True,
                names = ["year", "month", "decimal_date", "average", "interpolated", "trend", "days"],
                na_values = {"decimal_date" : -99.99, "days" : -1})

Instead, the NA value must be given as a string, which feels all kinds of wrong here:

co2 = pd.read_csv("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt", 
                 comment = "#", delim_whitespace = True,
                names = ["year", "month", "decimal_date", "average", "interpolated", "trend", "days"],
                na_values = {"decimal_date" : "-99.99", "days" : "-1"})
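To illustrate why the per-column dict form matters, here is a minimal sketch contrasting it with the list form. The two-column table is made up for illustration (it is not the CO2 data set), and the dict values are wrapped in lists so the example does not depend on scalar support:

```python
from io import StringIO

import pandas as pd

# Made-up sample: -1 is a real measurement in "x" but a missing-value marker in "y".
data = "x,y\n-1,5\n3,-1\n"

# List form: -1 is treated as NA in every column.
df_list = pd.read_csv(StringIO(data), na_values=[-1])

# Dict form: -1 is treated as NA only in column "y".
df_dict = pd.read_csv(StringIO(data), na_values={"y": [-1]})

print(df_list)
print(df_dict)
```

With the list form the legitimate -1 in "x" is lost as well, which is the reason the dict form is the better fit when columns use different markers.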

Thanks for all the pandas awesomeness,

The text was updated successfully, but these errors were encountered:

BranYang · 2016-02-04T02:07:36Z

Tried the errored call:

In [4]: co2 = pd.read_csv("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt",
   ...:                  comment = "#", delim_whitespace = True,
   ...:                 names = ["year", "month", "decimal_date", "average", "interpolated", "trend", "days"],
   ...:                 na_values = {"decimal_date" : -99.99, "days" : -1})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

....
truncated by bran
....

pandas\io\parsers.py in _clean_na_values(na_values, keep_default_na)
   2210         if keep_default_na:
   2211             for k, v in compat.iteritems(na_values):
-> 2212                 v = set(list(v)) | _NA_VALUES
   2213                 na_values[k] = v
   2214         na_fvalues = dict([

TypeError: 'int' object is not iterable

It seems that read_csv assumes each value in the na_values dict (something like {key: values}) is an iterable.
To solve this, maybe we should either accept a single value in the dict as well, or make the requirement clear in the docstring.
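One way the "accept a single value" option could look is sketched below in plain Python. The helper name `normalize_na_values` is hypothetical and this is not the pandas implementation; it only shows the idea of wrapping scalars so that later code can always iterate over the per-column values:

```python
from collections.abc import Iterable


def normalize_na_values(na_values):
    """Hypothetical helper: return sets of NA markers, wrapping scalars as needed."""

    def as_set(value):
        # Strings are iterable, but each one should count as a single NA marker.
        if isinstance(value, (str, bytes)) or not isinstance(value, Iterable):
            return {value}
        return set(value)

    if isinstance(na_values, dict):
        # Per-column markers: normalize every dict value independently.
        return {column: as_set(value) for column, value in na_values.items()}
    return as_set(na_values)


normalized = normalize_na_values({"decimal_date": -99.99, "days": [-1, -9]})
print(normalized)
```

After this step, a union with a set of default NA markers would no longer raise on an int or float.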

Current docstring for na_values

na_values : str, list-like or dict, default None
    Additional strings to recognize as NA/NaN. If dict passed, specific
    per-column NA values

I'd be happy to submit a PR for this.

jreback · 2016-02-10T22:03:59Z

yeah this is an inconsistency.

Separately, I don't recall why we have defined (different!) _NA_VALUES in parser.pyx and io.parser.py. I think this is a mistake. It might have been added accidentally, and it gives the Python and C parsers different default NA values.

gfyoung · 2016-08-21T02:45:26Z

Here's a nice reproducible example:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = '1,2\n2,1'
>>> read_csv(StringIO(data), names=['a', 'b'], na_values={'a': 2, 'b': 1})
...
TypeError: 'int' object is not iterable

Note that this works, however:

>>> read_csv(StringIO(data), names=['a', 'b'], na_values=1)
     a    b
0  NaN  2.0
1  2.0  NaN

jreback · 2016-08-21T02:48:30Z

This works if you pass a dict of lists.
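A minimal sketch of the dict-of-lists workaround, reusing the data from the reproducible example in this thread. It uses `io.StringIO` in place of `pandas.compat.StringIO`, which is an assumption made so the snippet also runs on newer pandas versions:

```python
from io import StringIO

import pandas as pd

data = "1,2\n2,1"

# Each dict value is a list, so the parser can iterate over it.
df = pd.read_csv(
    StringIO(data),
    names=["a", "b"],
    na_values={"a": [2], "b": [1]},
)
print(df)
```

Here the 2 in column "a" and the 1 in column "b" (both in the second row) are read as NaN, while the first row is left untouched.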

gfyoung · 2016-08-21T02:53:51Z

@jreback : Agreed, but if we accept scalars, we should accept them in the dict for consistency.

Update documentation to state that scalars are accepted for na_values. In addition, accept scalars for the values when a dictionary is passed in for na_values. Closes gh-12224.

neilser · 2018-01-07T16:57:08Z

I'm confused - I have 0.22.0 but I still get the "not iterable" error if I pass a dict for na_values with a numeric scalar value in it. What can be doing this?

gfyoung · 2018-01-07T18:45:16Z

Can you provide an example to reproduce this?

neilser · 2018-01-08T13:04:13Z

Yes - see my other comment at the bottom of this issue: #1657
(~~I don't know how to link to the actual comment.~~ Direct link to comment: #1657 (comment)
Btw, I'm such a github noob that I'm unsure if I should have just opened a new issue, or commented in previous issues addressing the same problem; I chose the latter...)

When I issue the dict version of the read_csv from that comment, but remove the quotes from one or more of the numeric values, such that this code

df = pd.read_csv("hack.csv", header=None, 
                 keep_default_na=False, na_values={2:"",6:"214.008",1:"blah",0:'113125'})

changes to (say) this code:

df = pd.read_csv("hack.csv", header=None, 
                 keep_default_na=False, na_values={2:"",6:214.008,1:"blah",0:'113125'})

then I immediately get the "not iterable" error.

gfyoung · 2018-01-08T17:18:45Z

@neilser : How odd! This one indeed slipped through the cracks. Please open another issue with this example and your multiple attempts to read it. There is definitely something buggy.

neilser · 2018-01-08T23:14:59Z

Sure, will create a new issue when I get a moment (probably tomorrow). Thanks for the feedback :-)

jreback added the Missing-data, IO CSV, Bug, API Design, and Difficulty Intermediate labels and removed the API Design label Feb 10, 2016

jreback added this to the Next Major Release milestone Feb 10, 2016

gfyoung mentioned this issue Aug 21, 2016

BUG, DOC: Fix inconsistencies with scalar na_values in read_csv #14056

Merged

jorisvandenbossche closed this as completed in #14056 Aug 21, 2016

jorisvandenbossche modified the milestones: 0.19.0, Next Major Release Aug 21, 2016

neilser mentioned this issue Jan 13, 2018

read_csv issues with dict for na_values #19227

Closed
