Odd behavior from read_csv with na_values set to non-string values #3611

rhstanton · 2013-05-15T15:47:37Z

read_csv behaves oddly when na_values is set to non-string values. Sometimes
it correctly replaces the assigned number with NaN, and sometimes it doesn't. Here are some examples. Note in particular the different behavior of the last two statements:

Create file

df = DataFrame({'A' : [-999, 2, 3], 'B' : [1.2, -999, 4.5]})
df.to_csv('test2.csv', sep=' ', index=False)

print read_csv('test2.csv', sep= ' ', header=0, na_values=[-999])

A B
0 NaN 1.2
1 2 -999.0

2 3 4.5

print read_csv('test2.csv', sep= ' ', header=0, na_values=[-999.0])

A B
0 -999 1.2
1 2 NaN

2 3 4.5

print read_csv('test2.csv', sep= ' ', header=0, na_values=[-999.0,-999])

A B
0 -999 1.2
1 2 NaN

2 3 4.5

print read_csv('test2.csv', sep= ' ', header=0, na_values=[-999,-999.0])

A B
0 NaN 1.2
1 2 -999.0

2 3 4.5

jreback · 2013-05-16T01:02:06Z

@rhstanton give a try with master, this should be fixed (via #3615)

rhstanton · 2013-05-16T03:52:36Z

Thanks! The behavior in the last two cases is now consistent.

However, using 999 does not register as na when the input value is (float) 999.0, and vice versa. I’m not really sure what should happen here, or even if this should work at all instead of flagging an error (should this insist on string values?). If it is supposed to work with numeric instead of string values (which I’d be quite happy with), wouldn’t it then make sense for (say) 999 to work with both integer and float inputs equal to 999? After all, if I set i = 999, then both

i == 999

and

i == 999.0

return True.

From: jreback [mailto:notifications@github.com]
Sent: Wednesday, May 15, 2013 6:02 PM
To: pydata/pandas
Cc: Richard Stanton
Subject: Re: [pandas] Odd behavior from read_csv with na_values set to non-string values (#3611)

@rhstantonhttps://github.com/rhstanton give a try with master, this should be fixed (via #3615 #3615)

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/3611#issuecomment-17975374.

jreback · 2013-05-16T04:00:40Z

999 and 999.0 are different strings which is how this matched

u can different values per column as well (pass a dict)

rhstanton · 2013-05-16T04:08:40Z

‘999’ and ‘999.0’ are different strings, but in these examples we’re passing numbers, not strings, and

999 == 999.0

returns True. If it’s going to work at all passing numbers, I think my expectation would be for the results to be based on a numerical comparison, not a string comparison.

Thanks for the dict tip. That could be useful.

From: jreback [mailto:notifications@github.com]
Sent: Wednesday, May 15, 2013 9:01 PM
To: pydata/pandas
Cc: Richard Stanton
Subject: Re: [pandas] Odd behavior from read_csv with na_values set to non-string values (#3611)

999 and 999.0 are different strings which is how this matched

u can different values per column as well (pass a dict)

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/3611#issuecomment-17980206.

jreback · 2013-05-16T09:55:17Z

keep in mind that this is an exact string match, these don't have dtypes at this point. that said, you have a valid point, so you are proposing adding the float versions of ints and vice versa if specified in na_vaues, pls open an enhancement issue for this

note you can do this:

read in with no na match for 999,999.0

df[df==999] == np.nan

will work for ints and floats (you can of course column restrict)

fonnesbeck · 2013-06-05T03:49:18Z

This is a problem for me as well, despite having a very recent build of Pandas. In my dataset, 88 is used as a missing value. Hence, I used the following:

dataset = data_file.parse("Sheet1", index_col=0, na_values=[88])

however, this value was not picked up as being NaN:

dataset.diarrhea.value_counts()

0     282
1     197
88      8
dtype: int64

I have tried using a float and an integer, but neither worked. Also, specifying the columns for replacement with a dict does not work.

Here is the dataset that I am using.

Also, the solution you suggest above does not work if even a single column is of a different type:

dataset[dataset==88] = np.nan

ValueError: Cannot do boolean setting on mixed-type frame

So, it appears that the only solution is to manually check which columns contain the value, and replace them in a loop.

wesm · 2013-06-05T04:41:27Z

If you pass na_values=['88'] does it work?

fonnesbeck · 2013-06-05T04:44:24Z

Unfortunately not.

jreback · 2013-06-05T12:02:57Z

This was a bug (introduced when I fixed this originally)

also, slight API change, na_values will now take a value like 88 (as a int), and match '88',88,88.0 (string/numeric dtypes) in whatever the resulting dtype of the column

@fonnesbeck it seems natural to read this dataset with index_col=None

@fonnesbeck pls try this out if you can (via the PR)

In [5]: pd.read_excel('test.xlsx',"Sheet1", index_col=0, na_values=['88'])
Out[5]: 
<class 'pandas.core.frame.DataFrame'>
Index: 490 entries, 1.0 to 1.0
Data columns (total 36 columns):
race         490  non-null values
ethn         484  non-null values
stcol        484  non-null values
rota         393  non-null values
blood        487  non-null values
saliva       484  non-null values
diarrhea     479  non-null values
qod          160  non-null values
abdpain      333  non-null values
vomit        483  non-null values
dos          478  non-null values
travel       16  non-null values
sickcon      231  non-null values
noc          486  non-null values
type         490  non-null values
abx          28  non-null values
dtp          337  non-null values
fever        472  non-null values
feverval     112  non-null values
pathogens    160  non-null values
futuse       488  non-null values
ecoli        486  non-null values
amc          490  non-null values
am           490  non-null values
cz           490  non-null values
fep          490  non-null values
caz          490  non-null values
cro          490  non-null values
cxm          490  non-null values
cip          490  non-null values
gm           490  non-null values
mem          490  non-null values
tzp          490  non-null values
stx          490  non-null values
case         490  non-null values
month        490  non-null values
dtypes: float64(36)

In [6]: pd.read_excel('test.xlsx',"Sheet1", index_col=0, na_values=['88.0']).head()
Out[6]: 
     race  ethn  stcol  rota  blood  saliva  diarrhea  qod  abdpain  vomit  dos  travel  sickcon  noc  type  abx  dtp  fever  feverval  pathogens  futuse  ecoli  amc  am  cz  fep  caz  cro  cxm  cip  gm  mem  tzp  stx  case  month
sex                                                                                                                                                                                                                                   
1       1     1      1     0      0       1         1  NaN        0      0    4     NaN      NaN    1     0  NaN    8      0       NaN        NaN       1      1    1   2   1    1    1    1    1    1   1    1    1    1     1     12
0       2     2      1     0      0       1         0  NaN      NaN      1    3     NaN        1    0     0  NaN  NaN      1       NaN        NaN       0      0    0   0   0    0    0    0    0    0   0    0    0    0     1     12
0       2     2      0   NaN      0       2         0  NaN        1      1    4     NaN      NaN    0     0  NaN  NaN      0       NaN        NaN       1      0    0   0   0    0    0    0    0    0   0    0    0    0     1      0
1       3     2      1     0      0       1         1    0      NaN      1    4     NaN        1    0     0  NaN  NaN      0       NaN        NaN       1      0    0   0   0    0    0    0    0    0   0    0    0    0     1     12
1       1     1      1     0      0       1         0  NaN        1      1    2     NaN        0    0     0  NaN    3      1       NaN        NaN       0      0    0   0   0    0    0    0    0    0   0    0    0    0     1     12

fonnesbeck · 2013-06-05T12:29:41Z

Sorry, the original file included a patient ID in column 0, which I had to remove before sharing. I will try the fix.

fonnesbeck · 2013-06-05T13:49:06Z

It is still failing for me, unfortunately:

In[11]: pd.read_excel("Ecoli1.xlsx", "Sheet1", na_values=[88]).travel.value_counts()

Out[11]:
88    471
0      14
1       2
dtype: int64

jreback · 2013-06-05T14:01:54Z

did you try this PR? (its not in master)

jreback · 2013-06-05T14:03:00Z

I just merged it...try master now

fonnesbeck · 2013-06-05T14:12:04Z

It sure helps when I build the right code. Sorry. Works for me now, yes.

jreback · 2013-06-05T14:14:54Z

great!
thanks for catching these bugs!
reclosing

rhstanton · 2013-06-10T17:42:25Z

I like the sound of this edit, but it ether doesn't quite work or I'm misinterpreting what its supposed to do. Here's an example, where I want -9988 to denote missing data, and where the data file, test.csv, contains both '-9988' and '-9988.0':

Field1,Field2,Field3
-9988,-9988.0,5.6

The command

read_csv('data/test.csv', header=0, sep=',', na_values=[-9988])

yields the output

Field1  Field2  Field3

0 NaN -9988 5.6

jreback · 2013-06-10T18:07:29Z

Are you on latest master?

In [9]: pd.__version__
Out[9]: '0.11.1.dev-2b7efc3'

In [4]: path = 'test.csv'

In [5]: df = DataFrame({'A' : [-999, 2, 3], 'B' : [1.2, -999, 4.5]})

In [10]: df
Out[10]: 
     A      B
0 -999    1.2
1    2 -999.0
2    3    4.5

In [6]: df.to_csv(path, sep=' ', index=False)

In [7]: read_csv(path, sep= ' ', header=0, na_values=['-999.0','-999'])
Out[7]: 
    A    B
0 NaN  1.2
1   2  NaN
2   3  4.5

In [8]: read_csv(path, sep= ' ', header=0, na_values=[-999,-999.0])
Out[8]: 
    A    B
0 NaN  1.2
1   2  NaN
2   3  4.5

rhstanton · 2013-06-10T18:25:06Z

I downloaded the latest master this morning, so it should be up to date. I may just be misunderstanding how this is supposed to work. I only passed the integer version of the missing data indicator (999 in your example), assuming this would also match against the float value (999.0) in my csv file (e.g., 999.0), but it doesn't seem to. How is it supposed to work?

From: jreback <notifications@github.com mailto:notifications@github.com>
Reply-To: pydata/pandas <reply@reply.github.com mailto:reply@reply.github.com>
Date: Monday, June 10, 2013 11:07 AM
To: pydata/pandas <pandas@noreply.github.com mailto:pandas@noreply.github.com>
Cc: Richard Stanton <stanton@haas.berkeley.edu mailto:stanton@haas.berkeley.edu>
Subject: Re: [pandas] Odd behavior from read_csv with na_values set to non-string values (#3611)

Are you on latest master?

In [9]: pd.version
Out[9]: '0.11.1.dev-2b7efc3'

In [4]: path = 'test.csv'

In [5]: df = DataFrame({'A' : [-999, 2, 3], 'B' : [1.2, -999, 4.5]})

In [10]: df
Out[10]:
A B
0 -999 1.2
1 2 -999.0
2 3 4.5

In [6]: df.to_csv(path, sep=' ', index=False)

In [7]: read_csv(path, sep= ' ', header=0, na_values=['-999.0','-999'])
Out[7]:
A B
0 NaN 1.2
1 2 NaN
2 3 4.5

In [8]: read_csv(path, sep= ' ', header=0, na_values=[-999,-999.0])
Out[8]:
A B
0 NaN 1.2
1 2 NaN
2 3 4.5

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/3611#issuecomment-19216006.

jreback · 2013-06-10T18:31:06Z

hmmm....I think this is not working like it should....

jreback · 2013-06-10T20:42:17Z

@rhstanton if you would try this PR would be appreciated

I believe I have fixed the issue which basically was this. Putting 999 (as an int) should have generated na_values of 999, and 999.0 (as string) as well as 999 as an in , and 999.0 as a float

and the same values should be generated if you pass 999 (as an int)

I think its fixed now as fi you pass any of [999,999.0,'999','999.0'] will get the same end result (which I think it right)

rhstanton · 2013-06-10T23:13:04Z

I'd be happy to do so, but how do you use a PR? I've only pulled the master repository in the past.

From: jreback <notifications@github.com mailto:notifications@github.com>
Reply-To: pydata/pandas <reply@reply.github.com mailto:reply@reply.github.com>
Date: Monday, June 10, 2013 1:42 PM
To: pydata/pandas <pandas@noreply.github.com mailto:pandas@noreply.github.com>
Cc: Richard Stanton <stanton@haas.berkeley.edu mailto:stanton@haas.berkeley.edu>
Subject: Re: [pandas] Odd behavior from read_csv with na_values set to non-string values (#3611)

@rhstantonhttps://github.com/rhstanton if you would try this PR would be appreciated

I believe I have fixed the issue which basically was this. Putting 999 (as an int) should have generated na_values of 999, and 999.0 (as string) as well as 999 as an in , and 999.0 as a float

and the same values should be generated if you pass 999 (as an int)

I think its fixed now as fi you pass any of [999,999.0,'999','999.0'] will get the same end result (which I think it right)

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/3611#issuecomment-19224949.

jreback · 2013-06-10T23:20:26Z

very similar to how you use master:

git clone -b GH3611_2 https://github.com/jreback/pandas.git GH3611_2
cd GH3611_2
python setup.py build_ext --inplace

then test out from that dir (you can rename the final GH3611_2 if you want)

rhstanton · 2013-06-10T23:39:25Z

Thanks. It seems to work fine if the number in the csv file has only a single zero after the decimal point, but (as long as I really am using the right version now...) fails if you add an extra zero (i.e., make it 999.00 in the csv file). In this case, even 999.00 fails. The only way to get this one to work is to explicitly include '999.00'

jreback · 2013-06-10T23:44:06Z

yep....not exactly sure what to do about that; you would have to specifiy it exactly in that case; however a float will normally be written as '999.0' so cover that case...(or using an integer) which is 999

jreback · 2013-06-10T23:45:52Z

the basic issue is we are doing string matching as the column hasn't been converted (its still a string), e.g. you could have say the string NaN or whatever embeded, so can't convert it yet; the other way to deal with this is to NOT specify na values, then do something like: df[df==999.0] = np.nan which is a float conversion

jreback · 2013-06-10T23:46:03Z

@wesm any thoughts here?

rhstanton · 2013-06-10T23:51:34Z

Yup. Would it be a pain to do the matching after conversion?

From: jreback <notifications@github.com mailto:notifications@github.com>
Reply-To: pydata/pandas <reply@reply.github.com mailto:reply@reply.github.com>
Date: Monday, June 10, 2013 4:46 PM
To: pydata/pandas <pandas@noreply.github.com mailto:pandas@noreply.github.com>
Cc: Richard Stanton <stanton@haas.berkeley.edu mailto:stanton@haas.berkeley.edu>
Subject: Re: [pandas] Odd behavior from read_csv with na_values set to non-string values (#3611)

the basic issue is we are doing string matching as the column hasn't been converted (its still a string), e.g. you could have say the string NaN or whatever embeded, so can't convert it yet; the other way to deal with this is to NOT specify na values, then do something like: df[df==999.0] = np.nan which is a float conversion

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/3611#issuecomment-19233408.

rhstanton · 2013-06-11T00:00:37Z

What about using regular expressions in the string matching (maybe this is already done...)? That would allow me to specify (say) 999 followed by an arbitrary number of zeros.

From: jreback <notifications@github.com mailto:notifications@github.com>
Reply-To: pydata/pandas <reply@reply.github.com mailto:reply@reply.github.com>
Date: Monday, June 10, 2013 4:46 PM
To: pydata/pandas <pandas@noreply.github.com mailto:pandas@noreply.github.com>
Cc: Richard Stanton <stanton@haas.berkeley.edu mailto:stanton@haas.berkeley.edu>
Subject: Re: [pandas] Odd behavior from read_csv with na_values set to non-string values (#3611)

@wesmhttps://github.com/wesm any thoughts here?

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/3611#issuecomment-19233417.

jreback · 2013-06-11T00:02:47Z

these are hash set matched done via c-code
I think I know away

jreback · 2013-06-11T13:41:38Z

@rhstanton alrgiht....update that branch

git pull...build again and try

have to test for floats in a slightly different way, but
now passing any of 999,999.0,'999.0','999' should match string/float/ints that are 999, which
I think is what 'makes' sense

rhstanton · 2013-06-11T17:11:19Z

That all now seems to work as expected.

Just one thought, though. Shouldn't we treat string and numeric values differently? If I pass the number 999, it makes absolute sense for it to match anything that Python would regard (after parsing) as being equal to that number, including 999, 999.0, 999.00, etc. However, if I pass a string, doesn't it make more sense for the match to be exact? So if I pass (string) '999', this should match against '999', but not against '999.0', etc. After all, if I set na_values = 'ABC', it would be very counterintuitive for this to match against 'ABC.000'

From: jreback <notifications@github.com mailto:notifications@github.com>
Reply-To: pydata/pandas <reply@reply.github.com mailto:reply@reply.github.com>
Date: Tuesday, June 11, 2013 6:41 AM
To: pydata/pandas <pandas@noreply.github.com mailto:pandas@noreply.github.com>
Cc: Richard Stanton <stanton@haas.berkeley.edu mailto:stanton@haas.berkeley.edu>
Subject: Re: [pandas] Odd behavior from read_csv with na_values set to non-string values (#3611)

@rhstantonhttps://github.com/rhstanton alrgiht....update that branch

git pull...build again and try

have to test for floats in a slightly different way, but
now passing any of 999,999.0,'999.0','999' should match string/float/ints that are 999, which
I think is what 'makes' sense

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/3611#issuecomment-19262144.

jreback · 2013-06-11T17:21:54Z

ths only will happen for floats, thus 'ABC.000' would not be a matching value

if you can construct a case that would be great, but I think passing [999,999.0,'999','999.0'] should be treated identically

rhstanton · 2013-06-11T17:30:01Z

I guessed that 'ABC.000' would not match! I really just meant to use it to
point out that when people match strings, they usually really want to
match the string. On the other hand, especially given the history of using
strings for na_values, it is probably true that when people pass '999.0'
as a string here, they're thinking of it as a number, so I'm quite happy
to leave this where it is. Thanks!

jreback · 2013-06-11T17:43:07Z

ok...great..mergine this in

sshu2017 · 2024-08-11T04:25:53Z

Hi @jreback , for some reason, I still see this type of odd behavior in pandas 2.2.1 - but based on this thread this issue has been fixed 11 years ago.

Before I saw this thread, I posted a question and @rhshadrach gave me some suggestions.

#59303

It seems like your changes on the _stringify_na_values() in the commit #3841 are not included anymore. I built the version 3.0.0.dev0+1320.gd093fae3cd. Any comments or suggestions @jreback ?

jreback mentioned this issue May 15, 2013

BUG: (GH3611) Fix read_csv to correctly encode identical na_values #3615

Merged

jreback closed this as completed in #3615 May 16, 2013

wesm reopened this Jun 5, 2013

jreback mentioned this issue Jun 5, 2013

BUG: (GH3611) revisited; read_excel not passing thru options to ExcelFile.parse #3758

Merged

jreback closed this as completed Jun 5, 2013

jreback reopened this Jun 10, 2013

jreback mentioned this issue Jun 10, 2013

BUG: GH3611 fix again, float na_values were not stringified correctly #3841

Merged

jreback closed this as completed in #3841 Jun 11, 2013

Odd behavior from read_csv with na_values set to non-string values #3611

Odd behavior from read_csv with na_values set to non-string values #3611

Comments

rhstanton commented May 15, 2013

Create file

2 3 4.5

2 3 4.5

2 3 4.5

2 3 4.5

jreback commented May 16, 2013

rhstanton commented May 16, 2013

jreback commented May 16, 2013

rhstanton commented May 16, 2013

jreback commented May 16, 2013

fonnesbeck commented Jun 5, 2013

wesm commented Jun 5, 2013

fonnesbeck commented Jun 5, 2013

jreback commented Jun 5, 2013

fonnesbeck commented Jun 5, 2013

fonnesbeck commented Jun 5, 2013

jreback commented Jun 5, 2013

jreback commented Jun 5, 2013

fonnesbeck commented Jun 5, 2013

jreback commented Jun 5, 2013

rhstanton commented Jun 10, 2013

jreback commented Jun 10, 2013

rhstanton commented Jun 10, 2013

jreback commented Jun 10, 2013

jreback commented Jun 10, 2013

rhstanton commented Jun 10, 2013

jreback commented Jun 10, 2013

rhstanton commented Jun 10, 2013

jreback commented Jun 10, 2013

jreback commented Jun 10, 2013

jreback commented Jun 10, 2013

rhstanton commented Jun 10, 2013

rhstanton commented Jun 11, 2013

jreback commented Jun 11, 2013

jreback commented Jun 11, 2013

rhstanton commented Jun 11, 2013

jreback commented Jun 11, 2013

rhstanton commented Jun 11, 2013

jreback commented Jun 11, 2013

sshu2017 commented Aug 11, 2024