Odd behavior from read_csv with na_values set to non-string values #3611
Comments
@rhstanton give a try with master, this should be fixed (via #3615) |
Thanks! The behavior in the last two cases is now consistent. However, using 999 does not register as NA when the input value is (float) 999.0, and vice versa. I'm not really sure what should happen here, or even if this should work at all instead of flagging an error (should this insist on string values?). If it is supposed to work with numeric instead of string values (which I'd be quite happy with), wouldn't it then make sense for (say) 999 to work with both integer and float inputs equal to 999? After all, if I set i = 999, then both i == 999 and i == 999.0 return True. |
999 and 999.0 are different strings, which is how this is matched. You can set different values per column as well (pass a dict). |
'999' and '999.0' are different strings, but in these examples we're passing numbers, not strings, and 999 == 999.0 returns True. If it's going to work at all when passing numbers, I think my expectation would be for the results to be based on a numerical comparison, not a string comparison. Thanks for the dict tip. That could be useful. |
keep in mind that this is an exact string match; these don't have dtypes at this point. that said, you have a valid point, so you are proposing adding the float versions of ints and vice versa if specified in na_values. pls open an enhancement issue for this. note you can do this: read in with no NA match for 999/999.0, then replace after parsing; will work for ints and floats (you can of course column restrict) |
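The snippet that originally accompanied this comment isn't preserved in this transcript; here is a minimal sketch of the workaround being described, assuming it used DataFrame.replace (the exact call is an assumption):

import numpy as np
import pandas as pd

# Read with no na_values, so 999/999.0 survive parsing as ordinary numbers.
df = pd.read_csv('test.csv', sep=' ', header=0)

# Replace after parsing; this is a numeric comparison, so it catches the
# sentinel whether a column came out as int or float.
df = df.replace([999, 999.0], np.nan)

# Column-restricted version:
df['A'] = df['A'].replace(999, np.nan)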
This is a problem for me as well, despite having a very recent build of pandas. In my dataset, a particular numeric value denotes missing data; however, this value was not picked up as being NaN:

I have tried using a float and an integer, but neither worked. Also, specifying the columns for replacement with a dict does not work. Here is the dataset that I am using. Also, the solution you suggest above does not work if even a single column is of a different type:
So, it appears that the only solution is to manually check which columns contain the value, and replace them in a loop. |
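A minimal sketch of that manual loop (the file name and sentinel are illustrative, not from the original dataset):

import numpy as np
import pandas as pd

sentinel = 999
df = pd.read_csv('dataset.csv')  # hypothetical file

# Check each column for the sentinel and replace only where it appears,
# leaving unrelated (e.g. string) columns untouched.
for col in df.columns:
    if (df[col] == sentinel).any():
        df[col] = df[col].replace(sentinel, np.nan)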
If you pass |
Unfortunately not. |
This was a bug (introduced when I fixed this originally). Also, a slight API change: na_values will now take a value like 88 (as an int) and match '88', 88, 88.0 (string/numeric dtypes), whatever the resulting dtype of the column. @fonnesbeck it seems natural to read this dataset with na_values set this way; pls try this out if you can (via the PR)
|
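To illustrate the API change being described, a small sketch (the data here is made up):

import pandas as pd
from io import StringIO

data = "A B\n88 1.5\n2 88.0\n3 4.5"

# A single int sentinel should now match '88', 88 and 88.0, whatever
# dtype each column ends up with: both A[0] and B[1] become NaN.
df = pd.read_csv(StringIO(data), sep=' ', na_values=[88])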
Sorry, the original file included a patient ID in column 0, which I had to remove before sharing. I will try the fix. |
It is still failing for me, unfortunately:
|
did you try this PR? (it's not in master) |
I just merged it...try master now |
It sure helps when I build the right code. Sorry. Works for me now, yes. |
great! |
I like the sound of this edit, but it either doesn't quite work or I'm misinterpreting what it's supposed to do. Here's an example, where I want -9988 to denote missing data, and where the data file, test.csv, contains both '-9988' and '-9988.0':

Field1,Field2,Field3

The command

read_csv('data/test.csv', header=0, sep=',', na_values=[-9988])

yields the output

0 NaN -9988 5.6 |
Are you on latest master?

In [9]: pd.version
In [4]: path = 'test.csv'
In [5]: df = DataFrame({'A' : [-999, 2, 3], 'B' : [1.2, -999, 4.5]})
In [10]: df
In [6]: df.to_csv(path, sep=' ', index=False)
In [7]: read_csv(path, sep=' ', header=0, na_values=['-999.0','-999'])
In [8]: read_csv(path, sep=' ', header=0, na_values=[-999,-999.0])
|
I downloaded the latest master this morning, so it should be up to date. I may just be misunderstanding how this is supposed to work. I only passed the integer version of the missing data indicator (999 in your example), assuming this would also match against the float value (999.0) in my csv file, but it doesn't seem to. How is it supposed to work? |
hmmm....I think this is not working like it should.... |
@rhstanton if you would try this PR it would be appreciated. I believe I have fixed the issue, which basically was this: putting 999 (as an int) should have generated na_values of '999' and '999.0' (as strings), as well as 999 as an int and 999.0 as a float, and the same values should be generated whichever of those forms you pass. I think it's fixed now, so if you pass any of [999, 999.0, '999', '999.0'] you will get the same end result (which I think is right) |
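A rough sketch of the expansion being described (loosely modeled on the idea, not on pandas' actual _stringify_na_values code):

def expand_na_value(v):
    # From one sentinel, build the string and numeric forms that the
    # parser's string-level matching needs to see.
    forms = {str(v)}
    try:
        f = float(v)
        forms.add(f)
        if f == int(f):
            forms.add('%d' % int(f))   # '999'
            forms.add('%.1f' % f)      # '999.0'
    except (TypeError, ValueError):
        pass
    return forms

# Any of the four spellings expands to the same set:
# expand_na_value(999) == expand_na_value(999.0)
#                      == expand_na_value('999') == expand_na_value('999.0')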
I'd be happy to do so, but how do you use a PR? I've only pulled the master repository in the past. |
very similar to how you use master: check out the PR branch, build, and then test out from that dir (you can rename the final GH3611_2 if you want) |
Thanks. It seems to work fine if the number in the csv file has only a single zero after the decimal point, but (as long as I really am using the right version now...) it fails if you add an extra zero (i.e., make it 999.00 in the csv file). In this case, even passing 999.00 fails (as a float it's the same value as 999.0). The only way to get this one to work is to explicitly include the string '999.00' |
yep....not exactly sure what to do about that; you would have to specify it exactly in that case; however a float will normally be written as '999.0', so we cover that case...(or as an integer, which is written '999') |
the basic issue is we are doing string matching, as the column hasn't been converted (it's still a string), e.g. you could have, say, the string NaN or whatever embedded, so we can't convert it yet; the other way to deal with this is to NOT specify na_values, then do something like: df[df==999.0] = np.nan, which is a float conversion |
@wesm any thoughts here? |
Yup. Would it be a pain to do the matching after conversion? |
What about using regular expressions in the string matching (maybe this is already done...)? That would allow me to specify (say) 999 followed by an arbitrary number of zeros. |
these are hash-set matched, done via C code |
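To make the constraint concrete, a toy version of that per-token matching (in Python rather than the actual C code):

na_set = {'', 'NaN', '999', '999.0'}

def is_na_token(token):
    # O(1) exact membership test; there is no pattern matching here,
    # which is why a regex like r'999(\.0*)?' can't be dropped in.
    return token in na_set

is_na_token('999.0')   # True
is_na_token('999.00')  # False: not an exact member of the set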
@rhstanton alright....update that branch (git pull), build again, and try. have to test for floats in a slightly different way, but |
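A guess at what "test for floats in a slightly different way" amounts to (illustrative only, not the actual patch):

float_na_values = {999.0}

def token_is_float_na(token):
    # Fall back from exact string matching to parse-then-compare, so
    # '999.0' and '999.00' both hit the float sentinel.
    try:
        return float(token) in float_na_values
    except ValueError:
        return False

token_is_float_na('999.00')   # True
token_is_float_na('ABC.000')  # False: doesn't parse as a float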
That all now seems to work as expected. Just one thought, though. Shouldn't we treat string and numeric values differently? If I pass the number 999, it makes absolute sense for it to match anything that Python would regard (after parsing) as being equal to that number, including 999, 999.0, 999.00, etc. However, if I pass a string, doesn't it make more sense for the match to be exact? So if I pass (string) '999', this should match against '999', but not against '999.0', etc. After all, if I set na_values = 'ABC', it would be very counterintuitive for this to match against 'ABC.000' |
this only will happen for floats, thus 'ABC.000' would not be a matching value. if you can construct a case that would be great, but I think passing |
I guessed that 'ABC.000' would not match! I really just meant to use it to |
ok...great...merging this in |
Hi @jreback, for some reason I still see this type of odd behavior in pandas 2.2.1, even though, based on this thread, the issue was fixed 11 years ago. Before I saw this thread, I posted a question and @rhshadrach gave me some suggestions. It seems like your changes to _stringify_na_values() in #3841 are no longer included. I built version 3.0.0.dev0+1320.gd093fae3cd. Any comments or suggestions, @jreback? |
read_csv behaves oddly when na_values is set to non-string values. Sometimes
it correctly replaces the assigned number with NaN, and sometimes it doesn't. Here are some examples. Note in particular the different behavior of the last two statements:
Create file:

df = DataFrame({'A' : [-999, 2, 3], 'B' : [1.2, -999, 4.5]})
df.to_csv('test2.csv', sep=' ', index=False)

print read_csv('test2.csv', sep=' ', header=0, na_values=[-999])

     A      B
0  NaN    1.2
1    2 -999.0
2    3    4.5

print read_csv('test2.csv', sep=' ', header=0, na_values=[-999.0])

      A    B
0  -999  1.2
1     2  NaN
2     3  4.5

print read_csv('test2.csv', sep=' ', header=0, na_values=[-999.0,-999])

      A    B
0  -999  1.2
1     2  NaN
2     3  4.5

print read_csv('test2.csv', sep=' ', header=0, na_values=[-999,-999.0])

     A      B
0  NaN    1.2
1    2 -999.0
2    3    4.5