ENH: should we relax frame/setitem for non-numerics #3037

Closed
jreback opened this issue Mar 13, 2013 · 0 comments

jreback commented Mar 13, 2013

I came across a case where setitem with a boolean mask on a mixed-dtype frame
raises an exception.

This is easily relaxed in the int/float case (and will leave/upcast the int columns as needed).

In [68]: df
Out[68]: 
     0         1         2   3         4  y
35 NaN       NaN       NaN NaN  0.342153  0
40 NaN  0.326323       NaN NaN       NaN  0
43 NaN       NaN  0.290126 NaN       NaN  0
49 NaN  0.326323       NaN NaN       NaN  0
50 NaN  0.391147       NaN NaN       NaN  1

In [75]: df.dtypes
Out[75]: 
0    float64
1    float64
2    float64
3    float64
4    float64
y      int64

This will currently raise because it's mixed_type (this is easily fixed, and I think it should be,
as the IntBlock will upcast if needed).

In [72]: df[df>0.3] = 1

In [73]: df
Out[73]: 
     0   1         2   3   4  y
35 NaN NaN       NaN NaN   1  0
40 NaN   1       NaN NaN NaN  0
43 NaN NaN  0.290126 NaN NaN  0
49 NaN   1       NaN NaN NaN  0
50 NaN   1       NaN NaN NaN  1
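
For anyone who wants to try the int/float case, here is a minimal sketch that rebuilds the frame above and applies the same masked assignment. It assumes a recent pandas release, where I would expect this assignment to go through on an all-numeric frame; the construction of `df` is my own reconstruction from the printed output, not the original code.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Reconstruction of the frame shown above: five float columns plus an int column.
df = pd.DataFrame(
    {
        0: [nan, nan, nan, nan, nan],
        1: [nan, 0.326323, nan, 0.326323, 0.391147],
        2: [nan, nan, 0.290126, nan, nan],
        3: [nan, nan, nan, nan, nan],
        4: [0.342153, nan, nan, nan, nan],
        "y": [0, 0, 0, 0, 1],
    },
    index=[35, 40, 43, 49, 50],
)

# NaN compares as False, so only real values above 0.3 are selected.
mask = df > 0.3

# Masked assignment across float and int columns at once.
df[mask] = 1

print(df)
print(df.dtypes)
```

The value 0.290126 in column 2 is below the threshold and is left alone, and the NaN cells are untouched.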

What about a mixed type that involves non-numerics, though?

In [77]: df
Out[77]: 
     0         1         2   3         4  y   foo
35 NaN       NaN       NaN NaN  0.342153  0  test
40 NaN  0.326323       NaN NaN       NaN  0  test
43 NaN       NaN  0.290126 NaN       NaN  0  test
49 NaN  0.326323       NaN NaN       NaN  0  test
50 NaN  0.391147       NaN NaN       NaN  1  test

In [78]: df.get_dtype_counts()
Out[78]: 
float64    5
int64      1
object     1
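
Side note for readers on newer pandas: `get_dtype_counts()` was removed in pandas 1.0. A rough equivalent, sketched here on my own reconstruction of the frame above, is to count the entries of `df.dtypes`. The explicit `dtype=object` on the `foo` column is an assumption made so that the column is stored as object regardless of the installed version's string handling.

```python
import numpy as np
import pandas as pd

idx = [35, 40, 43, 49, 50]
nan = np.nan

df = pd.DataFrame(
    {
        0: [nan] * 5,
        1: [nan, 0.326323, nan, 0.326323, 0.391147],
        2: [nan, nan, 0.290126, nan, nan],
        3: [nan] * 5,
        4: [0.342153, nan, nan, nan, nan],
        "y": np.array([0, 0, 0, 0, 1], dtype="int64"),
        # Force object storage for the string column.
        "foo": pd.Series(["test"] * 5, index=idx, dtype=object),
    },
    index=idx,
)

# Convert each dtype to its string name, then count occurrences.
counts = df.dtypes.astype(str).value_counts()
print(counts)
```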

Should this raise here, or allow just the non-numerics to 'work'?
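
One way to picture the permissive option is a user-side sketch that applies the mask to the numeric columns only and leaves the object column untouched. This is an illustration of the idea, not the proposed internal change; the small frame and its column names `a`, `y` and `foo` are made up for the example.

```python
import numpy as np
import pandas as pd

idx = [35, 40, 43]
df = pd.DataFrame(
    {
        "a": [np.nan, 0.326323, 0.290126],
        "y": [0, 0, 1],
        "foo": pd.Series(["test"] * 3, index=idx, dtype=object),
    },
    index=idx,
)

# Restrict the comparison and the replacement to numeric columns.
num_cols = df.select_dtypes(include="number").columns
masked = df[num_cols].mask(df[num_cols] > 0.3, 1)

# Write the masked numeric columns back into a copy; 'foo' is never touched.
result = df.copy()
result[num_cols] = masked

print(result)
```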

I am leaning toward allowing a purely numeric frame to work (e.g. mixed int/float),
but raising on this last case (then it's explicit that you did something 'wrong').

Any opinions?
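
To make the stricter option concrete, here is a rough sketch of the policy as a standalone function: allow the masked replacement when every column is numeric, and raise otherwise. `masked_set_numeric_only` is a hypothetical helper written for this illustration, not a pandas API, and it returns a new frame rather than assigning in place.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def masked_set_numeric_only(frame, mask, value):
    """Hypothetical helper: masked replacement that refuses non-numeric frames."""
    non_numeric = [col for col in frame.columns if not is_numeric_dtype(frame[col])]
    if non_numeric:
        # Be explicit that the caller did something 'wrong'.
        raise TypeError(
            f"boolean setitem needs an all-numeric frame; non-numeric columns: {non_numeric}"
        )
    # Replace values where the mask is True, leaving everything else as is.
    return frame.mask(mask, value)


idx = [35, 40, 43]
numeric = pd.DataFrame(
    {"a": [np.nan, 0.326323, 0.290126], "y": [0, 0, 1]},
    index=idx,
)
mixed = numeric.assign(foo=pd.Series(["test"] * 3, index=idx, dtype=object))

# Mixed int/float: allowed.
ok = masked_set_numeric_only(numeric, numeric > 0.3, 1)

# Frame with an object column: rejected before the mask is applied.
try:
    masked_set_numeric_only(mixed, numeric > 0.3, 1)
    raised = False
except TypeError:
    raised = True

print(ok)
print("raised on mixed frame:", raised)
```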

Note that the getitem case works on mixed frames:

In [80]: df[df>0.3]
Out[80]: 
      0          1    2    3          4    y   foo
35  NaN        NaN  NaN  NaN  0.3421533  NaN  test
40  NaN  0.3263232  NaN  NaN        NaN  NaN  test
43  NaN        NaN  NaN  NaN        NaN  NaN  test
49  NaN  0.3263232  NaN  NaN        NaN  NaN  test
50  NaN  0.3911472  NaN  NaN        NaN    1  test

This would also preclude a pathological case (and maybe this is another bug) where
you can fillna the frame below and it doesn't convert to float64, so it 'looks' numeric but actually isn't.
Note: I am letting this go through (i.e. only try the numeric case if the mixed-type path fails), more
for backward compatibility than anything else.

In [94]: df = DataFrame({"col1": [2, 5.0, 123, None],
   ....:                         "col2": [1, 2, 3, 4]}, dtype=object)

In [95]: df
Out[95]: 
   col1 col2
0     2    1
1     5    2
2   123    3
3  None    4

In [96]: df.dtypes
Out[96]: 
col1    object
col2    object
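
A small sketch of why this frame is awkward: every value looks numeric when printed, but the columns are stored as object, so a dtype check does not treat them as numeric until they are converted explicitly. The use of `pd.to_numeric` here is my own choice for the illustration, and the sketch deliberately asserts nothing about what `fillna` does to the dtypes.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Same construction as above: numeric-looking values forced into object columns.
df = pd.DataFrame(
    {"col1": [2, 5.0, 123, None], "col2": [1, 2, 3, 4]},
    dtype=object,
)

# Neither column is reported as numeric while stored as object.
looks_numeric = [is_numeric_dtype(df[col]) for col in df.columns]

# Explicit column-wise conversion; None becomes NaN in the float column.
converted = df.apply(pd.to_numeric)

print(looks_numeric)
print(converted.dtypes)
```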