BUG/Internals: maybe_promote #23833

Open
h-vetinari opened this issue Nov 21, 2018 · 2 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions

Comments

@h-vetinari
Contributor

It seems I found a pretty deep rabbit hole while trying to solve #23823 (which itself came up while trying to solve #23192 / #23604):

maybe_upcast_putmask and maybe_promote are both completely untested (or at least, their names do not appear anywhere in pandas/tests/), and maybe_promote also does not have a docstring. Side note: I ran into a segfault while trying to remove some old numpy compat code from that method in #23796.

Aside from the missing docstring and tests, the behaviour is also incorrect, at least for integer types:

>>> import numpy as np
>>> from pandas.core.dtypes.cast import maybe_promote
>>> maybe_promote(np.dtype('int8'), np.array([10, np.iinfo('int8').max + 1, 12]))
(<class 'numpy.float64'>, nan)

To me, this should clearly upcast to int16 instead of float. Passing an array as fill_value is correct usage: maybe_upcast_putmask does exactly that via maybe_promote(result.dtype, other), and maybe_promote has a dedicated code branch for it.
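A minimal sketch of the int-to-int behaviour argued for here, using only NumPy. The helper name promote_int_dtype is hypothetical (it is not pandas API and not how maybe_promote is implemented), and it only answers which dtype to pick, not which fill_value to return:

```python
import numpy as np

# Candidate dtypes, smallest first.
_SIGNED = [np.dtype(t) for t in ("int8", "int16", "int32", "int64")]
_UNSIGNED = [np.dtype(t) for t in ("uint8", "uint16", "uint32", "uint64")]


def promote_int_dtype(dtype, fill_value):
    """Hypothetical helper: smallest integer dtype that can hold both the
    full range of `dtype` and every value in `fill_value`; object if none can.
    """
    dtype = np.dtype(dtype)
    if dtype.kind not in "iu":
        raise TypeError("this sketch only handles integer dtypes")

    # Go through object dtype so the values become exact Python ints.
    flat = np.atleast_1d(np.asarray(fill_value, dtype=object)).ravel()
    values = [int(v) for v in flat]

    info = np.iinfo(dtype)
    lo = min([info.min, *values])
    hi = max([info.max, *values])

    # Stay unsigned only if we started unsigned and nothing is negative.
    candidates = _UNSIGNED if dtype.kind == "u" and lo >= 0 else _SIGNED
    for cand in candidates:
        cand_info = np.iinfo(cand)
        if cand_info.min <= lo and hi <= cand_info.max:
            return cand
    return np.dtype(object)
```

Under these assumptions, promote_int_dtype(np.dtype('int8'), np.array([10, 128, 12])) gives int16 rather than float64, and a value too large for any 64-bit integer falls back to object.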

For int-to-int promotion, though, the question is what to return as the actual fill_value. This method is used in pretty central code paths, but the number of uses is not that high (on master; half of the instances below are imports/redefinitions).

pandas/core\algorithms.py:12:    maybe_promote, construct_1d_object_array_from_listlike)
pandas/core\algorithms.py:1572:        _maybe_promote to determine this type for any fill_value
pandas/core\algorithms.py:1617:            dtype, fill_value = maybe_promote(arr.dtype, fill_value)
pandas/core\algorithms.py:1700:            dtype, fill_value = maybe_promote(arr.dtype, fill_value)
pandas/core\dtypes\cast.py:228:        new_dtype, _ = maybe_promote(result.dtype, other)
pandas/core\dtypes\cast.py:252:def maybe_promote(dtype, fill_value=np.nan):
pandas/core\dtypes\cast.py:538:        new_dtype, fill_value = maybe_promote(dtype, fill_value)
pandas/core\generic.py:34:from pandas.core.dtypes.cast import maybe_promote, maybe_upcast_putmask
pandas/core\generic.py:8289:                            dtype, fill_value = maybe_promote(other.dtype)
pandas/core\indexes\base.py:3371:        pself, ptarget = self._maybe_promote(target)
pandas/core\indexes\base.py:3505:        pself, ptarget = self._maybe_promote(target)
pandas/core\indexes\base.py:3528:    def _maybe_promote(self, other):
pandas/core\indexes\datetimes.py:924:    def _maybe_promote(self, other):
pandas/core\indexes\timedeltas.py:409:    def _maybe_promote(self, other):
pandas/core\internals\blocks.py:45:    maybe_promote,
pandas/core\internals\blocks.py:899:            dtype, _ = maybe_promote(arr_value.dtype)
pandas/core\internals\blocks.py:1054:                    dtype, _ = maybe_promote(n.dtype)
pandas/core\internals\blocks.py:3174:        dtype, fill_value = maybe_promote(values.dtype)
pandas/core\internals\blocks.py:3293:    dtype, _ = maybe_promote(n.dtype)
pandas/core\internals\concat.py:19:from pandas.core.dtypes.cast import maybe_promote
pandas/core\internals\concat.py:137:            return _get_dtype(maybe_promote(self.block.dtype,
pandas/core\internals\managers.py:22:    maybe_promote,
pandas/core\internals\managers.py:1277:                    _, fill_value = maybe_promote(blk.dtype)
pandas/core\reshape\reshape.py:12:from pandas.core.dtypes.cast import maybe_promote
pandas/core\reshape\reshape.py:192:            dtype, fill_value = maybe_promote(values.dtype, self.fill_value)

Therefore it might make sense to adapt the private API, e.g. by adding a kwarg must_hold_na and/or return_default_na. I've inspected all the occurrences listed above, and this would not be a problem to implement.

Once I get around to it, will probably split this into two PRs, one just for adding tests/docstring, and one to change...

gfyoung added the Dtype Conversions, Algos and Bug labels Nov 21, 2018
@gfyoung
Member

gfyoung commented Nov 21, 2018

Sounds like a plan. Go for it!

@h-vetinari
Contributor Author

Here is an extract of the bugs found by #23982. That PR does not fix them; they should be solved in a follow-up.

>>> import numpy as np
>>> from pandas.core.dtypes.cast import maybe_promote
>>>
>>> # should be int16, not int32
>>> maybe_promote(np.dtype('int8'), np.iinfo('int8').max + 1)
(dtype('int32'), 128)
>>>
>>> # should be object, not raise
>>> maybe_promote(np.dtype(int), np.iinfo('uint64').max + 1)
Traceback (most recent call last):
[...]
OverflowError: Python int too large to convert to C long
>>>
>>> # should stay unsigned, not switch to signed
>>> maybe_promote(np.dtype('uint8'), np.iinfo('uint8').max + 1)
(dtype('int32'), 256)
>>>
>>> # should cast to int16, not int32
>>> maybe_promote(np.dtype('uint8'), np.iinfo('int8').min - 1)
(dtype('int32'), -129)
>>> 
>>> # should stay int
>>> maybe_promote(np.dtype('int64'), np.array([1]))
(<class 'numpy.float64'>, nan)
>>>
>>> # should upcast to object, not float
>>> maybe_promote(np.dtype('int64'), np.array([np.iinfo('int64').max + 1]))
(<class 'numpy.float64'>, nan)
>>>
>>> # should only upcast to float32
>>> maybe_promote(np.dtype('int8'), np.array([1], dtype='float32'))
(<class 'numpy.float64'>, nan)
>>>
>>> # should upcast to float64
>>> maybe_promote(np.dtype('float32'), np.finfo('float32').max * 1.1)
(dtype('float32'), 3.7431058130238175e+38)
>>>
>>> # should only upcast to complex64, not complex128
>>> maybe_promote(np.dtype('float32'), 1 + 1j)
(<class 'numpy.complex128'>, (1+1j))
>>>
>>> # should not upcast
>>> maybe_promote(np.dtype('bool'), np.array([True]))
(<class 'numpy.object_'>, nan)
>>>
>>> # should still return nan, not iNaT
>>> maybe_promote(np.dtype('bool'), np.array([1], dtype=np.dtype('datetime64[ns]')))
(<class 'numpy.object_'>, -9223372036854775808)
>>>
>>> # inconsistently transforms fill_value
>>> maybe_promote(np.dtype('datetime64[ns]'), True)
(<class 'numpy.object_'>, nan)
>>>
>>> # should upcast to object
>>> maybe_promote(np.dtype('datetime64[ns]'), np.array([True]))
(dtype('<M8[ns]'), -9223372036854775808)
>>>
>>> # should upcast to object
>>> maybe_promote(np.dtype('bytes'), np.array([True]))
(dtype('S'), nan)
>>>
>>> # inconsistently transforms fill_value
>>> maybe_promote(np.dtype('datetime64[ns]'), np.datetime64(1, 'ns'))
(dtype('<M8[ns]'), 1)
>>>
>>> # should (IMO) cast to object (cf below)
>>> maybe_promote(np.dtype('datetime64[ns]'), 1e10)
(dtype('<M8[ns]'), 10000000000)
>>>
>>> # inconsistently transforms fill_value
>>> maybe_promote(np.dtype('datetime64[ns]'), 1e20)
(<class 'numpy.object_'>, nan)
>>>
>>> # does not upcast correctly (but implicitly turns string to int)
>>> maybe_promote(np.dtype('timedelta64[ns]'), '1')
(dtype('<m8[ns]'), 1)
>>>
>>> # should upcast to object
>>> maybe_promote(np.dtype('int64'), np.timedelta64(1, 'ns'))
(dtype('<m8[ns]'), numpy.timedelta64(1,'ns'))
>>>
>>> # should upcast to object
>>> maybe_promote(np.dtype('float64'), np.timedelta64(1, 'ns'))
(dtype('float64'), numpy.timedelta64(1,'ns'))
>>>
>>> # should upcast to object
>>> maybe_promote(np.dtype('int64'), np.array([1], dtype=str))
(<class 'numpy.float64'>, nan)
>>>
>>> # should upcast to object
>>> maybe_promote(np.dtype(bytes), np.nan)
(dtype('S'), nan)
>>>
>>> # falsely mangles None into NaN
>>> maybe_promote(np.dtype(object), None)
(<class 'numpy.object_'>, nan)
