BUG/API: np.array([0, max_uint64]) has float64 dtype #19146

jbrockmendel · 2021-05-31T21:47:14Z

I expected to get uint64

umax = np.iinfo(np.uint64).max

np.array([umax]).dtype   # <-- uint64, as expected

np.array([0, umax]).dtype  # <-- float64, surprising

There seems to be something special going on inference-wise around the int64 bound:

imax = np.iinfo(np.int64).max

np.array([imax, umax]).dtype  # <-- float64

np.array([imax+1, umax]).dtype  # <-- uint64

charris · 2021-05-31T23:19:53Z

This is the downside of value based type inference of Python scalars. The zero is converted as signed, umax as unsigned. Then int64 plus uint64 -> float.

In [5]: array(np.iinfo(np.uint64).max).dtype                                    
Out[5]: dtype('uint64')

In [6]: array(0).dtype                                                          
Out[6]: dtype('int64')

It isn't a bug, but definitely a wart. The best option when mixing unsigned and signed is to specify the dtype.

EDIT: The way it works is that first conversion to signed is tried. If it fails, then conversion to unsigned is tried.
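A small sketch of the point above, using only long-standing NumPy calls (`np.promote_types`, `np.array` with an explicit `dtype`). It illustrates the workaround and makes no claim about how the inference is implemented internally.

```python
import numpy as np

umax = np.iinfo(np.uint64).max

# int64 and uint64 have no common integer type, so promotion lands on float64.
promoted = np.promote_types(np.int64, np.uint64)
print(promoted)

# Passing the dtype explicitly skips the per-element inference entirely
# and keeps the large value exact.
arr = np.array([0, umax], dtype=np.uint64)
print(arr.dtype, int(arr[1]) == umax)
```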

jbrockmendel · 2021-05-31T23:59:50Z

The way it works is that first conversion to signed is tried. If it fails, then conversion to unsigned is tried.

can you point me to the relevant part of the code?

Context: I'm trying to make pandas inference/constructors do fewer passes.

charris · 2021-06-01T01:01:14Z

can you point me to the relevant part of the code?

Heh, I saw it once upon a time . . . There is similar code in three files: convert.c, scalarapi.c, and abstractdtypes.c. The repetition with some slight differences is unsettling. The last is probably what you are looking for.

static PyArray_Descr *
discover_descriptor_from_pyint(
        PyArray_DTypeMeta *NPY_UNUSED(cls), PyObject *obj)
{
    assert(PyLong_Check(obj));
    /*
     * We check whether long is good enough. If not, check longlong and
     * unsigned long before falling back to `object`.
     */
    long long value = PyLong_AsLongLong(obj);
    if (error_converting(value)) {
        PyErr_Clear();
    }
    else {
        if (NPY_MIN_LONG <= value && value <= NPY_MAX_LONG) {
            return PyArray_DescrFromType(NPY_LONG);
        }
        return PyArray_DescrFromType(NPY_LONGLONG);
    }

    unsigned long long uvalue = PyLong_AsUnsignedLongLong(obj);
    if (uvalue == (unsigned long long)-1 && PyErr_Occurred()){
        PyErr_Clear();
    }
    else {
        return PyArray_DescrFromType(NPY_ULONGLONG);
    }

    return PyArray_DescrFromType(NPY_OBJECT);
}

The topic of value based conversion has been discussed as part of the new dtype work, @seberg might have more to say.

seberg · 2021-06-02T15:10:10Z

Hmm, the last one should be the important one, yeah. There are two things to note here:

1. The integer "ladder" is a bit distinct from value-based promotion/casting. The ladder is long -> long long -> unsigned long long -> object.
2. np.array([1, np.uint64(3)]) does not really use value-based promotion. It uses the "default" for the first integer. (I once had a first version that was capable of using value-based promotion; the current code is not. Right now I think that is probably for the better.)

I.e. np.array([1, np.uint64(3)]) just promotes whatever the 1 is considered based on the "ladder". I suppose that is also a form of value-based promotion, but it is distinct from typical value-based promotion.
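The ladder followed by plain promotion can be modelled in a few lines of pure Python. This is only an illustrative sketch: the function names are made up here, `long` and `long long` are collapsed into a single "int64" rung (an assumption that holds where C `long` is 64-bit), and the promotion table covers just the three outcomes discussed in this thread.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
UINT64_MAX = 2**64 - 1


def ladder_dtype(value: int) -> str:
    """Dtype name the ladder would assign to a single Python int."""
    if INT64_MIN <= value <= INT64_MAX:
        return "int64"    # signed conversion is tried first
    if 0 <= value <= UINT64_MAX:
        return "uint64"   # then unsigned
    return "object"       # nothing fits


def promote(a: str, b: str) -> str:
    """Tiny promotion table for the three names used above."""
    if a == b:
        return a
    if "object" in (a, b):
        return "object"
    return "float64"      # int64 with uint64: no common integer type


def infer(values) -> str:
    """Apply the ladder per element, then promote the results pairwise."""
    result = ladder_dtype(values[0])
    for v in values[1:]:
        result = promote(result, ladder_dtype(v))
    return result


for sample in ([UINT64_MAX], [0, UINT64_MAX],
               [INT64_MAX, UINT64_MAX], [INT64_MAX + 1, UINT64_MAX]):
    print(infer(sample))
```

Under these assumptions the model reproduces the four results reported at the top of the issue: each element gets its dtype in isolation, and only afterwards are the dtypes combined.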

Future:

My opinion is currently:

We actually attempt to get rid of value-based promotion entirely (hopefully in the next 1-2 months and then see how that goes) – There may be quite a bit of updating in pandas necessary.
We could try to remove that "integer ladder" and always go to the default integer. An error would be raised if assignment fails in that case.

jbrockmendel · 2021-06-02T17:46:24Z

We actually attempt to get rid of value-based promotion entirely (hopefully in the next 1-2 months and then see how that goes) – There may be quite a bit of updating in pandas necessary.

Do you mean you wouldn't do any inference in np.array when passed a list and no dtype? I must be misunderstanding.

We could try to remove that "integer ladder" and always go to the default integer. An error would be raised if assignment fails in that case.

Is a "best lossless" option on the table? (basically what clean_index_list described below aims for)

can you point me to the relevant part of the code?

Poor wording on my part. I was actually asking about the part of the code that iterates over a not-yet-ndarray sequence to infer a dtype as part of the constructor. This would correspond to some combination of pd._libs' lib.infer_dtype and lib.maybe_convert_objects.

The kind of pattern that I'm looking to avoid is in e.g. lib.clean_index_list, where we do:

    inferred = infer_dtype(obj, skipna=False)
    [...]
    elif inferred in ['integer']:
        # we infer an integer but it *could* be a uint64

        arr = np.asarray(obj)
        if arr.dtype.kind not in ["i", "u"]:
            # eg [0, uint64max] gets cast to float64,
            #  but then we know we have either uint64 or object
            if (arr < 0).any():
                # TODO: similar to maybe_cast_to_integer_array
                return np.asarray(obj, dtype="object"), 0

            # GH#35481
            guess = np.asarray(obj, dtype="uint64")
            return guess, 0

infer_dtype makes one pass through the array, np.asarray makes N passes (I'm guessing N is 1 or 2), and (arr < 0).any() is 2 more passes plus an allocation, ...

This seems like it should be doable in way fewer passes.
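As a rough illustration of what "fewer passes" could look like, here is a pure-Python sketch that tracks the minimum and maximum in a single loop and then picks a lossless dtype name. The function name and the empty-input default are assumptions made for this example; it is not what pandas or NumPy actually do, and a real version would presumably live in Cython or C.

```python
def infer_int_dtype_one_pass(values) -> str:
    """Pick a lossless dtype name for Python ints in one pass over the data."""
    lo = hi = None
    for v in values:
        if lo is None or v < lo:
            lo = v
        if hi is None or v > hi:
            hi = v

    if lo is None:
        return "int64"    # empty input: arbitrary default for this sketch
    if lo >= -2**63 and hi <= 2**63 - 1:
        return "int64"    # everything fits in a signed 64-bit integer
    if lo >= 0 and hi <= 2**64 - 1:
        return "uint64"   # non-negative and fits in unsigned 64-bit
    return "object"       # mixed negative/huge values, or beyond 64 bits


print(infer_int_dtype_one_pass([0, 2**64 - 1]))
print(infer_int_dtype_one_pass([-1, 2**64 - 1]))
```

The result could then be handed to the array constructor as an explicit dtype, so no float64 intermediate or `(arr < 0).any()` check is needed.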

seberg · 2021-06-02T20:22:24Z

Do you mean you wouldn't do any inference in np.array when passed a list and no dtype?

Sorry, "value-based promotion" is not currently used for np.array, it just uses "plain" promotion! The value-based part comes in mainly for np.result_type and operations, such as float32_array + 4. returning a float32 and not a float64!

EDIT: So there will be no change here, inference of course happens, the question is how smart it is. The different integers being used when integers are large may go away though.

Is a "best lossless" option on the table?

Some thoughts below, but maybe we should chat about this a bit? I think that pandas could leverage NumPy in principle, and that may well be worth the trouble. Although, I am a bit worried that it will also be a bit of a hack. On the other hand, I am not sure how well pandas could currently deal with NumPy user DType, and this might go a long way to that?

A correct "best" lossless for signed or unsigned integers seems pretty tough (but I guess you do not need that?). Also, the current NumPy implementation inside of np.array(...) is slightly limited compared to other promotion. That is, it currently only works with dtype instances/descriptors and not really directly with DType (type/class).
That means it doesn't actually support all of the value-based promotion: np.array([np.float32(3.), 3.]) cannot result in a float32. That is because this would currently be implemented as an abstract DType (for float32_arr + 3. the 3. is an abstract float DType). But even there, I do not want value-based support in any case (it can be done, but would be hackish to fully support, and it doesn't seem like anyone actually likes it anyway).
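The difference between the two code paths can be seen directly. This is a minimal sketch; `float32_arr` is just a local name chosen for the example.

```python
import numpy as np

float32_arr = np.ones(3, dtype=np.float32)

# In an operation, the Python float does not force an upcast of the array.
op_dtype = (float32_arr + 4.0).dtype
print(op_dtype)

# Inside np.array(...), the Python float is first treated as a float64,
# and float32 with float64 then promotes to float64.
ctor_dtype = np.array([np.float32(3.0), 3.0]).dtype
print(ctor_dtype)
```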

Now for pandas? Even in the above np.array([np.float32(3.), 3.]) could be hacked, if we wrote something like:

np.array([np.float32(3.), 3.], dtype=PythonFloatAsFloat16DType)

The problem is that if you only have Python floats, you get a float16 result ;). (The smaller problem is that I am not sure I have implemented enough of casting yet to do the above.)

For you, even that can't possibly be enough. You would need to track the current state in the form of a dtype instance. That would be something like an "abstract dtype instance": an instance that cannot be attached to an actual array (or, if it was, will always result in errors)!

That feels hackish but, to be honest, should work just fine. All we need to ensure is that an error is raised when it's used (we probably need that anyway), and that there is a way to convert that instance to the actual one that can be attached to the NumPy array.

In theory, that could be a method that is automatically called, i.e. dtype = dtype.as_concrete() (where as_concrete() is only called when necessary and a no-op for any normal dtype/descriptor). In practice, it doesn't have to be, since this would be hidden in the pandas internals.

We test on more architectures, so upstream's xfails are not always correct everywhere. On those known to fail: arm64 xfail -> all non-x86 xfail x86 or unconditional strict xfail -> unconditional nonstrict xfail Author: Rebecca N. Palmer <rebecca_palmer@zoho.com> Bug: pandas-dev/pandas#38921, pandas-dev/pandas#38798, pandas-dev/pandas#41740, numpy/numpy#19146 Forwarded: no Gbp-Pq: Name fix_overly_arch_specific_xfails.patch

We test on more architectures, so upstream's xfails are not always correct everywhere. On those known to fail: arm64 xfail -> all non-x86 xfail x86 or unconditional strict xfail -> unconditional nonstrict xfail pandas/tests/window/test_rolling.py also gets an i386 xfail for rounding error that may be x87 excess precision Author: Rebecca N. Palmer <rebecca_palmer@zoho.com> Bug: pandas-dev/pandas#38921, pandas-dev/pandas#38798, pandas-dev/pandas#41740, numpy/numpy#19146 Forwarded: no Gbp-Pq: Name fix_overly_arch_specific_xfails.patch

jbrockmendel mentioned this issue Jun 3, 2021

API/BUG: Series(floating, dtype=intlike) ignores dtype, DataFrame casts pandas-dev/pandas#40110

Closed

fangchenli mentioned this issue Jun 13, 2021

TST: fix xpass for M1 Mac pandas-dev/pandas#41982

Merged

seberg mentioned this issue Jun 8, 2022

BUG: NumPy auto conversion to avoid overflow sometimes didn't work #21671

Closed

adrianeboyd mentioned this issue Dec 5, 2022

DOC: Improve 1.24.0 release notes on converting out-of-bound Python integers #22733

Closed

seberg mentioned this issue Jul 2, 2024

BUG: Numpy object arrays improperly convert python-int's to python-float's. #26818

Open
