Did you know that py::buffer_info::format has a different meaning on Windows? #1908

jpivarski · 2019-09-04T01:44:47Z

On Windows (both 32-bit and 64-bit Python),

std::string f(py::array a) {
    py::buffer_info info = a.request();
    return info.format;
}

returns "l" for a numpy.int32 array and "q" for a numpy.int64 array, whereas MacOS and Linux return "i" and "l", respectively. (That is, "l" is ambiguous.)

To be safe against misinterpretations, I'm using

auto b = a.cast<py::array_t<int64_t, py::array::c_style | py::array::forcecast>>();

to convert whatever I have into int64_t (and similarly for int32_t), if necessary. (Then, choosing int32_t vs int64_t based on platform is just an optimization, not needed for correctness.)
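
The ambiguity can be checked without NumPy or pybind11. Here is a minimal sketch, assuming only Python's standard `struct` module, whose native-mode codes follow the C compiler's types. It reports how many bytes each signed integer code occupies on the running platform. The helper name `native_int_sizes` is made up for illustration.

```python
import struct


def native_int_sizes(codes="bhilq"):
    """Map each native integer format code to its size in bytes on this platform."""
    # Without a byte-order prefix, struct uses the platform's native C sizes.
    return {code: struct.calcsize(code) for code in codes}


if __name__ == "__main__":
    for code, size in native_int_sizes().items():
        print(f"{code!r}: {size} bytes")
```

Since C's `long` is 4 bytes on Windows and 32-bit Linux but 8 bytes on 64-bit Linux and macOS, the entry for `'l'` is the one expected to change between those platforms.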

jpivarski · 2020-02-07T16:47:50Z

The same is true on a 32-bit Docker image (i386/ubuntu):

numpy.int32 → format == "l"
numpy.int64 → format == "q"

whereas on 64-bit Linux:

numpy.int32 → format == "i"
numpy.int64 → format == "l"

Windows, regardless of whether it's 32-bit or 64-bit, takes the same meanings as 32-bit Linux.

jpivarski · 2020-02-07T17:06:37Z

Idioms like the following are successful on 32-bit Linux, 64-bit Linux, 32-bit Windows, and 64-bit Windows (and 64-bit MacOS; haven't tested 32-bit MacOS). It's possible that the __i386__ macro is only defined on gcc, so that's another thing I'll have to consider.

    if (format_.compare("d") == 0) {
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::float64);
    }
    else if (format_.compare("f") == 0) {
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::float32);
    }
#if defined _MSC_VER || defined __i386__
    else if (format_.compare("q") == 0) {
#else
    else if (format_.compare("l") == 0) {
#endif
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::int64);
    }
#if defined _MSC_VER || defined __i386__
    else if (format_.compare("Q") == 0) {
#else
    else if (format_.compare("L") == 0) {
#endif
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::uint64);
    }
#if defined _MSC_VER || defined __i386__
    else if (format_.compare("l") == 0) {
#else
    else if (format_.compare("i") == 0) {
#endif
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::int32);
    }
#if defined _MSC_VER || defined __i386__
    else if (format_.compare("L") == 0) {
#else
    else if (format_.compare("I") == 0) {
#endif
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::uint32);
    }
    else if (format_.compare("h") == 0) {
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::int16);
    }
    else if (format_.compare("H") == 0) {
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::uint16);
    }
    else if (format_.compare("b") == 0) {
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::int8);
    }
    else if (format_.compare("B") == 0  ||  format_.compare("c") == 0) {
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::uint8);
    }
    else if (format_.compare("?") == 0) {
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::boolean);
    }
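
The preprocessor branches above encode which type code has which width on each platform. Below is a rough Python sketch of the same decision made at run time rather than through compiler macros. It assumes, from the results reported in this thread rather than from NumPy's documentation, that the fixed-width names prefer `long`, then `long long`, then `int`, `short`, and `char`. The name `signed_code_for_width` and the preference string are illustrative assumptions.

```python
import struct

# Assumed preference order, inferred from the observations in this thread:
# long, long long, int, short, char.
_PREFERENCE = "lqihb"


def signed_code_for_width(nbytes):
    """Return the first signed native format code that is `nbytes` wide, or None."""
    for code in _PREFERENCE:
        if struct.calcsize(code) == nbytes:
            return code
    return None
```

With that ordering, `signed_code_for_width(8)` gives `'l'` where `long` is 8 bytes and `'q'` elsewhere, which corresponds to the two branches of each `#if`.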

YannickJadoul · 2020-07-11T22:33:14Z

I believe this is a combination of C/C++ not defining the sizes of integer types and the way numpy handles this.

Is there any particular issue when using this from pybind11, that doesn't match C/C++/numpy? Or do you suggest we add something to the docs somewhere, or ...?

jpivarski · 2020-07-13T18:24:17Z

Ultimately, all I need is a work-around, so if I'm given a buffer_info, I know how to cast it using stdint types. Above, I gave my work-around in the hope that it would help others, but recently, we've been running into more format-related bugs that we'll cross-reference to this issue. At the moment, we don't know what's going wrong in the current work-around, so I don't have a specific request yet.

If this does get figured out, documentation is probably all that's needed, maybe a small section at the end of this page, since that's where we learn about the buffer_info interface and how to work with NumPy arrays. Maybe the best thing would be to have a listing of all the possible values format can take, with how each should be cast? (I just ran into 'e' for float16, as a result of taking the square root of an int8 in NumPy. It would be better not to discover each case, one by one, in production code.)

YannickJadoul · 2020-07-14T14:06:29Z

This does seem to mainly be a numpy feature/issue, though, where the meaning of i, l, q is int, long, long long, the size of these being platform-dependent, and np.int32 and np.int64 are aliases:

Testing out with 64-bit manylinux2010 Docker image:

$ docker run -it --rm quay.io/pypa/manylinux2010_x86_64 bash
[root@9962fd909c16 /]# /opt/python/cp38-cp38/bin/pip install numpy
Collecting numpy
  Downloading numpy-1.19.0-cp38-cp38-manylinux2010_x86_64.whl (14.6 MB)
     |████████████████████████████████| 14.6 MB 97 kB/s 
Installing collected packages: numpy
Successfully installed numpy-1.19.0
[root@9962fd909c16 /]# /opt/python/cp38-cp38/bin/python
Python 3.8.3 (default, Jun 16 2020, 21:10:27) 
[GCC 8.3.1 20190311 (Red Hat 8.3.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.dtype(np.int32).char
'i'
>>> np.dtype(np.int64).char
'l'
>>>

32-bit:

$ docker run -it --rm quay.io/pypa/manylinux2010_i686 bash
[root@795a929a35e8 /]# /opt/python/cp38-cp38/bin/pip install numpy
Collecting numpy
  Downloading numpy-1.19.0-cp38-cp38-manylinux2010_i686.whl (12.3 MB)
     |████████████████████████████████| 12.3 MB 513 kB/s 
Installing collected packages: numpy
Successfully installed numpy-1.19.0
[root@795a929a35e8 /]# /opt/python/cp38-cp38/bin/python
Python 3.8.3 (default, Jul  4 2020, 10:49:13) 
[GCC 8.3.1 20190311 (Red Hat 8.3.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.dtype(np.int32).char
'l'
>>> np.dtype(np.int64).char
'q'
>>>

See e.g., here: https://numpy.org/doc/stable/reference/arrays.scalars.html#built-in-scalar-types
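
The same naming by C type shows up in Python's own buffer protocol, independent of NumPy. Here is a small sketch, assuming only the standard `array` and `struct` modules: the format string of a buffer names the C type, and the item size has to be read separately. `describe_buffer` is a hypothetical helper.

```python
import array
import struct


def describe_buffer(typecode):
    """Return (format, itemsize) as exposed through the buffer protocol."""
    # array.array stores its items as the C type named by `typecode`.
    view = memoryview(array.array(typecode, [1, 2, 3]))
    return view.format, view.itemsize
```

`describe_buffer("l")` reports the format `l` everywhere, while the item size it returns follows the platform's `long`.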

I guess we could refer to these numpy docs, in some warning, when talking about dtypes in pybind11? If so, do you have time to make a quick PR on that?

jpivarski · 2020-07-14T19:06:26Z

> I guess we could refer to these numpy docs, in some warning, when talking about dtypes in pybind11? If so, do you have time to make a quick PR on that?

I could. In the meantime, I'll use this space to try figuring things out, out in the open.

To try to get a handle on all the possible numeric types NumPy can handle, I scanned all objects attached to the numpy module, looking for array types, like this:

>>> [x for x in dir(numpy)
...  if isinstance(getattr(numpy, x), type) and issubclass(getattr(numpy, x), numpy.generic)]

then excluded any types that are not a leaf in the class hierarchy with:

>>> [x for x in dir(numpy)
...  if isinstance(getattr(numpy, x), type) and issubclass(getattr(numpy, x), numpy.generic)
...  and not getattr(numpy, x).__subclasses__()]

Accounting for multiple names for the same type objects, nothing on this list has been left out of the above. Creating an array of each of these types, passing it into buffer_info, this is what I get for the format strings:

	Linux 32-bit	Linux 64-bit	MacOS 64-bit	Windows 32-bit	Windows 64-bit	diff?
`bool_`	`?`	`?`	`?`	`?`	`?`
`int8`	`b`	`b`	`b`	`b`	`b`
`int16`	`h`	`h`	`h`	`h`	`h`
`int32`	`l`	`i`	`i`	`l`	`l`	yes
`int64`	`q`	`l`	`l`	`q`	`q`	yes
`uint8`	`B`	`B`	`B`	`B`	`B`
`uint16`	`H`	`H`	`H`	`H`	`H`
`uint32`	`L`	`I`	`I`	`L`	`L`	yes
`uint64`	`Q`	`L`	`L`	`Q`	`Q`	yes
`intc`	`i`	`i`	`i`	`i`	`i`
`uintc`	`I`	`I`	`I`	`I`	`I`
`longlong`	`q`	`q`	`q`	`q`	`q`
`ulonglong`	`Q`	`Q`	`Q`	`Q`	`Q`
`float16`	`e`	`e`	`e`	`e`	`e`
`float32`	`f`	`f`	`f`	`f`	`f`
`float64`	`d`	`d`	`d`	`d`	`d`
`float128`	N/A	`g`	`g`	N/A	N/A	yes
`complex64`	`Zf`	`Zf`	`Zf`	`Zf`	`Zf`
`complex128`	`Zd`	`Zd`	`Zd`	`Zd`	`Zd`
`complex256`	N/A	`Zg`	`Zg`	N/A	N/A	yes
`datetime64`	`M`	`M`	`M`	`M`	`M`
`timedelta64`	`m`	`m`	`m`	`m`	`m`
`bytes_`	`3s`	`3s`	`3s`	`3s`	`3s`
`str_`	`3w`	`3w`	`3w`	`3w`	`3w`
`record`	`T{...}`	`T{...}`	`T{...}`	`T{...}`	`T{...}`
`object_`	`O`	`O`	`O`	`O`	`O`

On all systems where I could test it (Linux 64-bit, MacOS 64-bit, Windows 32-bit, and Windows 64-bit), Python 2.7 differed from Python 3.6+ only in that bytes_ and str_ are not distinguished (both are 3s). Numeric types are unaffected.

datetime64 and timedelta64 (M and m) both raised "ValueError: cannot include dtype 'M' in a buffer"; I'm using pybind 2.4.3.

A bug that we recently encountered involves a distinction between int64 and longlong. On 64-bit Linux, numpy.int64 is identically equal to numpy.intp, but numpy.longlong is a distinct object. I wonder why.

>>> numpy.int32, numpy.int64, numpy.intp, numpy.longlong
(<class 'numpy.int32'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.longlong'>)

bstaletic · 2020-07-14T19:13:00Z

> On 64-bit Linux, numpy.int64 is identically equal to numpy.intp, but numpy.longlong is a distinct object. I wonder why.

I guess because intptr_t and int64_t are typedefs of long and long long is a distinct type.

YannickJadoul · 2020-07-15T15:39:51Z

> I could. In the meantime, I'll use this space to try figuring things out, out in the open.

Thanks! And yes, please do; no problem at all to aggregate everything together in one place!

> Creating an array of each of these types, passing it into buffer_info, this is what I get for the format strings:

Nice overview! So the main thing to note - I believe - is that intc, uintc, longlong, ulonglong, etc (did you somehow miss long and ulong?) are consistent across platforms (even if they have different physical sizes). So numpy defines the format strings based on the C types, and not based on the actual size of each integer type. We can debate on whether that's a good idea or not (very likely depends on the actual case), but there's not a lot we can do about it.

Note that you can compare dtype objects, btw:

>>> np.dtype(np.intc) == np.dtype(np.long)
False
>>> np.dtype(np.long) == np.dtype(np.longlong)
True

By some weird coincidence, I once made a PR on numpy about these dtype docstrings (during/after the sprints at EuroSciPy), and while I don't remember all the details, even writing docs for this was quite hairy given that you have these aliases. Maybe this is useful to look at: numpy/numpy#11858?

> datetime64 and timedelta64 (M and m) both raised "ValueError: cannot include dtype 'M' in a buffer"; I'm using pybind 2.4.3.

If I'm not mistaken, there are some issues/PRs open on this, that we haven't gotten to yet.

bstaletic · 2020-07-15T16:10:45Z

> I believe - is that intc, uintc, longlong, ulonglong, etc (did you somehow miss long and ulong?) are consistent across platforms (even if they have different physical sizes).

I think your conclusion is wrong here.

https://en.cppreference.com/w/cpp/language/types

Unix-like OS's always have int == 32bits, as does Win64. Same thing for long long - always 64 bits. That's why the format is consistent.

YannickJadoul · 2020-07-15T16:31:08Z

> Unix-like OS's always have int == 32bits, as does Win64. Same thing for long long - always 64 bits. That's why the format is consistent.

Huh, interesting. But I don't think it changes anything. The thing is just that i is defined to be the format character for intc, according to https://numpy.org/doc/stable/reference/arrays.scalars.html#built-in-scalar-types, while the other, sized types (int32, etc) are just aliases to a type with that amount of bits.

EricCousineau-TRI · 2020-07-17T19:13:35Z

@jpivarski I believe your table effectively arises from the (rather complex) set of aliasing rules from this header file:
https://github.com/numpy/numpy/blob/6ef5ec39cdfaf77aa4600ec2e3bf9f679a4fd527/numpy/core/include/numpy/npy_common.h

In #1329, I thought I had aligned this with how it was defined in the latest version of NumPy (at least in time of writing); however, it does seem like that assumption may have been rather brittle, as noted by this issue.

@YannickJadoul pointed it out in this review comment:
#1329 (review)

I'm just wondering if there's a robust way to test this out in CI...

jpivarski · 2020-07-17T19:27:52Z

For workarounds, I've pivoted from trying to interpret each format as a distinct type to using format only for what NumPy calls "kind" and getting the size from itemsize: https://github.com/scikit-hep/awkward-1.0/blob/dd8cd018373efe44740adaf5148a1bb9dafb3124/src/libawkward/util.cpp#L126-L221

For instance,

      else if (fmt == std::string("b")  ||
               fmt == std::string("h")  ||
               fmt == std::string("i")  ||
               fmt == std::string("l")  ||
               fmt == std::string("q")) {
        if (itemsize == 1) {
          return dtype::int8;
        }
        else if (itemsize == 2) {
          return dtype::int16;
        }
        else if (itemsize == 4) {
          return dtype::int32;
        }
        else if (itemsize == 8) {
          return dtype::int64;
        }

(Sometimes, there's also an endianness character, >, <, =, even if the dtype happens to be native endian, so we have to skip that, too.)

As you can see, I've mapped it to a platform-independent enum and I just use the enum subsequently (but keep the format string around, in case a structured array or array of strings has to be passed through but not interpreted).
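
The same idea in a compact Python sketch, for illustration only: the format character decides the kind, and the item size decides the width. The name `classify` and the returned strings are assumptions of this sketch, not the enum used in the linked awkward-1.0 code.

```python
_SIGNED = set("bhilq")
_UNSIGNED = set("BHILQ")
_FLOAT = set("efdg")
_BYTE_ORDER = "<>=@!|"


def classify(fmt, itemsize):
    """Map a single-character buffer format plus item size to a fixed-width name.

    Returns None for formats this sketch does not interpret
    (structured records, strings, objects, and so on).
    """
    # Drop an optional leading byte-order character such as '<', '>' or '='.
    if fmt and fmt[0] in _BYTE_ORDER:
        fmt = fmt[1:]
    if fmt == "?":
        return "bool"
    if fmt in _SIGNED:
        kind = "int"
    elif fmt in _UNSIGNED:
        kind = "uint"
    elif fmt in _FLOAT:
        kind = "float"
    else:
        return None
    # The width comes from the item size, never from the character itself.
    return f"{kind}{8 * itemsize}"
```

Because the width is taken from `itemsize`, `classify("l", 4)` and `classify("l", 8)` give different answers, which is what makes the result independent of the platform's meaning of `l`.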

That's what #1329 is about, right? If pybind provides a platform-independent enum, then there's nothing to document—we should just use that enum, right?

YannickJadoul · 2020-07-18T21:15:32Z

> For instance,
>
>       else if (fmt == std::string("b")  ||
>                fmt == std::string("h")  ||
>                fmt == std::string("i")  ||
>                fmt == std::string("l")  ||
>                fmt == std::string("q")) {
>         if (itemsize == 1) {
>           return dtype::int8;
>         }
>         else if (itemsize == 2) {
>           return dtype::int16;
>         }
>         else if (itemsize == 4) {
>           return dtype::int32;
>         }
>         else if (itemsize == 8) {
>           return dtype::int64;
>         }

I believe this makes sense, indeed.

Maybe there ought to be a better way to handle/expose this in pybind11, though. I suppose this mapping should be known at compile time?

> If pybind provides a platform-independent enum, then there's nothing to document; we should just use that enum, right?

pybind11 doesn't use numpy's headers though. (I suppose since we don't want to depend on them? I don't know, actually.) Instead we access some hidden Python capsule in numpy that contains function pointers to the API. So this might be easier said than done.

bstaletic · 2020-07-21T09:08:32Z

From what I can tell, according to https://numpy.org/doc/stable/reference/arrays.scalars.html#built-in-scalar-types, there's a few things pybind could improve.

- Use NPY_SHORT_ instead of NPY_INT16, so as not to assume the sizes of short, int and others.
- We're missing NPY_LONG and NPY_ULONG, which are specified to match a python int, as opposed to long.
- We're missing NPY_INTP and NPY_UINTP, corresponding to intptr_t and uintptr_t.
- A half precision float is missing, though that's missing from the C standard as well. However, there's no complex counterpart to the scalar f16.
- Finally, py::object is missing its format descriptor, but I'm not sure if we want to support this one, since everything is predicated on std::is_arithmetic.

> pybind11 doesn't use numpy's headers though. (I suppose since we don't want to depend on them? I don't know, actually.)

On my distro, numpy headers are in site-packages/numpy/core/include...

> Instead we access some hidden Python capsule in numpy that contains function pointers to the API.

It's the numpy.core.multiarray._ARRAY_API capsule.

wjakob · 2020-07-21T09:11:49Z

The replication of NumPy header contents was a conscious decision to break a problematic dependency. Otherwise every project that supports a function call with a py::array somewhere will require NumPy to be installed, header file search paths to be set up correctly, etc.

YannickJadoul · 2020-07-21T13:59:53Z

> there's a few things pybind could improve.

Agree with all of these.
py::object would be nice, though, but that might just be a different improvement PR, rather than the one fixing the others.

> Otherwise every project that supports a function call with a py::array somewhere will require NumPy to be installed, header file search paths to be set up correctly, etc.

Only for compilation, right; and if you're using numpy.h, you probably have numpy installed as a developer? And not if you download a wheel as a user? But yes, this works, so why not keep it.

henryiii · 2020-07-31T01:10:40Z

By the way, the decision to avoid a NumPy dependency does make it easier to build - that's a difference between Cython wheel builds and PyBind11. With Cython, you have to have minimum NumPy dependencies, something like this:

<something or other for 2.7 and 3.5, I don't remember>
numpy==1.12.1; python_version=="3.6"
numpy==1.14.5; python_version=="3.7"
numpy==1.17.3; python_version>="3.8"

When you build a wheel, you have to have the oldest version of NumPy supported, since a wheel built with a newer NumPy can't be used by an older NumPy, and older NumPy's don't have wheels/support for newer Pythons. Now modern tooling, like PEP 518's pyproject.toml, and FindPython natively supporting NumPy in CMake 3.14, and cibuildwheel, makes this much easier than it used to be, but it's still not as simple and elegant as PyBind11's not requiring it at all.

The Python integer format character is ambiguous and platform-dependent. PyBind11 `format_descriptor<...>::format()` always returns "q" and "Q" for 64-bit integers, independent of the platform. Compatible Python buffers that are passed in, on the other hand, might instead have the equivalent format "l" or "L" set. See pybind/pybind11#1806 and pybind/pybind11#1908 for details. This fix introduces a special case for integer format comparisons that checks only size and signedness.
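
A comparison of that kind can be sketched in Python as follows. This is an illustration under stated assumptions, not the fix referred to above: it uses the standard `struct` module for native sizes and treats lowercase codes as signed and uppercase codes as unsigned. The name `int_formats_equivalent` is hypothetical.

```python
import struct

# A set, so that multi-character or empty strings never match by substring.
_INT_CODES = frozenset("bhilqBHILQ")


def int_formats_equivalent(a, b):
    """True if two native integer format codes have the same size and signedness."""
    if a not in _INT_CODES or b not in _INT_CODES:
        return False
    same_size = struct.calcsize(a) == struct.calcsize(b)
    # Lowercase codes are signed, uppercase codes are unsigned.
    same_sign = a.islower() == b.islower()
    return same_size and same_sign
```

On a platform where `long` is 8 bytes, `"l"` and `"q"` compare as equivalent; where `long` is 4 bytes, `"l"` pairs with `"i"` instead.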

rwgk · 2023-05-19T18:22:16Z

FWIW, I had a run-in with this issue, too, but then found that there is already a solution, which is made more visible & accessible under #4674 (py::buffer_info::item_type_is_equivalent_to<T>).

In Python, implementing type resolution across Windows and Linux/macOS is tricky, as the same `l` specifier can map to both `int32` and `int64`. Moreover, when expanding to 8-byte integers, the `u8` specifier can mean both an 8-bit and an 8-byte integer, causing confusion. The same issue is discussed in detail in the PyBind11 repo: pybind/pybind11#1908

jpivarski mentioned this issue Jul 13, 2020: Numpy dtype of Index64 scikit-hep/awkward#335 (Closed)

jpivarski mentioned this issue Jul 14, 2020: Try to fully resolve the NumPy format string issues. scikit-hep/awkward#337 (Merged, 14 tasks)

YannickJadoul mentioned this issue Jul 17, 2020: Wrong buffer format returned for similar numpy arrays passed to pybind11 function #1806 (Closed)

EricCousineau-TRI mentioned this issue Jul 17, 2020: numpy: Provide concrete size aliases, test equivalence for dtype(...).num #1329 (Merged, 1 task)

hawkinsp mentioned this issue Nov 16, 2020: np.cint/np.int32 type confusion jax-ml/jax#4903 (Closed)

anntzer mentioned this issue Dec 8, 2020: pybind11/numpy.h does not require numpy at build time. #2720 (Merged)

fthaler mentioned this issue Apr 9, 2021: Fix for Integer Format Check in Python SID Adapter GridTools/gridtools#1632 (Merged)

lostmsu mentioned this issue Aug 25, 2021: Implements buffer interface for .NET arrays of primitive types pythonnet/pythonnet#1511 (Merged, 3 tasks)

nilsnolde mentioned this issue Jan 2, 2022: build windows with conan support VROOM-Project/pyvroom#9 (Merged)

rwgk mentioned this issue Feb 10, 2023: FWD pybind11 google/pybind11clif#1908 (Closed)

ScottTodd mentioned this issue Mar 31, 2023: Arrays do not match in python/vm_types_test:test_variant_list_buffers on Windows iree-org/iree#11080 (Closed)

rwgk mentioned this issue May 18, 2023: Add format_descriptor<> & npy_format_descriptor<> PyObject * specializations. #4674 (Merged)

a-sevin mentioned this issue Aug 4, 2023: ISIOWrap broken for int64/uint64 (on some architectures?) milk-org/ImageStreamIO#51 (Closed)

dastrobu mentioned this issue Jan 2, 2024: [BUG]: pybind11::format_descriptor<float16_t>::format() throws RuntimeError: NumPy type info missing for Dh but should return e #4998 (Open, 3 tasks)

SioulisChris mentioned this issue Apr 15, 2024: MSBuild errors artem-ogre/PythonCDT#1 (Closed)
