
Load arrow tables into Perspective without converting to a binary buffer #1157

Open
sc1f opened this issue Aug 19, 2020 · 24 comments
Labels
C++ enhancement Feature requests or improvements JS Python

Comments

@sc1f
Contributor

sc1f commented Aug 19, 2020

Feature Request

Description of Problem:

Perspective expects Arrow data to be loaded as an ArrayBuffer in JavaScript and as a binary string in Python. In PyArrow, at least, this requires a few lines to convert an Arrow Table into binary:

import pyarrow as pa
import perspective

stream = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(stream, arrow_table.schema)
writer.write_table(arrow_table)
writer.close()
perspective_table = perspective.Table(stream.getvalue().to_pybytes())

When loading Arrow Tables, I expect Perspective to accept an Arrow Table directly, without any conversion. The requirement that an arrow be passed as bytes is only briefly outlined in the Python user guide, and the conversion process from an Arrow Table to bytes is unclear.

Potential Solutions:

  • Write an Arrow Table-to-binary conversion layer solely in the binding layer (using PyArrow or Arrow's TypeScript library). This would be simple but incomplete (it would need to be reimplemented for each future binding language) and less performant.
  • If there is a way to transfer a pointer to an Arrow table from the binding layer into C++, we should be able to write the conversion entirely in C++. We already malloc and memcpy the arrow binary from JS/Python into C++, so there might be something already there worth looking into.

@sc1f sc1f added C++ enhancement Feature requests or improvements JS Python labels Aug 19, 2020
@timkpaine
Member

There is a way to transfer the pointer; I discussed this with @wesm a while back, and it was eventually implemented to optimize transferring arrow tables between R and Python. Let me look up the details of that thread.

@timkpaine
Member

timkpaine commented Aug 19, 2020

On Sep 10, 2019, at 19:35, Wes McKinney wesmckinn@gmail.com wrote:

Hi Tim,

I see what you're saying now, sorry that I didn't understand sooner.

We actually need this feature to be able to pass instances of
shared_ptr (under very controlled conditions) into R using
reticulate, where T is any of

  • Array
  • ChunkedArray
  • DataType
  • RecordBatch
  • Table
  • and some other classes

I would suggest introducing a property on pyarrow Python objects that
returns the memory address of the wrapped shared_ptr (i.e. the
integer leading to shared_ptr*). Then you can create your copy of
that. Would that work? The only reason this is not implemented is that
no one has needed it yet, mechanically it does not strike me as that
complex.

See https://issues.apache.org/jira/browse/ARROW-3750. My comment in
November 2018 "Methods would need to be added to the Cython extension
types to give the memory address of the smart pointer object they
contain". I agree with my younger self. Are you up to submit a PR?

  • Wes

On Tue, Sep 10, 2019 at 6:31 PM Tim Paine t.paine154@gmail.com wrote:

The end goal is to go direct from pyarrow to wasm without intermediate transforms. I can definitely make it work as is, we'll just have to be careful that the code we compile to webassembly matches exactly either our local copy of arrow if the user hasn't installed pyarrow, otherwise their installed copy.

On Sep 10, 2019, at 19:12, Tim Paine t.paine154@gmail.com wrote:

We're building webassembly, so we obviously don't want to introduce a pyarrow dependency. I don't want to do any pyarrow manipulations in c++, just get the c++ table. I was hoping pyarrow might expose a raw pointer or have something castable.

It seems to be a big limitation, there is no way of communicating a pyarrow table to a c++ library that uses arrow without that library linking against pyarrow.

Tim

On Sep 10, 2019, at 17:44, Wes McKinney wesmckinn@gmail.com wrote:

The Python extension types are defined in Cython, not C or C++ so you need
to load the Cython extensions in order to instantiate the classes.

Why do you have 2 copies of the C++ library? That seems easy to fix. If you
are using wheels from PyPI I would recommend that you switch to conda or
your own wheels without the C++ libraries bundled.

On Tue, Sep 10, 2019, 4:23 PM Tim Paine t.paine154@gmail.com wrote:

Is there no way to do it without PyArrow? My C++ library is building arrow
itself, which means if I use PyArow I’ll end up having 2 copies: one from
my local C++ only build, and one from PyArrow.

On Sep 10, 2019, at 5:18 PM, Wes McKinney wesmckinn@gmail.com wrote:

hi Tim,

You can use the functions in

https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/pyarrow.h

You need to call "import_pyarrow()" from C++ before these APIs can be
used. It's similar to the NumPy C API in that regard

  • Wes

On Tue, Sep 10, 2019 at 4:13 PM Tim Paine t.paine154@gmail.com wrote:

Hey all, following up on a question I asked on Stack Overflow:
https://stackoverflow.com/questions/57863751/how-to-convert-pyarrow-table-to-arrow-table-when-interfacing-between-pyarrow-in

It seems there is some code in PyArrow’s C++ (https://arrow.apache.org/docs/python/extending.html#_CPPv412unwrap_tableP8PyObjectPNSt10shared_ptrI5TableEE) to convert from a PyArrow table to an Arrow table. The problem with this is that my C++ library (https://github.com/finos/perspective) is going to build and link against Arrow on the C++ side rather than the PyArrow side (because it will also be consumed in WebAssembly), so I want to avoid also linking against PyArrow’s copy of the arrow library. I also need to look for PyArrow’s header files, which might conflict with the version in the local C++ code.

My solution right now is to just assert that PyArrow version == Arrow
version and do some pruning (so I link against local libarrow and PyArrow’s
libarrow_python rather than use PyArrow’s libarrow), but ideally it would
be great if there was a clean way to hand a PyArrow Table over to C++
without requiring the C++ to have PyArrow (e.g. using only a PyObject *).
Please forgive my ignorance/google skills if its already possible!

unwrap_table code:

https://github.com/apache/arrow/blob/c39e3508f93ea41410c2ae17783054d05592dc0e/python/pyarrow/public-api.pxi#L310

library pruning:

https://github.com/finos/perspective/blob/python_arrow/cmake/modules/FindPyArrow.cmake#L53

Tim

@timkpaine
Member

unwrap_array was the original way, but we weren't interested at the time because it required linking against libarrow_python. Since we do that anyway, it should be OK:
https://arrow.apache.org/docs/python/extending.html#_CPPv412unwrap_arrayP8PyObjectPNSt10shared_ptrI5ArrayEE

@sc1f
Contributor Author

sc1f commented Aug 19, 2020

@timkpaine is there a way to do the same in WASM? I'm personally OK with only being able to load pyarrow Tables in our Python binding, but for the sake of completeness it would be helpful in WASM as well.

@wesm

wesm commented Aug 19, 2020

You should be able to use the C interface now. cc @pitrou @nealrichardson

@timkpaine
Member

timkpaine commented Jul 29, 2023

Ok I think this is an example now:
https://github.com/timkpaine/arrow-cpp-python-nocopy

It does both parts in one, but basically the README explains it:

This utilizes Apache Arrow's ABI-stable C interface, which would, for example, allow your combined C++/python extension to compile and link against one version of arrow and be utilized with pyarrow which would otherwise be ABI-incompatible.

Does this seem like the right thing to be doing @wesm @pitrou @nealrichardson @westonpace? AFAIK you need to force your Python extension to interact with your C++ library via the arrow C ABI, otherwise your extension might pick up a libarrow from pyarrow that is C++ ABI-incompatible with the libarrow your C++ code linked against?

@pitrou

pitrou commented Aug 8, 2023

Does this seem like the right thing to be doing @wesm @pitrou @nealrichardson @westonpace?

Yes, it does!

AFAIK you need to force your python extension to interact with your C++ library via the arrow C ABI, otherwise your extension might pickup a libarrow from pyarrow that is C++ ABI incompatible with the libarrow your C++ code linked against?

You are right indeed. This is how we provide zero-copy data sharing between R and Python.

Also @jorisvandenbossche FYI

@davenza

davenza commented Oct 7, 2023

@timkpaine (@wesm @pitrou too) I checked your solution in https://github.com/timkpaine/arrow-cpp-python-nocopy, but it is a bit confusing for me and I believe there is something wrong. I would be grateful if you could give me some advice.

I have Arrow C++ 13 installed on my system. I can compile your example with Arrow C++ 13, and then I install it along with pyarrow 13 in a virtualenv. This works fine.

Then, I force the installation of pyarrow 12 in the virtualenv (keeping your example compiled with Arrow C++ 13). In this case, there is a MemoryError: std::bad_alloc in test_create_schema_in_python and a segmentation fault in test_create_schema_in_cpp.

I believe this is what happens:

  • When arrow_python_nocopy is imported in Python, then libarrow.so.1300 is loaded (the library is compiled linking to that version).
  • Then, when pyarrow is imported (import pyarrow as pa), libarrow.so.1200 is imported (from the site-packages/pyarrow).
  • Now, we have two different versions of libarrow.so loaded and this leads to a segmentation fault.

I may be misunderstanding something, but how does your example solve this problem using the C ABI?

In fact, you keep using unwrap_array() from libarrow_python to do the C++ <-> Python conversions (I don't see the C ABI being used in the pybind11 caster).

@timkpaine
Member

When arrow_python_nocopy is imported in Python, then libarrow.so.1300 is loaded (the library is compiled linking to that version).

The user library should pull it from pyarrow instead of the system. That repo is me testing stuff; it might have more or less content than is necessary to do this properly.

@davenza

davenza commented Oct 7, 2023

Ok, so the library will always be compiled with the same version as pyarrow.

I probably misunderstood it, because I thought that the C ABI could be used to make different versions of Arrow work together (one version for the library and another for pyarrow).

Anyway, thank you @timkpaine .

@pitrou

pitrou commented Oct 7, 2023 via email

@davenza

davenza commented Oct 7, 2023

For future reference for other readers who have stumbled upon the same problem, I have been thinking about how we could design an Arrow C++ <-> Python library that is decoupled from the installed pyarrow version, and now I think it is not possible.

Even if we solve the name clashes caused by the different versions of libarrow.so (mangling the symbols with the version, for example), the big problem here is that the ABI is not stable, so the library should be recompiled when the pyarrow version changes.

For example, I could compile a library that connects Arrow version X to pyarrow version Y using the C ABI. However, this compiled library would not work if I change the pyarrow version to Z (until I recompile and link to the Z version). This makes it really hard to decouple the library from the pyarrow version.

@pitrou

pitrou commented Oct 7, 2023

the big problem here is that the ABI is not stable,

There is a misunderstanding here. The Arrow C++ ABI is not stable. The Arrow C Data Interface, however, is a stable ABI by design. This is even spelled out in the spec for it:

Goals

  • Expose an ABI-stable interface.

https://arrow.apache.org/docs/format/CDataInterface.html#goals
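The layouts that stability guarantee covers are small enough to write down; here is a sketch mirroring the spec's two struct declarations with Python's ctypes (field names and order are from the spec; the `release` callback slots are shown as bare pointers for brevity):

```python
import ctypes

# ABI-stable struct layouts from the Arrow C Data Interface spec,
# mirrored with ctypes. Because the C ABI is frozen by design, these
# layouts are identical for every libarrow version (and for non-Arrow
# producers and consumers).
class ArrowSchema(ctypes.Structure):
    pass

class ArrowArray(ctypes.Structure):
    pass

ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.c_void_p),   # void (*release)(struct ArrowSchema*)
    ("private_data", ctypes.c_void_p),
]

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.c_void_p),   # void (*release)(struct ArrowArray*)
    ("private_data", ctypes.c_void_p),
]

# A producer fills these structs; "l" is the spec's format string for int64.
schema = ArrowSchema()
schema.format = b"l"
```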

@davenza

davenza commented Oct 7, 2023

Yes, I know that the C ABI is stable, but you need the C++ and Cython ABIs to pass the data between C++ and Python.

As I see it, this is what you need to transform a C++ Arrow array of version X to a pyarrow.Array of version Y (back and forth):

              std::shared_ptr<arrow::Array>           (version X, C++ array, libarrow.so.X)
                           ^
                           |
    ImportArray()          |          ExportArray()   (you need these functions from libarrow.so.X)
                           |
                           v
              struct ArrowArray*                      (C ABI; this is compatible between libarrow.so.X and libarrow.so.Y)
                           ^
                           |
    ExportArray()          |          ImportArray()   (you need these functions from libarrow.so.Y)
                           |
                           v
              std::shared_ptr<arrow::Array>           (version Y, pyarrow version, libarrow.so.Y)
                           ^
                           |
    unwrap_array()         |          wrap_array()    (you need these functions from libarrow_python.so.Y)
                           |
                           v
              PyObject*                               (version Y, this is a pyarrow.Array in Python)

I hope the diagram is understandable. The functions on the left transform the data from Python to C++, and the functions on the right do the opposite. As you can see, you need to link two versions of libarrow.so and one version of libarrow_python.so.

If the C++ and Cython ABIs were stable, I could statically link the Arrow C++ version X (using some type of mangling for the version), and then I could use my library with any pyarrow version. Since the C++ and Cython ABIs are not stable, I have to recompile each time I change the pyarrow version to link to the new version.

Of course, I understand the difficulties of creating a stable ABI for C++ and Cython, so I think this is not very realistic (especially in the short-term). However, I am concerned about how Arrow can create a rich ecosystem if two libraries that depend on two different versions of pyarrow cannot coexist in the same virtualenv.

Is there anything obvious that I am not seeing @pitrou? At this point I do not see a winning solution for the libraries that need C++ <-> Python interoperability in the same process.

@timkpaine
Member

Your C++ library statically links libarrow; you ship a separate Python library which dynamically links against libarrow and libarrow_python but does not vendor them. pyarrow loads its libarrow and libarrow_python, then your Python library uses those to create the C API data structures, which it hands to your C++ library. My repo doesn't do all this yet.
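The handoff described above can be sketched in miniature with ctypes: a toy "shim" fills an ArrowArray struct for an int64 column, and a toy "core" reads only the C ABI struct. The struct definition and field names are from the C Data Interface spec; everything else here is illustrative:

```python
import ctypes

# Minimal ArrowArray layout from the Arrow C Data Interface spec --
# just enough to sketch the handoff; a real build would share one
# definition with the C++ core's header.
class ArrowArray(ctypes.Structure):
    pass

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.c_void_p),
    ("private_data", ctypes.c_void_p),
]

# "Python shim" side: fill the struct for an int64 array [10, 20, 30].
values = (ctypes.c_int64 * 3)(10, 20, 30)
buffers = (ctypes.c_void_p * 2)(None, ctypes.cast(values, ctypes.c_void_p))
arr = ArrowArray(length=3, null_count=0, n_buffers=2,
                 buffers=ctypes.cast(buffers, ctypes.POINTER(ctypes.c_void_p)))

# "C++ core" side: it only ever touches the C ABI struct, never a C++
# arrow::Array, so the two sides need not agree on any C++ ABI.
data = ctypes.cast(ctypes.c_void_p(arr.buffers[1]),
                   ctypes.POINTER(ctypes.c_int64))
total = sum(data[i] for i in range(arr.length))
```

In a real producer the `release` slot would point at a callback that frees the buffers; the consumer must call it exactly once when finished.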

@davenza

davenza commented Oct 8, 2023

@timkpaine I thought about a similar design, but I don't think it will work.

You link your Python library dynamically against libarrow and libarrow_python. But you link against specific versions of libarrow and libarrow_python. If the pyarrow version changes, pyarrow will load a different libarrow.so and libarrow_python.so, and the Python library will crash because there is no ABI stability. I have tested this myself in a little project.

Maybe there is a compilation/linker flag I'm not aware of that can solve this. I am very interested to see how your minimal work example could be completed.

@timkpaine
Member

ref: apache/arrow#37797

@davenza

davenza commented Oct 10, 2023

@timkpaine thank you! I think this solves my problem. I hope this gets merged soon.

@timkpaine
Member

@davenza timkpaine/arrow-cpp-python-nocopy#3 removes the need to have pyarrow, and should solve all ABI problems as the linkage between the python layer and the C++ layer is completely separated by the C API layer. I'm only doing a raw CPython binding for now, will do pybind after I finish array and table support.

@timkpaine
Member

ok pybind is done too, this is pretty easy and seems correct.

@pitrou sorry to bother you, does this look like the current best practice? timkpaine/arrow-cpp-python-nocopy#3

@pitrou

pitrou commented Feb 26, 2024

@timkpaine I see three problems:

  1. pack/unpack will not roundtrip, see [Python] Passing back and forth from Python and C++ with Pyarrow C++ extension and pybind11. apache/arrow#10488 (comment)
  2. that code seems to lack error detection and handling, and might have reference leaks
  3. the no-op capsule destructors (ReleaseArrowArrayPyCapsule and ReleaseArrowSchemaPyCapsule) will lead to massive memory leaks. Why didn't you reuse the code in https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#create-a-pycapsule?

@timkpaine
Member

Thanks for looking; sounds like it's good in general. This is just a demo, so I just wanted to demonstrate it working.

  1. I will adjust to check whether it has those OR whether it's already a correct capsule
  2. This I'll add later
  3. Laziness during debugging; I was getting segfaults, but it turned out the C array call returns a tuple and I didn't read the type annotations closely enough.

@dhirschfeld
Contributor

From reading this issue, my understanding is that, at present, the only way to pass a pyarrow.Table to perspective is to first convert it to a binary buffer as shown in the original post - is that correct?

@mhkeller

For anyone looking for how to do this in Node.js, this seems to work

const table = new arrow.Table(recordBatches);
const ipcStream = arrow.tableToIPC(table, 'stream');
// tableToIPC returns a Uint8Array, so no encoding argument is needed
const bytes = Buffer.from(ipcStream);

If there's a better way in Node to pass the table directly, please let me know!
