Write Arrow Table/RecordBatchReader to GDAL #346
Conversation
Thanks for starting this!
I have wanted to look at this as well, but as you might have noticed in other projects, my time for geo is a bit limited at the moment ;) But a starting WIP PR is always a good reason to take a look! I hope my comments provide some useful pointers.
pyogrio/_io.pyx (Outdated)

```cython
# Create output fields using CreateFieldFromArrowSchema()
static bool create_fields_from_arrow_schema(
    OGRLayerH destLayer,
    const struct ArrowSchema* schema,
    char** options
):
    # The schema object is a struct type where each child is a column.
    for child in schema.n_children:
        # Access the metadata for this column
        const char *metadata = child.metadata
        # TODO: I don't know how to parse this metadata in C... I guess I can
        # just use Python APIs for this in Cython?
        # https://github.com/OSGeo/gdal/pull/9133/files#diff-37bedc92ae1d5e04706c7b9f8ea9e9fcccf984ca0c9997e2020ff85f1b958433R1159-R1185
        # if metadata:
```
Hmm, it's a pity that GDAL doesn't provide a helper to create a full layer definition from a schema, instead of only field by field (that requires us to access the individual children of the ArrowSchema).
We might want to vendor some helpers from nanoarrow-c to extract the children, and the metadata etc.
Accessing the child is something we can probably do quite easily, as it's something like `schema.children[i]`
For a first prototype, I think we could also ignore the metadata for a moment, and manually specify which are the geometry columns.
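As an aside on the metadata-parsing TODO in the snippet above: the `ArrowSchema.metadata` blob has a simple documented layout in the Arrow C data interface (an int32 pair count, then length-prefixed key and value bytes, in native endianness), so it can also be decoded without GDAL or pyarrow. A minimal stdlib-only sketch; the helper names here are hypothetical, not pyogrio API:

```python
# Hypothetical helper: decode an ArrowSchema.metadata blob as laid out by the
# Arrow C data interface: int32 number of pairs, then for each pair an
# int32-length-prefixed key and an int32-length-prefixed value (native endian).
import struct

def parse_arrow_metadata(buf):
    if not buf:
        return {}
    pos = 0
    (n_pairs,) = struct.unpack_from("=i", buf, pos)
    pos += 4
    pairs = {}
    for _ in range(n_pairs):
        (klen,) = struct.unpack_from("=i", buf, pos)
        pos += 4
        key = buf[pos:pos + klen].decode("utf-8")
        pos += klen
        (vlen,) = struct.unpack_from("=i", buf, pos)
        pos += 4
        pairs[key] = buf[pos:pos + vlen]
        pos += vlen
    return pairs

def encode_arrow_metadata(pairs):
    # Inverse helper, used here only to build a round-trip test blob.
    out = struct.pack("=i", len(pairs))
    for key, value in pairs.items():
        out += struct.pack("=i", len(key)) + key.encode("utf-8")
        out += struct.pack("=i", len(value)) + value
    return out

blob = encode_arrow_metadata({"ARROW:extension:name": b"ogc.wkb"})
assert parse_arrow_metadata(blob) == {"ARROW:extension:name": b"ogc.wkb"}
```

The same byte-walking logic could be translated to C or Cython once the prototype needs it.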
pyogrio/_io.pyx (Outdated)

```cython
    char** options
):
    # The schema object is a struct type where each child is a column.
    for child in schema.n_children:
```
```diff
- for child in schema.n_children:
+ for child in range(schema.n_children):
```
I meant to change from

```diff
- for i in range(schema.n_children):
+ for child in schema.children:
```
I'm not sure if cython lets you iterate through a list of pointers as if it's a python list?
> I'm not sure if cython lets you iterate through a list of pointers as if it's a python list?
No idea either, you will have to try and see ;)
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
This now compiles on GDAL 3.8! I still need to figure out how to exclude this code from trying to compile on earlier GDAL versions, and I still need to test it.
Hmm, looking at how it is done on the reading side, it's only the declaration in the pxd that we put behind an `IF`, and then in the actual code we just raise an error at the beginning of the function, but still compile it regardless of the GDAL version (wondering if cython does something smart here).
Yeah, I was confused because I thought I was doing the same process as we have for reading.
I was able to compile and install it locally, but trying to use it with

```python
import pyarrow as pa
from pyogrio._io import ogr_write_arrow
import geopandas as gpd
from geopandas.io.arrow import _geopandas_to_arrow

gdf = gpd.read_file(gpd.datasets.get_path('nybb'))
table = _geopandas_to_arrow(gdf)
new_schema = table.schema.set(4, pa.field("geometry", pa.binary(), True, {"ARROW:extension:name": "ogc.wkb"}))
new_table = pa.Table.from_batches(table.to_batches(), new_schema)
test = ogr_write_arrow('test.fgb', 'test', 'FlatGeobuf', new_table, 'MultiPolygon', {}, {})
```

is crashing Python. Does anyone have any suggestions for how to debug? I don't know where to go from here.
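(As a general aside on debugging hard crashes like this, not specific to this PR: the stdlib `faulthandler` module dumps a Python-level traceback when an extension segfaults the interpreter, and running the script under `gdb` gives the C-level stack, as was done below. A minimal sketch:)

```python
# Enable faulthandler at the top of the crashing script so a fatal signal in
# the extension (SIGSEGV, SIGABRT, SIGBUS, ...) still prints a Python traceback.
import faulthandler

faulthandler.enable()
print(faulthandler.is_enabled())  # True once enabled
```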
I fixed the segfault (gdb was pointing at exporting the schema, so ...).
Those annoying error return codes ... ;) We need to very carefully inspect every one while writing: the GDAL function ... Now it is writing actual content to the file. Just not for FlatGeobuf (that gives some error while writing), but for e.g. GPKG it is working. I also opened #351 to more easily integrate this with all options for setting up the layer/dataset.
I merged in #351 which simplified this a lot. I'm still struggling with my dev environment though. In particular getting ... when trying to ...
Do you happen to do that import from within the project directory? Because in that case, you need to install it in editable mode (...). The issue is that when importing (or running tests) from the root of the repo, python will find the local `pyogrio` source folder first instead of the installed package.
Ah yes, I've hit this so many times. It's also why compiled rust projects advise using a different name for the folder containing python code.
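The shadowing mechanism described above can be demonstrated with the stdlib alone (the module name `shadow_demo` is made up for the example): when the interpreter starts with the current directory on `sys.path`, a local source folder wins over anything installed.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # A local "source folder" module, like the pyogrio/ dir at the repo root.
    with open(os.path.join(tmpdir, "shadow_demo.py"), "w") as f:
        f.write("WHERE = 'local source dir'\n")
    # Running python from that directory imports the local file first,
    # because the current directory is prepended to sys.path by default.
    result = subprocess.run(
        [sys.executable, "-c", "import shadow_demo; print(shadow_demo.WHERE)"],
        cwd=tmpdir, capture_output=True, text=True,
    )
    print(result.stdout.strip())
```

An editable install (`pip install -e .`) sidesteps this because the local folder then *is* the installed package, compiled extensions included.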
I added some conditional compilation statements around other functions that were trying to use GDAL Arrow struct definitions, so let's see if we can get it to compile with that.
pyogrio/_io.pyx (Outdated)

```cython
exc = exc_check()
if exc:
    raise exc
```
I think we actually typically catch the `CPLE_...` error and reraise that with one of our generic error classes? For example I see this pattern:

```python
except CPLE_BaseError as exc:
    raise FeatureError(str(exc))
```

But then what error class to use? (`FeatureError` or `FieldError`?)
Probably `DataSourceError`, since it is for the batch of features rather than individual features.
Used `DataLayerError` following our documentation, as it involves an error with a specific layer ("Errors relating to working with a single OGRLayer") and not with the whole dataset ("Errors relating to opening or closing an OGRDataSource (with >= 1 layers)").
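The catch-and-reraise pattern being discussed can be sketched in isolation (the exception classes here are stand-ins defined locally, not imported from pyogrio):

```python
# Stand-in error classes mirroring the pattern discussed: catch the low-level
# GDAL (CPLE_*) error and re-raise it as a layer-level pyogrio-style error.
class CPLE_BaseError(Exception):
    pass

class DataLayerError(Exception):
    pass

def write_arrow_batch(ok):
    try:
        if not ok:
            # stand-in for a failing OGR_L_WriteArrowBatch call
            raise CPLE_BaseError("failure writing the batch")
    except CPLE_BaseError as exc:
        # chain the original error so the GDAL message is preserved
        raise DataLayerError(str(exc)) from exc

write_arrow_batch(True)  # succeeds silently
try:
    write_arrow_batch(False)
except DataLayerError as exc:
    print(type(exc.__cause__).__name__)  # the original CPLE error is chained
```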
@himikof @brendan-ward thanks a lot for the extensive review! Already addressed a few of the comments, will need to go through the bulk of them one of the coming days.
Thanks for the updates!
- Can you please add a test when `crs` is not passed in, to verify the warning is raised? If possible, it would be good to also test (there or in a separate test case) whether GDAL auto-detects the CRS from metadata on the geometry column (per the `TODO` comment).
- Can you please add a test for a driver that doesn't support write? See `test_raw_io.py::test_write_unsupported`.
- It would probably also be good to probe at close errors on write, similar to `test_raw_io.py::test_write_gdalclose_error`.
- It might be good to test other drivers similar to `test_raw_io.py::test_write_supported`, though it looks like only FGB would be particularly unique there.
- If possible, please add tests for append capability supported / unsupported, similar to `test_raw_io.py::test_write_append` / `test_raw_io.py::test_write_append_unsupported`.
- Is the idea to add support to `write_dataframe` in a later PR? (would be a good idea, this is already getting pretty big)
pyogrio/raw.py (Outdated)

```python
if geometry_type is None:
    raise ValueError("Need to specify 'geometry_type'")
```
Would a 3-state variable make sense here, to enable writing without geometry?

- `""` (or some other arbitrary value to indicate unset): default, indicates not set by user and will raise an exception
- `None`: provided by user to indicate there is no geometry to write
- `"Point"` ... `"GeometryCollection"`: a valid geometry type
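The three states could be sketched with a sentinel default along these lines (all names hypothetical, not the actual pyogrio implementation; an `object()` sentinel is used here instead of `""`, but the idea is the same):

```python
# Sentinel distinguishing "not passed by the user" from an explicit None.
_UNSET = object()

_VALID_GEOMETRY_TYPES = {
    "Point", "LineString", "Polygon", "MultiPoint",
    "MultiLineString", "MultiPolygon", "GeometryCollection", "Unknown",
}

def resolve_geometry_type(geometry_type=_UNSET):
    if geometry_type is _UNSET:
        # default: the user did not pass anything
        raise ValueError("Need to specify 'geometry_type'")
    if geometry_type is None:
        return None  # write an attribute-only layer, no geometry column
    if geometry_type not in _VALID_GEOMETRY_TYPES:
        raise ValueError(f"Invalid geometry_type: {geometry_type!r}")
    return geometry_type

assert resolve_geometry_type("Point") == "Point"
assert resolve_geometry_type(None) is None
```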
```diff
@@ -304,3 +310,270 @@ def test_arrow_bool_exception(tmpdir, ext):
     else:
         with open_arrow(filename):
             pass

# Point(0, 0)
```
This also gets used in a variety of places, perhaps this could be added as a function in `conftest.py` and then just call it when used below?

```python
def get_wkb_points(count=1):
    return np.array([bytes.fromhex("010100000000000000000000000000000000000000")] * count, dtype=object)
```

We could hook it up to other usages in a different PR, this would just set us up for that while simplifying here.
- Added a test for not providing a ...
- Copied and adapted those tests from raw io (and fixed the ...)
- Yes, that was my idea. I already have a branch locally that starts testing that (most errors are about the missing support for promoting the geometry type, but which is something for which we could add some custom shapely code in the geopandas -> arrow conversion), but indeed was planning to leave that for a separate PR ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates @jorisvandenbossche , this is getting really close. A few minor outstanding comments, but otherwise this looks ready to merge (and we can deal with outstanding issues around encoding - namely not allowing encoding for anything other than shapefiles - in #384 after this is merged)
```python
    dataset_metadata, layer_metadata, metadata
)

# TODO: does GDAL infer CRS automatically from geometry metadata?
```
From the associated test case, it looks like GDAL does not infer this automatically; comment can be removed?
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
Thanks @jorisvandenbossche @kylebarron ! 🚀
Honestly it was mostly @jorisvandenbossche, but I'm happy to have got the ball rolling!
A very WIP implementation for writing Arrow. This is mostly to start discussion and is largely inspired/copied from OSGeo/gdal#9133.

Notes:

- Supports `pyarrow.Table`, `pyarrow.RecordBatchReader`, and geoarrow-rs' `GeoTable`. The `RecordBatchReader` means that it can handle an iterator of batches, so they don't all have to be materialized at once in a Python Table.

Todo:

References:

- `OGR_L_WriteArrowBatch`

Closes #314.
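The point about `RecordBatchReader` support can be illustrated with a stdlib-only mock (plain lists stand in for record batches; none of these names are pyogrio API): the writer pulls batches one at a time from an iterator, so the full table never needs to be in memory.

```python
# A generator yields one "batch" at a time, mirroring the pull-based
# RecordBatchReader code path rather than a fully materialized Table.
def batch_source(n_batches, batch_size):
    for i in range(n_batches):
        yield list(range(i * batch_size, (i + 1) * batch_size))

def write_batches(batches):
    rows_written = 0
    for batch in batches:  # only one batch is held in memory at a time
        rows_written += len(batch)
    return rows_written

print(write_batches(batch_source(4, 1000)))  # 4000 rows, streamed in 4 batches
```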