Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft: Implement to WKT for Geometries #788

Closed
wants to merge 3 commits into from
Closed

Conversation

b4l
Copy link
Contributor

@b4l b4l commented Sep 22, 2024

Some exploration for ToWKT. It might help with #762, but unsure what that entails. I'm just putting it here for feedback on whether this is a viable approach.

@kylebarron
Copy link
Member

Thanks for the contribution @b4l!

I was originally thinking that ToWKT would wrap the geozero WKT writer.

I think it's beneficial for us to have a custom WKB implementation because WKB use is so widespread in the ecosystem (e.g. GeoParquet) and having a very fast WKB parser and encoder is important.

But I think WKT performance is less critical because if someone really cares about performance why are they storing WKT to begin with 😄? So I'm less sure that a custom WKT parser/encoder is worth the maintenance overhead (when the project is already super large).

That said, I've been hitting some core issues with geozero, such as in georust/geozero#237, where I think geozero is somewhat fundamentally broken in its handling of Z and M dimensions. So in general I've been moving towards a greater decoupling from geozero where possible. E.g. #779 laid the groundwork for not using geozero for parsing Flatgeobuf geometries and #759 does the same for Shapefile.

Any thoughts on this?

fn to_wkt(&self) -> String;
}

impl<G: GeometryTrait<T = f64>> ToWKT for G {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of actual implementation, for scalars I'd have these as free functions so that they can take geo traits without needing blanket implementations. E.g. look at https://github.com/geoarrow/geoarrow-rs/blob/a613a30d6247aac0a960a033bd0f4ddb42f2cbe3/src/io/geo/scalar.rs.

So we'd have point_to_wkt:

pub fn point_to_wkt<T: CoordNum>(point: &impl PointTrait<T = T>) -> String {
    todo!()
}

and then geometry_to_wkt would downcast to each individual type like this:

pub fn geometry_to_geo<T: CoordNum>(geometry: &impl GeometryTrait<T = T>) -> geo::Geometry<T> {
match geometry.as_type() {
GeometryType::Point(geom) => geo::Geometry::Point(point_to_geo(geom)),
GeometryType::LineString(geom) => geo::Geometry::LineString(line_string_to_geo(geom)),

I think that's easier to follow (and test) than a big function that only works on G: GeometryTrait.

Comment on lines 23 to 25
if point.dim() == 3 {
wkt.push('Z')
}
Copy link
Member

@kylebarron kylebarron Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today the third dimension is always Z, but I want to minimize the amount of code that expects that to be true, because in the future we could support XYM as well.

So I'd suggest that the writer have a dimension: Dimension parameter. It should assert that geometry.dim() == dimension.size() but then match on Dimension to determine whether to write POINT, POINT Z, or in the future POINT M or POINT ZM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should overhaul the Dimension enum: dim() is actually num_dims(), and the conversion from dimension count is ambiguous. Also, I would call it Dimensions, rename dim() to dims(), and return the enum instead of the size.

Copy link
Member

@kylebarron kylebarron Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of today this split between the Dimension enum and the usize dim() in the trait is intentional.

In geoarrow-rs we have a split between the physical layout and the semantic meaning of those dimensions. Arrays are parameterized by a const generic D: usize. But in the future to support XYM we'll also need metadata about what the physical dimensions represent.

This is separate from geo-traits. The prototype in this repo of multi-dimensional geo-traits is physical only. So it's intentional for the dimension to return a usize. Ref georust/geo#1157 (comment), georust/geo#1157 (comment), georust/geo#1157 (comment). I'd recommend leaving a comment there in that issue if you would suggest a change to the traits.

The split between physical layout and semantic meaning also made it considerably easier to implement in geoarrow-rs, because usize is a const generic, whereas I struggled to get things working with Dimension as a generic. (I don't remember the exact reason off hand).

fn to_wkt(&self) -> String {
match self.as_type() {
GeometryType::Point(point) => {
let mut wkt = String::from("POINT");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of constructing String objects, would it be more performant to have a sort of "writer" concept? Every time we call push or push_str it allocates a new String right? I'm curious if emulating the Display trait https://doc.rust-lang.org/std/fmt/trait.Display.html might be useful?

Especially because we don't care that much about constructing individual WKT buffers... we want to build up an array of string data, so if we could push all string data from all geometries into a single backing String/Vec<u8>, then that would be faster

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is up to twice as fast as wkt and geo. Didn't consider how string arrays are generated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For fewer allocations, we could build a single Vec<u8> for the entire string of the entire column, and keep track of the offsets for each row ourselves.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we could also reliably estimate the number of characters per generated WKT string, that would also make it more efficient for generating WKT from an entire array of geometries.

For now though, we can just use StringBuilder::append_value: https://docs.rs/arrow/latest/arrow/array/type.StringBuilder.html#method.append_value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went for the Write trait in fmt in d3d586b . It interfaces nicely with Arrows StringBuilders which can be constructed with capacity so we only allocate once in the best case we know how much.

@b4l
Copy link
Contributor Author

b4l commented Sep 23, 2024

I think decoupling geozero is desirable as it is modeled for rows, whereas the arrow model is columnar, suggesting it should only be used for reading and writing spatial formats not natively supported (yet). For in-memory operations, I would completely avoid it. Hence having this seems necessary, though not for i/o but rather as a display/debug implementation to use within analytical processes/workflows or displaying SQL query results.

@kylebarron
Copy link
Member

I think decoupling geozero is desirable as it is modeled for rows, whereas the arrow model is columnar

I agree it's desirable, but just a question of what maintenance cost. I think a custom implementation of writing to WKT would probably be significantly easier than a custom parser implementation.

though not for i/o but rather as a display/debug implementation to use within analytical processes/workflows or displaying SQL query results.

I agree having display/debug is very valuable, but why not also use this for I/O if we're implementing a WKT writer anyways?

@b4l
Copy link
Contributor Author

b4l commented Sep 23, 2024

I agree having display/debug is very valuable, but why not also use this for I/O if we're implementing a WKT writer anyways?

WKT is not really a format, just a representation, do we really need a writer? I would assume that you just write to a CSV or a text file if you want to persist it.

@kylebarron
Copy link
Member

kylebarron commented Sep 23, 2024

I just mean that this WKT writer can be used for more than just debugging and display.

So for example we could offer a to_csv method, or something like that, that uses this WKT encoder under the hood. (Just an example, not sure it's worth implementing a to_csv)

It would also be good for the bindings to offer a to_wkt method, which mirrors e.g. shapely.to_wkt.

Comment on lines +36 to +39
GeoDataType::LargeMixed(_coord_type, _dimension) => todo!(),
GeoDataType::GeometryCollection(_coord_type, _dimension) => todo!(),
GeoDataType::LargeGeometryCollection(_coord_type, _dimension) => todo!(),
GeoDataType::WKB => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that this is being refactored in #784, where GeoDataType is being separated into a NativeType for all native GeoArrow types and SerializedType for GeoArrow serialized types (WKB and WKT)

I'd like to get that PR merged within the next couple days

Comment on lines +43 to +44
geometry_to_wkt(&wkb.to_wkb_object(), &mut wkt_builder)?;
wkt_builder.append_value("");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not exactly sure what how the GenericStringBuilder handles the implementation of Write. In particular, we need to ensure that it's not adding a new offset for every time write_str is called.

It may be easier to maintain the offsets ourselves in our own builder, but that's easy to change later once the WKT writing has stabilized

@kylebarron
Copy link
Member

I think a good place to start with this PR is ensuring we have correct writing of scalars to WKT, and leaving writing of arrays for follow ups.

What do you think about like including the tests from geozero's WKT writer to just ensure we have parity for writing scalars?

Comment on lines +409 to +410
let coords = InterleavedCoordBuffer::<2>::new(vec![1., 2.].into());
let point = OwnedPoint::new(CoordBuffer::Interleaved(coords), 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be easier to use crate::test::point::point_array and then access the first point.

Comment on lines +435 to +441
let coords = InterleavedCoordBuffer::<2>::new(vec![0., 0., 4., 0., 2., 4., 0., 0.].into());
let polygon = OwnedPolygon::new(
CoordBuffer::Interleaved(coords),
OffsetBuffer::<i32>::new(vec![0, 1].into()),
OffsetBuffer::<i32>::new(vec![0, 4].into()),
0,
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for all of these. It's way too easy to get these offsets wrong to write them by hand. I'd prefer to improve the test data in a single location (crate::test::*), and even there all array data should be constructed using From impls from other structs that impl geometry traits.

let mut wkt = String::new();
linestring_to_wkt(&linestring, &mut wkt).unwrap();

assert_eq!(&wkt, "LINESTRING (1.0 2.0,3.0 4.0,5.0 6.0)");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's enough to assert that wkt matches this string because it's not obvious at a glance that this string is actually valid WKT.

@kylebarron
Copy link
Member

The latest commit reminded me that the implementation of reading/writing WKT is not actually in geozero; it's in https://github.com/georust/wkt.

The geozero wrapper on top handles the streaming read/write of WKT and the integration into geozero traits. But wkt is usable on its own.

Unless there's some reason why we can't use wkt directly, I'd be strongly in favor of depending on WKT and not rolling our own wkt writer.

@kylebarron
Copy link
Member

I think decoupling geozero is desirable as it is modeled for rows, whereas the arrow model is columnar

To follow up on this, I think it's important to note where geozero does and does not fit in. In particular for export to WKT, it's easy for Arrow to handle columnar strings, so it's easy to use geozero (or wkt directly) for this.

It's harder for parsing to geometries because we need to handle the special builders for geometry data compared to string data.

@b4l
Copy link
Contributor Author

b4l commented Sep 25, 2024

The main difference to wkt is that this enables to write geometries directly to a string array. With wkt it is geometry > intermediate representation > string > sink. The former could be enabled through wkt, though it would require some substantial refactoring.

@kylebarron
Copy link
Member

Two thoughts:

  1. I don't consider writing to WKT absolutely performance critical compared to other operations like reading/writing WKB. For debugging and serializing to slower formats like CSV+WKT, "fast enough" is OK if it simplifies implementation and there's a path for longer term speedups.

  2. With wkt it is geometry > intermediate representation > string > sink

    My preferred way to long-term solve this is to get the community to adopt geo-traits. Once there's a common implementation of geo-traits, then both wkt and geoarrow can implement the same traits. Then there will be no intermediate representation on the wkt side.

    And for the string part, my read is that since wkt implements writing via the fmt::Display trait, we don't have to materialize the string first. It should be able to use that trait to write WKT strings directly to the sink.

    It would actually be awesome to benchmark an implementation of changing the wkt crate to use geo-traits and see how much overhead the intermediate representation actually has. The compiler might be able to simplify a lot of the actual conversion between formats.

@kylebarron
Copy link
Member

In #791 yesterday I laid the groundwork for parsing WKT from the wkt crate without going through geozero. It might be marginally faster but mainly it means that it's a simpler implementation since we have full access to all the coordinates upfront rather than needing to handle the stream geozero gives us.

@b4l b4l closed this Sep 25, 2024
@b4l b4l deleted the to-wkt branch September 25, 2024 18:35
@kylebarron kylebarron mentioned this pull request Sep 26, 2024
kylebarron added a commit that referenced this pull request Sep 26, 2024
### Change list

- Convert geo-traits scalars to `wkt::Wkt`
- This is prep work for direct transformation of geoarrow scalars to
`wkt::Wkt` objects.
- This does not go through geozero, partially because I hit issues with
multidimensional writing in georust/geozero#237
- This supersedes #788,
though it hasn't yet piped through to a full `ToWkt` trait implemented
on arrays. cc @b4l
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants