Draft: Implement to WKT for Geometries #788

b4l · 2024-09-22T20:21:42Z

Some exploration for ToWKT. It might help with #762, but unsure what that entails. I'm just putting it here for feedback on whether this is a viable approach.

kylebarron · 2024-09-23T12:57:10Z

Thanks for the contribution @b4l!

I was originally thinking that ToWKT would wrap the geozero WKT writer.

I think it's beneficial for us to have a custom WKB implementation because WKB use is so widespread in the ecosystem (e.g. GeoParquet) and having a very fast WKB parser and encoder is important.

But I think WKT performance is less critical because if someone really cares about performance why are they storing WKT to begin with 😄? So I'm less sure that a custom WKT parser/encoder is worth the maintenance overhead (when the project is already super large).

That said, I've been hitting some core issues with geozero, such as in georust/geozero#237, where I think geozero is somewhat fundamentally broken in its handling of Z and M dimensions. So in general I've been moving towards a greater decoupling from geozero where possible. E.g. #779 laid the groundwork for not using geozero for parsing Flatgeobuf geometries and #759 does the same for Shapefile.

Any thoughts on this?

kylebarron · 2024-09-23T13:02:04Z

src/algorithm/native/to_wkt.rs

+    fn to_wkt(&self) -> String;
+}
+
+impl<G: GeometryTrait<T = f64>> ToWKT for G {


In terms of actual implementation, for scalars I'd have these as free functions so that they can take geo traits without needing blanket implementations. E.g. look at https://github.com/geoarrow/geoarrow-rs/blob/a613a30d6247aac0a960a033bd0f4ddb42f2cbe3/src/io/geo/scalar.rs.

So we'd have point_to_wkt:

pub fn point_to_wkt<T: CoordNum>(point: &impl PointTrait<T = T>) -> String { todo!() }

and then geometry_to_wkt would downcast to each individual type like this:

geoarrow-rs/src/io/geo/scalar.rs

Lines 105 to 108 in a613a30

pub fn geometry_to_geo<T: CoordNum>(geometry: &impl GeometryTrait<T = T>) -> geo::Geometry<T> {

match geometry.as_type() {

GeometryType::Point(geom) => geo::Geometry::Point(point_to_geo(geom)),

GeometryType::LineString(geom) => geo::Geometry::LineString(line_string_to_geo(geom)),

I think that's easier to follow (and test) than a big function that only works on G: GeometryTrait.

kylebarron · 2024-09-23T13:30:37Z

src/algorithm/native/to_wkt.rs

+                if point.dim() == 3 {
+                    wkt.push('Z')
+                }


Today the third dimension is always Z, but I want to minimize the amount of code that expects that to be true, because in the future we could support XYM as well.

So I'd suggest that the writer have a dimension: Dimension parameter. It should assert that geometry.dim() == dimension.size() but then match on Dimension to determine whether to write POINT, POINT Z, or in the future POINT M or POINT ZM

We probably should overhaul the Dimension enum: dim() is actually num_dims(), and the conversion from dimension count is ambiguous. Also, I would call it Dimensions, rename dim() to dims(), and return the enum instead of the size.

As of today this split between the Dimension enum and the usize dim() in the trait is intentional.

In geoarrow-rs we have a split between the physical layout and the semantic meaning of those dimensions. Arrays are parameterized by a const generic D: usize. But in the future to support XYM we'll also need metadata about what the physical dimensions represent.

This is separate from geo-traits. The prototype in this repo of multi-dimensional geo-traits is physical only. So it's intentional for the dimension to return a usize. Ref georust/geo#1157 (comment), georust/geo#1157 (comment), georust/geo#1157 (comment). I'd recommend leaving a comment there in that issue if you would suggest a change to the traits.

The split between physical layout and semantic meaning also made it considerably easier to implement in geoarrow-rs, because usize is a const generic, whereas I struggled to get things working with Dimension as a generic. (I don't remember the exact reason off hand).

kylebarron · 2024-09-23T13:33:22Z

src/algorithm/native/to_wkt.rs

+    fn to_wkt(&self) -> String {
+        match self.as_type() {
+            GeometryType::Point(point) => {
+                let mut wkt = String::from("POINT");


Instead of constructing String objects, would it be more performant to have a sort of "writer" concept? Every time we call push or push_str it allocates a new String right? I'm curious if emulating the Display trait https://doc.rust-lang.org/std/fmt/trait.Display.html might be useful?

Especially because we don't care that much about constructing individual WKT buffers... we want to build up an array of string data, so if we could push all string data from all geometries into a single backing String/Vec<u8>, then that would be faster

It is up to twice as fast as wkt and geo. Didn't consider how string arrays are generated.

For fewer allocations, we could build a single Vec<u8> for the entire string of the entire column, and keep track of the offsets for each row ourselves.

If we could also reliably estimate the number of characters per generated WKT string, that would also make it more efficient for generating WKT from an entire array of geometries.

For now though, we can just use StringBuilder::append_value: https://docs.rs/arrow/latest/arrow/array/type.StringBuilder.html#method.append_value

Went for the Write trait in fmt in d3d586b . It interfaces nicely with Arrows StringBuilders which can be constructed with capacity so we only allocate once in the best case we know how much.

b4l · 2024-09-23T18:54:24Z

I think decoupling geozero is desirable as it is modeled for rows, whereas the arrow model is columnar, suggesting it should only be used for reading and writing spatial formats not natively supported (yet). For in-memory operations, I would completely avoid it. Hence having this seems necessary, though not for i/o but rather as a display/debug implementation to use within analytical processes/workflows or displaying SQL query results.

kylebarron · 2024-09-23T19:37:29Z

I think decoupling geozero is desirable as it is modeled for rows, whereas the arrow model is columnar

I agree it's desirable, but just a question of what maintenance cost. I think a custom implementation of writing to WKT would probably be significantly easier than a custom parser implementation.

though not for i/o but rather as a display/debug implementation to use within analytical processes/workflows or displaying SQL query results.

I agree having display/debug is very valuable, but why not also use this for I/O if we're implementing a WKT writer anyways?

b4l · 2024-09-23T20:04:56Z

I agree having display/debug is very valuable, but why not also use this for I/O if we're implementing a WKT writer anyways?

WKT is not really a format, just a representation, do we really need a writer? I would assume that you just write to a CSV or a text file if you want to persist it.

kylebarron · 2024-09-23T20:25:15Z

I just mean that this WKT writer can be used for more than just debugging and display.

So for example we could offer a to_csv method, or something like that, that uses this WKT encoder under the hood. (Just an example, not sure it's worth implementing a to_csv)

It would also be good for the bindings to offer a to_wkt method, which mirrors e.g. shapely.to_wkt.

kylebarron · 2024-09-24T19:08:03Z

src/algorithm/native/to_wkt.rs

+            GeoDataType::LargeMixed(_coord_type, _dimension) => todo!(),
+            GeoDataType::GeometryCollection(_coord_type, _dimension) => todo!(),
+            GeoDataType::LargeGeometryCollection(_coord_type, _dimension) => todo!(),
+            GeoDataType::WKB => {


Just a note that this is being refactored in #784, where GeoDataType is being separated into a NativeType for all native GeoArrow types and SerializedType for GeoArrow serialized types (WKB and WKT)

I'd like to get that PR merged within the next couple days

kylebarron · 2024-09-24T19:17:53Z

src/algorithm/native/to_wkt.rs

+                            geometry_to_wkt(&wkb.to_wkb_object(), &mut wkt_builder)?;
+                            wkt_builder.append_value("");


I'm not exactly sure what how the GenericStringBuilder handles the implementation of Write. In particular, we need to ensure that it's not adding a new offset for every time write_str is called.

It may be easier to maintain the offsets ourselves in our own builder, but that's easy to change later once the WKT writing has stabilized

kylebarron · 2024-09-24T19:21:18Z

I think a good place to start with this PR is ensuring we have correct writing of scalars to WKT, and leaving writing of arrays for follow ups.

What do you think about like including the tests from geozero's WKT writer to just ensure we have parity for writing scalars?

kylebarron · 2024-09-24T21:15:26Z

src/algorithm/native/to_wkt.rs

+        let coords = InterleavedCoordBuffer::<2>::new(vec![1., 2.].into());
+        let point = OwnedPoint::new(CoordBuffer::Interleaved(coords), 0);


I think it would be easier to use crate::test::point::point_array and then access the first point.

kylebarron · 2024-09-24T21:18:04Z

src/algorithm/native/to_wkt.rs

+        let coords = InterleavedCoordBuffer::<2>::new(vec![0., 0., 4., 0., 2., 4., 0., 0.].into());
+        let polygon = OwnedPolygon::new(
+            CoordBuffer::Interleaved(coords),
+            OffsetBuffer::<i32>::new(vec![0, 1].into()),
+            OffsetBuffer::<i32>::new(vec![0, 4].into()),
+            0,
+        );


Ditto for all of these. It's way too easy to get these offsets wrong to write them by hand. I'd prefer to improve the test data in a single location (crate::test::*), and even there all array data should be constructed using From impls from other structs that impl geometry traits.

kylebarron · 2024-09-24T21:23:03Z

src/algorithm/native/to_wkt.rs

+        let mut wkt = String::new();
+        linestring_to_wkt(&linestring, &mut wkt).unwrap();
+
+        assert_eq!(&wkt, "LINESTRING (1.0 2.0,3.0 4.0,5.0 6.0)");


I don't think it's enough to assert that wkt matches this string because it's not obvious at a glance that this string is actually valid WKT.

kylebarron · 2024-09-24T21:25:49Z

The latest commit reminded me that the implementation of reading/writing WKT is not actually in geozero; it's in https://github.com/georust/wkt.

The geozero wrapper on top handles the streaming read/write of WKT and the integration into geozero traits. But wkt is usable on its own.

Unless there's some reason why we can't use wkt directly, I'd be strongly in favor of depending on WKT and not rolling our own wkt writer.

kylebarron · 2024-09-25T13:45:06Z

I think decoupling geozero is desirable as it is modeled for rows, whereas the arrow model is columnar

To follow up on this, I think it's important to note where geozero does and does not fit in. In particular for export to WKT, it's easy for Arrow to handle columnar strings, so it's easy to use geozero (or wkt directly) for this.

It's harder for parsing to geometries because we need to handle the special builders for geometry data compared to string data.

b4l · 2024-09-25T17:29:26Z

The main difference to wkt is that this enables to write geometries directly to a string array. With wkt it is geometry > intermediate representation > string > sink. The former could be enabled through wkt, though it would require some substantial refactoring.

kylebarron · 2024-09-25T18:13:47Z

Two thoughts:

I don't consider writing to WKT absolutely performance critical compared to other operations like reading/writing WKB. For debugging and serializing to slower formats like CSV+WKT, "fast enough" is OK if it simplifies implementation and there's a path for longer term speedups.
With wkt it is geometry > intermediate representation > string > sink

My preferred way to long-term solve this is to get the community to adopt geo-traits. Once there's a common implementation of geo-traits, then both wkt and geoarrow can implement the same traits. Then there will be no intermediate representation on the wkt side.

And for the string part, my read is that since wkt implements writing via the fmt::Display trait, we don't have to materialize the string first. It should be able to use that trait to write WKT strings directly to the sink.

It would actually be awesome to benchmark an implementation of changing the wkt crate to use geo-traits and see how much overhead the intermediate representation actually has. The compiler might be able to simplify a lot of the actual conversion between formats.

kylebarron · 2024-09-25T18:15:53Z

In #791 yesterday I laid the groundwork for parsing WKT from the wkt crate without going through geozero. It might be marginally faster but mainly it means that it's a simpler implementation since we have full access to all the coordinates upfront rather than needing to handle the stream geozero gives us.

@b4l

### Change list - Convert geo-traits scalars to `wkt::Wkt` - This is prep work for direct transformation of geoarrow scalars to `wkt::Wkt` objects. - This does not go through geozero, partially because I hit issues with multidimensional writing in georust/geozero#237 - This supersedes #788, though it hasn't yet piped through to a full `ToWkt` trait implemented on arrays. cc @b4l

Implement to WKT for Geometries

c7f64ab

b4l force-pushed the to-wkt branch from dd82325 to c7f64ab Compare September 22, 2024 20:25

kylebarron reviewed Sep 23, 2024

View reviewed changes

Integrate format trait

d3d586b

kylebarron reviewed Sep 24, 2024

View reviewed changes

Add some tests

e7d145a

kylebarron reviewed Sep 24, 2024

View reviewed changes

This was referenced Sep 24, 2024

Implement writing to WKT using wkt crate #762

Closed

Use ryu for faster float-to-string conversion georust/wkt#118

Open

b4l closed this Sep 25, 2024

b4l deleted the to-wkt branch September 25, 2024 18:35

kylebarron mentioned this pull request Sep 26, 2024

Prep work for to_wkt #795

Merged

kylebarron mentioned this pull request Oct 26, 2024

Implement geo-traits for writing to WKT & perf improvement georust/wkt#124

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Draft: Implement to WKT for Geometries #788

Draft: Implement to WKT for Geometries #788

b4l commented Sep 22, 2024 •

edited

Loading

kylebarron commented Sep 23, 2024

kylebarron Sep 23, 2024

kylebarron Sep 23, 2024 •

edited

Loading

b4l Sep 23, 2024

kylebarron Sep 23, 2024 •

edited

Loading

kylebarron Sep 23, 2024

b4l Sep 23, 2024

kylebarron Sep 23, 2024

kylebarron Sep 23, 2024

b4l Sep 24, 2024

b4l commented Sep 23, 2024

kylebarron commented Sep 23, 2024

b4l commented Sep 23, 2024

kylebarron commented Sep 23, 2024 •

edited

Loading

kylebarron Sep 24, 2024

kylebarron Sep 24, 2024

kylebarron commented Sep 24, 2024

kylebarron Sep 24, 2024

kylebarron Sep 24, 2024

kylebarron Sep 24, 2024

kylebarron commented Sep 24, 2024

kylebarron commented Sep 25, 2024

b4l commented Sep 25, 2024 •

edited

Loading

kylebarron commented Sep 25, 2024

kylebarron commented Sep 25, 2024

	pub fn geometry_to_geo<T: CoordNum>(geometry: &impl GeometryTrait<T = T>) -> geo::Geometry<T> {
	match geometry.as_type() {
	GeometryType::Point(geom) => geo::Geometry::Point(point_to_geo(geom)),
	GeometryType::LineString(geom) => geo::Geometry::LineString(line_string_to_geo(geom)),

		geometry_to_wkt(&wkb.to_wkb_object(), &mut wkt_builder)?;
		wkt_builder.append_value("");

		let coords = InterleavedCoordBuffer::<2>::new(vec![1., 2.].into());
		let point = OwnedPoint::new(CoordBuffer::Interleaved(coords), 0);

Draft: Implement to WKT for Geometries #788

Draft: Implement to WKT for Geometries #788

Conversation

b4l commented Sep 22, 2024 • edited Loading

kylebarron commented Sep 23, 2024

Choose a reason for hiding this comment

kylebarron Sep 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kylebarron Sep 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

b4l commented Sep 23, 2024

kylebarron commented Sep 23, 2024

b4l commented Sep 23, 2024

kylebarron commented Sep 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kylebarron commented Sep 24, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kylebarron commented Sep 24, 2024

kylebarron commented Sep 25, 2024

b4l commented Sep 25, 2024 • edited Loading

kylebarron commented Sep 25, 2024

kylebarron commented Sep 25, 2024

b4l commented Sep 22, 2024 •

edited

Loading

kylebarron Sep 23, 2024 •

edited

Loading

kylebarron Sep 23, 2024 •

edited

Loading

kylebarron commented Sep 23, 2024 •

edited

Loading

b4l commented Sep 25, 2024 •

edited

Loading