-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft: Implement to WKT for Geometries #788
Conversation
Thanks for the contribution @b4l! I was originally thinking that I think it's beneficial for us to have a custom WKB implementation because WKB use is so widespread in the ecosystem (e.g. GeoParquet) and having a very fast WKB parser and encoder is important. But I think WKT performance is less critical because if someone really cares about performance why are they storing WKT to begin with 😄? So I'm less sure that a custom WKT parser/encoder is worth the maintenance overhead (when the project is already super large). That said, I've been hitting some core issues with geozero, such as in georust/geozero#237, where I think geozero is somewhat fundamentally broken in its handling of Z and M dimensions. So in general I've been moving towards a greater decoupling from geozero where possible. E.g. #779 laid the groundwork for not using geozero for parsing Flatgeobuf geometries and #759 does the same for Shapefile. Any thoughts on this? |
src/algorithm/native/to_wkt.rs
Outdated
fn to_wkt(&self) -> String; | ||
} | ||
|
||
impl<G: GeometryTrait<T = f64>> ToWKT for G { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In terms of actual implementation, for scalars I'd have these as free functions so that they can take geo traits without needing blanket implementations. E.g. look at https://github.com/geoarrow/geoarrow-rs/blob/a613a30d6247aac0a960a033bd0f4ddb42f2cbe3/src/io/geo/scalar.rs.
So we'd have point_to_wkt
:
pub fn point_to_wkt<T: CoordNum>(point: &impl PointTrait<T = T>) -> String {
todo!()
}
and then geometry_to_wkt
would downcast to each individual type like this:
geoarrow-rs/src/io/geo/scalar.rs
Lines 105 to 108 in a613a30
pub fn geometry_to_geo<T: CoordNum>(geometry: &impl GeometryTrait<T = T>) -> geo::Geometry<T> { | |
match geometry.as_type() { | |
GeometryType::Point(geom) => geo::Geometry::Point(point_to_geo(geom)), | |
GeometryType::LineString(geom) => geo::Geometry::LineString(line_string_to_geo(geom)), |
I think that's easier to follow (and test) than a big function that only works on G: GeometryTrait
.
src/algorithm/native/to_wkt.rs
Outdated
if point.dim() == 3 { | ||
wkt.push('Z') | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Today the third dimension is always Z, but I want to minimize the amount of code that expects that to be true, because in the future we could support XYM
as well.
So I'd suggest that the writer have a dimension: Dimension
parameter. It should assert that geometry.dim() == dimension.size()
but then match on Dimension
to determine whether to write POINT
, POINT Z
, or in the future POINT M
or POINT ZM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably should overhaul the Dimension
enum: dim()
is actually num_dims()
, and the conversion from dimension count is ambiguous. Also, I would call it Dimensions
, rename dim()
to dims()
, and return the enum instead of the size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of today this split between the Dimension
enum and the usize dim()
in the trait is intentional.
In geoarrow-rs we have a split between the physical layout and the semantic meaning of those dimensions. Arrays are parameterized by a const generic D: usize
. But in the future to support XYM
we'll also need metadata about what the physical dimensions represent.
This is separate from geo-traits. The prototype in this repo of multi-dimensional geo-traits is physical only. So it's intentional for the dimension to return a usize
. Ref georust/geo#1157 (comment), georust/geo#1157 (comment), georust/geo#1157 (comment). I'd recommend leaving a comment there in that issue if you would suggest a change to the traits.
The split between physical layout and semantic meaning also made it considerably easier to implement in geoarrow-rs, because usize
is a const generic, whereas I struggled to get things working with Dimension as a generic. (I don't remember the exact reason off hand).
src/algorithm/native/to_wkt.rs
Outdated
fn to_wkt(&self) -> String { | ||
match self.as_type() { | ||
GeometryType::Point(point) => { | ||
let mut wkt = String::from("POINT"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of constructing String
objects, would it be more performant to have a sort of "writer" concept? Every time we call push
or push_str
it allocates a new String right? I'm curious if emulating the Display trait https://doc.rust-lang.org/std/fmt/trait.Display.html might be useful?
Especially because we don't care that much about constructing individual WKT buffers... we want to build up an array of string data, so if we could push all string data from all geometries into a single backing String
/Vec<u8>
, then that would be faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is up to twice as fast as wkt
and geo
. Didn't consider how string arrays are generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For fewer allocations, we could build a single Vec<u8>
for the entire string of the entire column, and keep track of the offsets for each row ourselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we could also reliably estimate the number of characters per generated WKT string, that would also make it more efficient for generating WKT from an entire array of geometries.
For now though, we can just use StringBuilder::append_value
: https://docs.rs/arrow/latest/arrow/array/type.StringBuilder.html#method.append_value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went for the Write
trait in fmt
in d3d586b . It interfaces nicely with Arrows StringBuilders
which can be constructed with capacity so we only allocate once in the best case we know how much.
I think decoupling geozero is desirable as it is modeled for rows, whereas the arrow model is columnar, suggesting it should only be used for reading and writing spatial formats not natively supported (yet). For in-memory operations, I would completely avoid it. Hence having this seems necessary, though not for i/o but rather as a display/debug implementation to use within analytical processes/workflows or displaying SQL query results. |
I agree it's desirable, but just a question of what maintenance cost. I think a custom implementation of writing to WKT would probably be significantly easier than a custom parser implementation.
I agree having display/debug is very valuable, but why not also use this for I/O if we're implementing a WKT writer anyways? |
WKT is not really a format, just a representation, do we really need a writer? I would assume that you just write to a CSV or a text file if you want to persist it. |
I just mean that this WKT writer can be used for more than just debugging and display. So for example we could offer a It would also be good for the bindings to offer a |
GeoDataType::LargeMixed(_coord_type, _dimension) => todo!(), | ||
GeoDataType::GeometryCollection(_coord_type, _dimension) => todo!(), | ||
GeoDataType::LargeGeometryCollection(_coord_type, _dimension) => todo!(), | ||
GeoDataType::WKB => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note that this is being refactored in #784, where GeoDataType
is being separated into a NativeType
for all native GeoArrow types and SerializedType
for GeoArrow serialized types (WKB and WKT)
I'd like to get that PR merged within the next couple days
geometry_to_wkt(&wkb.to_wkb_object(), &mut wkt_builder)?; | ||
wkt_builder.append_value(""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not exactly sure what how the GenericStringBuilder
handles the implementation of Write
. In particular, we need to ensure that it's not adding a new offset for every time write_str
is called.
It may be easier to maintain the offsets ourselves in our own builder, but that's easy to change later once the WKT writing has stabilized
I think a good place to start with this PR is ensuring we have correct writing of scalars to WKT, and leaving writing of arrays for follow ups. What do you think about like including the tests from geozero's WKT writer to just ensure we have parity for writing scalars? |
let coords = InterleavedCoordBuffer::<2>::new(vec![1., 2.].into()); | ||
let point = OwnedPoint::new(CoordBuffer::Interleaved(coords), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be easier to use crate::test::point::point_array
and then access the first point.
let coords = InterleavedCoordBuffer::<2>::new(vec![0., 0., 4., 0., 2., 4., 0., 0.].into()); | ||
let polygon = OwnedPolygon::new( | ||
CoordBuffer::Interleaved(coords), | ||
OffsetBuffer::<i32>::new(vec![0, 1].into()), | ||
OffsetBuffer::<i32>::new(vec![0, 4].into()), | ||
0, | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto for all of these. It's way too easy to get these offsets wrong to write them by hand. I'd prefer to improve the test data in a single location (crate::test::*
), and even there all array data should be constructed using From
impls from other structs that impl geometry traits.
let mut wkt = String::new(); | ||
linestring_to_wkt(&linestring, &mut wkt).unwrap(); | ||
|
||
assert_eq!(&wkt, "LINESTRING (1.0 2.0,3.0 4.0,5.0 6.0)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's enough to assert that wkt
matches this string because it's not obvious at a glance that this string is actually valid WKT.
The latest commit reminded me that the implementation of reading/writing WKT is not actually in geozero; it's in https://github.com/georust/wkt. The geozero wrapper on top handles the streaming read/write of WKT and the integration into geozero traits. But Unless there's some reason why we can't use |
To follow up on this, I think it's important to note where geozero does and does not fit in. In particular for export to WKT, it's easy for Arrow to handle columnar strings, so it's easy to use geozero (or It's harder for parsing to geometries because we need to handle the special builders for geometry data compared to string data. |
The main difference to |
Two thoughts:
|
In #791 yesterday I laid the groundwork for parsing WKT from the |
### Change list - Convert geo-traits scalars to `wkt::Wkt` - This is prep work for direct transformation of geoarrow scalars to `wkt::Wkt` objects. - This does not go through geozero, partially because I hit issues with multidimensional writing in georust/geozero#237 - This supersedes #788, though it hasn't yet piped through to a full `ToWkt` trait implemented on arrays. cc @b4l
Some exploration for ToWKT. It might help with #762, but unsure what that entails. I'm just putting it here for feedback on whether this is a viable approach.