
Allow storing points/coordinates as struct arrays in addition to interleaved layout #26

Merged · 8 commits · Jul 14, 2023

Conversation

@paleolimbot (Contributor)

The comment by @apps4uco in #16 is the latest in a number expressing surprise at the way the current geoarrow spec suggests that points be stored. While I think that our current spec (xyzxyzxyzxyz) has value somewhere, it's unintuitive to enough of the people I've talked to that I think we should consider changing our initial spec version to (xxxxxx, yyyyyy). Since we started blogging and tweeting about this - largely for the purposes of generating this kind of discussion - all we have had are comments about why we didn't choose (xxxxxx, yyyyyy). I became involved after this initial decision had been made and I think it's the perfect time to change it (i.e., before any implementations actually exist).

The best arguments to store points as Struct <x: [...], y: [...]> by default are:

  • Arrow is a columnar format and storing points rowwise is ideologically opposed to the specification. Practically, Arrow has a lot of tooling around working with data in this form (for example, we don't need any special code to access the x or y values from the inner points array meaning we might be able to implement more operations without custom C or C++ kernels).
  • The child name for the FixedSizeList array might be deprecated in Arrow, meaning we probably shouldn't rely on it to store dimension information (https://issues.apache.org/jira/browse/ARROW-14999).
  • Point geometries are often stored in Arrow format (or any table) as one column for x and one column for y. This is also how point clouds typically store coordinates (to my reading of PDAL's list of reasons why it had to split off from other geometry representations).
  • Kernels that don't care about the z or m values don't have to be rewritten to be used for xyz, xym, or xyzm geometries (the first child is always X and the second child is always Y regardless of dimension). A lot of geometry kernels fall into this category. I'm personally sick of the number of times I've had to parameterize coord_size, z_index and m_index...this really does make writing kernels harder.
  • File implementations like Parquet store min/max values per chunk, I believe even for nested arrays. This means we can more easily use column statistics for predicate pushdown when reading Parquet files.

The only reason I know of to use the current spec ([x y, x y, x y]) is that it more closely matches how WKB does it and that it uses fewer buffers. This isn't to say it doesn't have any other advantages, but I think there's enough anecdotal evidence to put the burden of proof the other way around: if storing coordinates as [x y, x y, x y] in an Arrow array has value, its value must (in my opinion) be proved.
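
For concreteness, here's a minimal pyarrow sketch of the two layouts under discussion (the data and field names are illustrative, not part of any spec):

```python
# A minimal sketch of the two candidate point layouts, using pyarrow.
import pyarrow as pa

# Interleaved layout (current spec): a single buffer of xyxyxy values.
interleaved = pa.FixedSizeListArray.from_arrays(
    pa.array([0.0, 10.0, 1.0, 11.0, 2.0, 12.0], pa.float64()), 2
)

# Struct layout (this proposal): one buffer per dimension (xxx..., yyy...).
separated = pa.StructArray.from_arrays(
    [pa.array([0.0, 1.0, 2.0]), pa.array([10.0, 11.0, 12.0])],
    names=["x", "y"],
)

print(interleaved.type)      # fixed_size_list<item: double>[2]
print(separated.type)        # struct<x: double, y: double>
print(separated.field("x"))  # the x child is addressable directly, zero-copy
```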

@kylebarron (Member)

Thanks for making the PR!

Given #16 (comment) I wasn't expecting the PR to switch to mandating storing data as structs. I'll have to keep thinking about getting rid of the fixed size list entirely.

I'd love to hear more about the FlatGeobuf community's experience with their layout, which stores xy interleaved as one array and then stores z, m, and t as optional separate arrays.

i.e., before any implementations actually exist

Well, might be a little late for that. It's already in cuspatial and GDAL at least.

it more closely matches how WKB does it

In my mind, I had thought most implementations' low-level memory storage was interleaved, but maybe I'm wrong, and my mind is colored by higher level constructs like WKB and GeoJSON.

As one data point, deck.gl is entirely built upon interleaved xy and xyz buffers (maybe related to @bmschmidt's point here). And so deck.gl would need to make a copy to interleave arrow memory into a single buffer before it can copy to the GPU. There's actually a small deck.gl conference this week where I'm presenting on using it with GeoArrow (with FixedSizeLists 😅) and looking forward to feedback from longer-term maintainers than myself on GeoArrow and whether handling separated buffers would indeed require rewriting all the WebGL code.

format.md Outdated
[variable sized list](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout)
arrays. In practice, this means we have additional arrays storing the offsets
that denote where a new geometry, a new geometry part, or a new polygon ring
starts.

**Point**: `FixedSizeList<double>[n_dim]`
**Point**: `Struct<x: double, y: double, [z: double, [m: double]]>`
@bmschmidt commented Sep 19, 2022

In this case, is it necessary for m to be a double at all? For example, if I have a Line representing the path of a ship on the ocean, I might want m to be a Date64.


Or more ambitiously but still grounded in the actual example I have in mind, I might want m to be struct<time: Date64, windspeed: float64, winddirection: dictionary, notes: string>. Or I guess there's t as well. I might also just be asking for arbitrary keys in the array where only x, y, and z (?or not z?) are strictly typed.

@paleolimbot (Contributor, Author)

Sorry it took me a while to respond here, but this helped me a lot with the wording around why multiple options were allowed (and why more may be considered in the future). A more realistic short-term perspective on this is that if you have an array that looks like struct<lon: float64, lat: float64, time: Date64, windspeed: float64, winddirection: dictionary, notes: string>, you can zero-copy cast that to a geoarrow.point<struct<lon: float64, lat: float64>>, from which it would be zero copy to create a linestring and be passed through geometry functions.
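
A rough pyarrow sketch of that zero-copy "cast" (the geoarrow.point extension type itself is omitted here; field names are illustrative): re-wrapping two children of a wider observation struct shares the underlying buffers, so no coordinate data is copied.

```python
# Sketch: selecting a point struct out of a wider observation struct.
# The buffers are shared between the two arrays; nothing is copied.
import pyarrow as pa

obs = pa.StructArray.from_arrays(
    [
        pa.array([-63.6, -63.5], pa.float64()),  # lon
        pa.array([44.6, 44.7], pa.float64()),    # lat
        pa.array([10.0, 12.5], pa.float64()),    # windspeed
    ],
    names=["lon", "lat", "windspeed"],
)

point = pa.StructArray.from_arrays(
    [obs.field("lon"), obs.field("lat")], names=["lon", "lat"]
)
```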

@bmschmidt

I said a bunch in #16 on both sides. But FWIW my gut feeling is that @paleolimbot is right and Struct <x: [...], y: [...]> is the right call for geo-arrow. The brilliant idea in opengeospatial/geoparquet#13 about getting a spatial index completely for free, just by sorting, makes me think there might be all kinds of unforeseen benefits from the larger arrow/parquet ecosystem of using a struct rather than a fixed size list. Shoving WKB conventions into Arrow, on the other hand, feels janky.

This may be opening a can of worms worth leaving closed, but I also wonder whether a benefit here is that it means it's no longer necessary to mandate any type on m in an xyzm formatted field. One of the uses I've been thinking about involves timestamps for m, which doesn't seem crazy. But I'm not versed enough in geodata to really know what m is usually used for. I just made a related comment here

@jorisvandenbossche (Contributor)

Given #16 (comment) I wasn't expecting the PR to switch to mandating storing data as structs.

It's pointing to a comment of mine, so... but indeed, I would personally argue for making this inner point representation more flexible, and allowing both. Potentially at the cost of making the spec more complex and harder to implement/support. But there are good arguments for using one or the other depending on the use case.

And I want to be clear: the reason we initially decided to go with interleaved was not driven by WKB (as far as I remember ..), but because most of the related implementations we were aware of at the time actually use interleaved for their in-memory data (cuspatial, datashader, deck.gl, GeoMesa, also GEOS).
(and I think the clear mapping to GeoJSON is actually more interesting than the relation with WKB)

@kylebarron (Member)

(and I think the clear mapping to GeoJSON is actually more interesting than the relation with WKB)

💯 I do appreciate that I can try to explain the nested list approach as being "logically" exactly the same as GeoJSON coordinates while having a dual API that's really efficient to operate on

@jorisvandenbossche (Contributor)

jorisvandenbossche commented Sep 19, 2022

But there are good arguments for using one of both depending on the use case.

Expanding on this, it seems that, generally speaking, many of the arguments for separate x/y fields are related to storage and querying (it compresses better, it integrates better with Parquet and its indexing system, it enables efficient bounding box queries, ...), while libraries often use interleaved x/y for their in-memory data for processing and visualization purposes.

And both of those broad use cases are important to consider for GeoArrow.

As I mentioned above, it certainly comes at the cost of making the specification more complex and harder to implement and support, but one option is to allow both the interleaved layout (with FixedSizeList as the inner point level) and separate x/y fields (with Struct as the inner point level). That way the user can choose the representation that gives the best trade-off for their use case (or the library can convert any incoming data in geoarrow format to the representation it uses internally).

Another option to allow this flexibility, somewhat combining both of those (and which was already mentioned on the side in the original discussion at #4 (comment)), could also be to always use a Struct as the inner point level, but to allow a field of that struct to be an interleaved xy FixedSizeList. This would allow something like Struct<xy: FixedSizeList<double>[2], z: double> (similar to FlatGeobuf, easier for consumers that want to be able to ignore z values), and would also allow an "m" field with a different data type (as raised by @bmschmidt above) in combination with interleaved xy. And it can then of course also allow Struct<x: double, y: double> for separate fields.

For the simple case of 2D interleaved xy data, this Struct<xy: FixedSizeList<double>[2]> (instead of just FixedSizeList<double>[2]) would be one additional level in the (logical) structure. But memory-wise for the actual buffers, this makes no difference. Both representations have the same buffers, with the one exception that the Struct can have another null bitmap at the struct level (for all fields together). Although in practice, we said in the current specification that only the outermost level is allowed to have nulls (and thus in theory those bitmaps for inner levels don't need to be allocated).
One additional benefit of using a Struct with one named field ("xy" or "xyz") for those cases is that it would also solve the potential issue with names of lists not being preserved (second bullet point in @paleolimbot's top post), while names of struct fields can certainly be relied upon.
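
For illustration, that combined layout could be spelled in pyarrow as follows (a sketch, not part of the specification):

```python
# Sketch of the combined layout suggested above: interleaved xy plus a
# separate z field, expressed as a pyarrow type (illustrative only).
import pyarrow as pa

xy_plus_z = pa.struct([
    pa.field("xy", pa.list_(pa.float64(), 2)),  # FixedSizeList<double>[2]
    pa.field("z", pa.float64()),
])
print(xy_plus_z)  # struct<xy: fixed_size_list<item: double>[2], z: double>
```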

@bmschmidt

Just to play devil's advocate:

  1. I still get confused every once in a while by whether geojson is ['lat', 'long'] or ['long', 'lat'], so I'm happy to lose that.
  2. One of the great things about Arrow is that you can be confident that operations are going to be zero-copy, especially on memmapped data. Isn't there a significant cost to a dual specification where some operations may or may not trigger allocating huge new buffers depending on the upstream data source?

@paleolimbot (Contributor, Author)

Lots of good stuff here, thanks!

First, I hope this PR comes across as convivial...it is! I've implemented this both ways in R, both ways in C++, and for a brief time I implemented a version that let you do both (R and C++, "do" meaning create and export Arrays from WKB and WKT plus a few kernels). I totally can and totally will do it all again! The real gem here is Arrow and it's important not to lose sight of that.

Second, on the topic of "why not allow both". I did this in the R package and the prototype C++ library and it's horrible to write generic code for. You have a distinct type/operation for every geometry type, for every inner point representation, and possibly for every dimension combination depending on what you're up to. Templating helps but you still end up with a god-awful factory file listing all the combinations if you want to do anything generic (in this case, I was writing WKT and WKB readers and writers). There are some operations that are zero copy with the Struct representation; there are some operations that are zero copy with the FixedSizeList representation. If both are treated equally in this specification, there will be a lot of copying resulting from those two competing representations. We need to pick one and roll (one can and should provide fast conversions to other representations, just not declare those types as a "geoarrow.whatever").

Third, on the representation of points in other libraries. There are a lot of libraries that use xyxyxyxyxy; however, I'm not convinced that we get more zero-copy operations out of the box using one or the other. For every deck.gl that can zero-copy xyxyxyxy (as long as there's no z or m and as long as you don't want to subset your data beforehand and as long as you don't need to do any coordinate transforms before that and as long as...), there's an R graphics device that can zero-copy xxxxxx yyyyyy. In R, points are often represented as columns in a data frame or a 2-column matrix, both of which can zero-copy to an Arrow struct; with WKB you can memcpy xyxyxyxy. With GEOS/S2, there is no zero copy anything (although there could be if there's a spec and they implement it, which is more likely if we have only one point representation). At least according to its README, cuspatial previously represented its geometries as "Structure of Array", so I wonder if the current representation is a recent change or one that masks some copy that happens immediately after passing the array over to cuspatial. Even if a library does represent points zero copy with a FixedSizeList, it's only zero copy if there's no transformation beforehand (e.g., CRS transform) and if the dimensions match (e.g., plotting a POINT Z in plan view).

Fourth, @bmschmidt in particular highlighted that a Struct gives us more flexibility. We can name our fields "longitude" and "latitude"; "northing" and "easting"; we can include a bool field to properly represent EMPTY points if we want! There's no need to represent a timestamp as a double if we don't want there to be (although for compatibility with literally anything else I think they should really all be doubles...in @bmschmidt's example of a Date64, though, it would be a cheap cast from Struct<x, y, actual_date> to GeoArrowPointStruct<x, y, m> with the proper CRS metadata defining how the time dimension is encoded as a double). With a FixedSizeList we get one common data type and a child name that might get changed or ignored in a future version of Arrow (which mostly just means we have to lean harder on extension metadata, which would probably be fine).

Fifth, it's just plain easier to write geometry operations with a Struct representation. This applies to high-level runtimes like Python, where you can use Arrow compute kernels or zero copy numpy magic to do min() or max() or affine transforms. This also applies to low-level runtimes, where it's (slightly) easier to write code that ignores the z and m dimensions (e.g., like 90% of the GEOS C API). The crux of that is that with a fixed size list you need a coord_size to index into the coordinates array, whereas the other way you can just pass double* x, double* y regardless of what the dimension of your geometry is...not the end of the world or anything, just easier. (Both are easier than writing two versions of that function for a Struct and a FixedSizeList).
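
As a sketch of that "use Arrow compute directly" point: with the struct layout, a bounding box is just min_max on each child (plain pyarrow, not a geoarrow API; the data is illustrative):

```python
# Sketch: a bounding box from a struct-encoded point array using stock
# Arrow compute kernels; no custom kernel and no coord_size arithmetic.
import pyarrow as pa
import pyarrow.compute as pc

points = pa.StructArray.from_arrays(
    [pa.array([0.0, 5.0, 2.0]), pa.array([1.0, -3.0, 4.0])],
    names=["x", "y"],
)
x_range = pc.min_max(points.field("x"))  # {'min': 0.0, 'max': 5.0}
y_range = pc.min_max(points.field("y"))  # {'min': -3.0, 'max': 4.0}
```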

Sixth, the embedded column statistics in Parquet thing is huge. Arrow is in-memory but the tools around it are designed to scale to bigger than memory and the geospatial ecosystem (in my biased opinion) could really use them! Pushing a potential intersection filter into a parquet read operation is a big deal and I think it's fine to lean into that.

@thomcom

thomcom commented Sep 20, 2022

There's tons of great material already in this thread. I can contribute one of the GPU developers' motivations for originally designing the specification as xyxyxy instead of xxx, yyy, which is cache coherency. In order to squeeze the maximum throughput out of many parallel GPUs, keeping the data adjacent to one another reduces cache misses considerably. We considered that the majority of implementations would be focused on 2d.

Storing the interleaved coordinates also improves coalesced memory access. Kernel threads that use x,y pairs together will always benefit from their adjacency, whereas if they are separated by long blocks of other xs and ys, performance will take a hit.

I don't have any on-hand benchmark numbers to demonstrate these design decisions, but they are why we suggested interleaved coordinates.

@thomcom

thomcom commented Sep 20, 2022

Oh, @harrism also explains that we wanted vectorized memory access using xy interleaving: https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/

@harrism

harrism commented Sep 21, 2022

(i.e., before any implementations actually exist)

cuSpatial's implementation actually exists, right @thomcom?

@mdsumner

mdsumner commented Sep 21, 2022

I definitely agree with columnar xxxxx, yyyyy - never seen any compelling reason to align to rgl's arrangement xyzxyzxyz or xyzhxyzh ... in years of using it 🙏 - everything I did there was column oriented until the time came to materialize mesh3d or vis (and, so much activity is about modifying verts)

@paleolimbot (Contributor, Author)

To avoid more arm-waving on my part, I wrote a post with some downloadable example data in the two forms (https://dewey.dunnington.ca/post/2022/profiling-point-representations-in-geoarrow/). I think there are enough concrete benefits there to consider at the very least adding support for the Struct encoding as an option. Notably, with a Struct encoding:

  • We get out-of-the box support for Parquet column statistics
  • It's faster to read points (1.5x in my example), although less dramatic with more levels of nesting (1.1x faster for a multilinestring)
  • We can use Arrow's compute engine for general coordinate computations (min/max and affine transform are probably the most useful here). Arrow's min/max kernel is twice as fast as a pure C version with no virtual method calls.

The "hard to support both" claim is just me arm-waving at this point...if and when I have a concrete example of this being difficult is the right time to bring that up. The previous example of my templating may just have been bad library design (I plan on trying again soon).

cuSpatial's implementation actually exists

Is there a good way for me to use cuSpatial to avoid any more ignorance on my part? I'm excited for GPU geometry but don't have the hardware to test it!

I don't have any on-hand benchmark numbers to demonstrate these design decisions

This would help me and others understand the motivation and would make great documentation so that people like me don't try to pull something like this again.

@thomcom

thomcom commented Sep 21, 2022

We do have some cloud solutions you can play with cuspatial on. Let me call on @taureandyernv to explain the latest state-of-the-art cloud solution. As for benchmarking, I'll talk to @harrism and @isVoid about what we can produce.

@dmitrykoval

dmitrykoval commented Sep 22, 2022

Just a couple more points in favour of the non-interleaved representation. I'm currently prototyping the Geography type extension for DuckDB, using S2 as a geometry library. From this perspective, storing coordinates as xxx, yyy makes much more sense and would make integration with GeoArrow much easier.

Here's the DuckDB representation proposal of the Geography data type. It maps naturally to DuckDB Struct/List vector types, which are stored as columns internally.
Reading WKT is not really complex given this non-interleaved representation; for example, see the WKT Reader.
For instantiating S2 in-memory objects, I would think it would not make much of a difference whether the input is interleaved or non-interleaved, as long as both approaches have the same cache access pattern.

From the storage and serialization/deserialization perspective, the data that exhibits spatial locality (like probe data, road network, admin boundaries) would benefit from non-interleaved xxx, yyy storage due to much better compression. I don't have a benchmark handy, but it's not hard to quickly write one.

Finally, @paleolimbot's point about Parquet storage and making use of statistics makes perfect sense: when coordinates are stored separately, scan filters may be used for pruning with range queries like (lat >= a and lat < b), relying on column stats.
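
A rough sketch of what that pruning could look like from Python (pyarrow's dataset API; the file name and the geometry/x/y field names are hypothetical, and nested-field filters need a reasonably recent pyarrow):

```python
# Sketch: bounding-box predicate pushdown on struct-encoded coordinates.
# "points.parquet" and the geometry/x/y field names are hypothetical.
import pyarrow.dataset as ds

dataset = ds.dataset("points.parquet", format="parquet")
bbox = (
    (ds.field("geometry", "x") >= -64.0) & (ds.field("geometry", "x") <= -60.0)
    & (ds.field("geometry", "y") >= 44.0) & (ds.field("geometry", "y") <= 47.0)
)
# Row groups whose x/y column statistics fall entirely outside the box
# can be skipped without being read or decompressed.
table = dataset.to_table(filter=bbox)
```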

@paleolimbot (Contributor, Author)

Compression has been mentioned a few times, and it's a very testable one!

```r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(purrr)
library(ggplot2)

options <- expand.grid(
  type = c("points", "multilines"),
  storage = c("xxyyzz", "xyzxyz"),
  file_format = c("parquet", "arrow"),
  compression = c("uncompressed", "snappy", "gzip", "bz2", "lz4", "zstd", "brotli"),
  stringsAsFactors = FALSE
)

tables <- pmap(unique(options[c("type", "storage")]), ~{
  file_in <- glue::glue("~/Downloads/{..1}-{..2}.parquet")
  read_parquet(file_in, as_data_frame = FALSE)
})
names(tables) <- do.call(paste, unique(options[c("type", "storage")]))

compressed_size <- function(type, storage, file_format, compression) {
  table <- tables[[paste(type, storage)]]
  tf <- tempfile()
  on.exit(unlink(tf))
  
  tryCatch({
    if (file_format == "parquet") {
      write_parquet(table, tf, compression = compression)
    } else {
      write_feather(table, tf, compression = compression)
    }
    
    as.integer(file.size(tf))
  }, error = function(...) NA_integer_)
}

options$file_size <- pmap_int(options, compressed_size)

options <- options[is.finite(options$file_size), ]

ggplot(options, aes(x = type, y = file_size / 2 ^ 20, fill = storage)) +
  geom_col(position = "dodge") +
  facet_grid(compression ~ file_format) +
  labs(y = "file size (MB)")

While the struct representation saves a few MB here and there, I don't think compression is an important consideration in this particular decision. (Or perhaps the dataset I'm using is not a good one to illustrate the advantage).

@bmschmidt

I benchmarked compression in the other issue and found, to my surprise:

At low compression levels (snappy and zstd compression level 1) struct-formatted points take up marginally less space, but at high levels of compression interleaved points compress down better. Not enough to be a major issue.

I'd guess perhaps there is some mutual information between x and y in most datasets that interleaved reps can leverage?

@kylebarron (Member)

kylebarron commented Sep 22, 2022

I don't think compression is an important consideration in this particular decision

Quick thought on this: I wonder if the choice of encoding is really important here. I assume that the default encoding in the R API is plain, which doesn't do any pre-encoding before compression. I want to believe that using DELTA_BINARY_PACKED on sorted geometries would actually improve compression in the struct case vs interleaved. Delta encoding write support is not implemented in Arrow C++, but it is in Rust Arrow2. Delta encoding doesn't work on floats so this is totally theoretical for fixed precision geometries (just trying to figure out if separated could provide significantly better compression in some instances). I'll try to get a minimal example to test with

@paleolimbot (Contributor, Author)

I'm wondering how I should proceed here: I could update this PR to basically include two copies of the spec (one for fixed size list, one for struct) and note that we're investigating the performance tradeoffs? I could leave this PR open? I'd like to start in on a C/C++ implementation in earnest in a few weeks based on nanoarrow and it would be nice to have this discussion resolved (although I could also implement both again...I'm not scared!).

@thomcom Has there been any movement on benchmarks? It seems like better performance on GPUs is one of the remaining arguments for keeping the fixed size list and I don't have the ability to help with that.

@kylebarron (Member)

include two copies of the spec (one for fixed size list, one for struct) and note that we're investigating the performance tradeoffs

This would be my vote. Make it clear that both versions are "incubating"/"in testing", that we don't yet know which to choose, but that the end specification is expected to have only one to maximize ease of interoperability.

@apps4uco

Thanks for considering my input on the thread.

I want to discuss the comments above:

There's tons of great material already in this thread. I can contribute one of the GPU developers' motivations for originally designing the specification as xyxyxy instead of xxx, yyy, which is cache coherency. In order to squeeze the maximum throughput out of many parallel GPUs, keeping the data adjacent to one another reduces cache misses considerably. We considered that the majority of implementations would be focused on 2d.

Storing the interleaved coordinates also improves coalesced memory access. Kernel threads that use x,y pairs together will always benefit from their adjacency, whereas if they are separated by long blocks of other x and ys performance will take a hit.

I don't have any on-hand benchmark numbers to demonstrate these design decisions, but they are why we suggested interleaved coordinates.

I'm not an expert in GPUs, however I believe that the SoA approach (x[], y[], z[], m[], etc.) should not be a problem for the GPU, because as I understand it the memory access should be coalesced, as each thread in the warp will be reading x[i], y[i], z[i].

Quoting Robert Crovella (https://stackoverflow.com/users/1695960/robert-crovella) from https://stackoverflow.com/questions/37920862/how-many-simultaneous-read-instructions-per-thread-on-modern-gpu:

"A thread can issue multiple reads (on separate instructions in the instruction stream). Multiple reads from a single thread can be in flight. A read request by itself does not stall the thread. Only when the results of that read are required for a future operation will it stall (if the read data is not available.) The compiler is aware of this and will attempt to schedule reads before the data are actually required. In such scenarios, multiple reads (from a single thread or warp) can be in flight."

There are multiple memory controllers on a modern GPU and I believe that each can be processing a memory request concurrently. So I believe there should be no or very little performance penalty for SoA vs AoS in this case.

On the other hand, for the AoS pattern xyzxyzxyz on a CPU it is much more complicated to do efficient SIMD, as extra instructions are needed to deinterleave the data (to separate all the x into one SIMD register, all the y into another SIMD register, etc.), do the computation, and then interleave them again to write the result. The extra instructions to deinterleave and interleave complicate writing the program (they also differ depending on the SIMD architecture) and decrease performance.

I'd like to propose that we do some simple benchmarks: maybe something like taking raw float arrays as input, calculating r = sqrt(x*x + y*y) or even just x + y on the CPU and GPU, and seeing what the results are.

Also, with regard to compression, it depends on the CRS: if the data is lat and lon then each delta will also be a float and difficult to compress; however, if the data is in a planar coordinate system, then each x and y usually represents meters and it may be possible to represent them as integers and get much better delta encoding.

Again, I'm not an expert (just adding my 2 cents). Please let me know if I'm mistaken in my analysis.
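
A tiny sketch of that scaled-integer idea (the metre units and the millimetre-precision scale factor are assumptions for illustration):

```python
# Sketch: quantizing projected coordinates to integers before writing.
# The 1000x scale (millimetre precision) is an assumed, lossy choice.
import numpy as np

x = np.array([648686.0197, 648626.0187, 648586.0197])  # metres
scale = 1000  # keep millimetre precision
x_int = np.round(x * scale).astype(np.int64)

print(np.diff(x_int))   # small integer deltas, which delta encodings handle well
x_back = x_int / scale  # lossy only below the chosen precision
```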

@harrism

harrism commented Sep 28, 2022

@thomcom Has there been any movement on benchmarks? It seems like better performance on GPUs is one of the remaining arguments for keeping the fixed size list and I don't have the ability to help with that.

I'm trying to find an existing benchmark so we don't have to reinvent the wheel. Watch this space.

I'm not an expert in GPUs, however I believe that the SoA approach (x[], y[], z[], m[], etc.) should not be a problem for the GPU, because as I understand it the memory access should be coalesced, as each thread in the warp will be reading x[i], y[i], z[i].

You are correct; however, the ideal for GPUs, to help saturate memory bandwidth, is actually sort of "SoAoS", where the final S is "small vector structures". In other words, each thread reads not a single float, but a float2 or float4 vector. (Or, put differently, each warp reads not 128 bytes but 256 or 512 bytes.)

Note we can easily cast an array of interleaved xyxyxy to an array of float2 structs and achieve vectorized loading, but pure columnar SoA data requires a separate load for x and y.

See https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/

That said, we are not talking about 2x performance improvement here, more like 10-20% for bandwidth-bound kernels. Compute bound kernels are not likely to experience any difference (but real applications that are not bandwidth bound are rare in the wild).

taking raw float arrays as input, calculating r = sqrt(x*x + y*y) or even just x + y on the CPU and GPU, and seeing what the results are

I suggest not the sqrt one as that will be more compute bound. What I would test is a simple kernel where the input and output are both x-y coordinates, and the computation is trivial, e.g. scale coordinates by a parameter. One version loads from separate x and y arrays, one loads from a single interleaved array.

@mdsumner

mdsumner commented Sep 29, 2022

just one minor comment from me:

however if the data is in a planar coordinate system, then each x y usually represents meters

You can always encode coordinates as integers by scaling; it doesn't matter what the units are (they might be mm ...). It depends on the domain and the extent they represent, and what and where counts as "precise" or "accurate" is entirely independent of the CRS; every coordinate system has regions where its particular properties can't hold.

(Loving the discussion, thanks!)

@paleolimbot (Contributor, Author)

I updated the text of this PR to support both! This topic can and should get more discussion outside the realm of this one PR.

A notable thing for me that came out of this discussion is that Struct<double, double, ...> is by no means the limit to what people might want to do with Arrow to optimize for various use-cases (e.g., Struct<int, int, ...>, Struct<double, double, Time64>, FixedSizeList<float>). I added some text to allow for that eventuality as well...it may be that supporting multiple point types is really just not that big of a deal for adoption.

@paleolimbot (Contributor, Author)

Just a note that https://github.com/geoarrow/geoarrow-cpp and its Python bindings ( https://github.com/geoarrow/geoarrow-cpp/tree/main/python ) have progressed enough that you should be able to get just about any dataset into either coordinate representation discussed here. An example:

```python
# python -m pip install "https://github.com/geoarrow/geoarrow-cpp/archive/refs/heads/main.zip#egg=geoarrow&subdirectory=python"
import geoarrow.pyarrow as ga
import geopandas

url = "https://github.com/paleolimbot/geoarrow-data/releases/download/v0.0.1/nshn_basin_line.gpkg"
df = geopandas.read_file(url)
wkb_array = ga.array(df.geometry)

type = ga.multilinestring().with_dimensions(ga.Dimensions.XYZ)
type_interleaved = type.with_coord_type(ga.CoordType.INTERLEAVED)

wkb_array[:2].as_geoarrow(type).geobuffers()
[None,
 array([0, 1, 2], dtype=int32),
 array([   0,  405, 1095], dtype=int32),
 array([648686.0197, 648626.0187, 648586.0197, ..., 665186.02  ,
        665166.02  , 665166.02  ]),
 array([5099181.9841, 5099181.9841, 5099161.9831, ..., 5138601.9825,
        5138601.9825, 5138641.9825]),
 array([0., 0., 0., ..., 0., 0., 0.])]
wkb_array[:2].as_geoarrow(type_interleaved).geobuffers()
[None,
 array([0, 1, 2], dtype=int32),
 array([   0,  405, 1095], dtype=int32),
 array([ 648686.0197, 5099181.9841,       0.    , ...,  665166.02  ,
        5138641.9825,       0.    ])]
```

@paleolimbot (Contributor, Author)

@thomcom I recently figured out you can get a GPU + cuSpatial up and running on Google Colab and couldn't resist taking it for a spin. Related to the point representation question, it seems that cuDF doesn't have a structure for a fixed-size list? Does reading Parquet directly into GPU memory via cudf.read_parquet() allocate an extra offsets buffer in this case or does it use some kind of memory mapping to avoid that memory overhead?

I see that the custom shapefile reader was deprecated + removed although it is/was useful for demonstrating cuSpatial utility internally: the geopandas read + copy to GPU operation is slow despite cuSpatial's computation being very fast. I imagine GeoParquet + GeoArrow encoding will be useful here and I wonder if it would save any trouble if the encoding used cuDF native types?

@harrism

harrism commented Apr 18, 2023

Regarding the shapefile reader: it was very limited (single polygons only, no multipolygons, no points, no linestrings, etc.) and depended on GDAL. This dependency was making it hard for us to support installation from pip, so we decided to remove it for now. We do have tentative plans for providing GPU-accelerated Geo I/O with an effort starting in the near future (23.08 release most likely). We think that this will be a much better way forward.

I'll let @thomcom and / or @trxcllnt answer regarding fixed-size lists in cuDF.

@thomcom

thomcom commented Apr 18, 2023

Hey @paleolimbot ! You are correct that cudf doesn't have a fixed size list implementation. IIRC we worked around it by using an extra level of list for each of the types: https://github.com/rapidsai/cuspatial/blob/branch-23.06/python/cuspatial/cuspatial/io/pygeoarrow.py . My presumption is that with read_parquet we indeed allocate another list offsets buffer for any fixed-size list.

I'm quite eager to integrate with GeoParquet + GeoArrow, so I've been watching your work. :) Agreed that the GeoPandas integration is very slow; we're planning on starting dedicated I/O work soon.

@paleolimbot (Contributor, Author)

I'm hoping to get some big example files up soon! I was massively impressed by the cudf read parquet + cuSpatial quadtree point-in-polygon join (I got it up to 130 million points before running out of memory).

@Maxxen

Maxxen commented Apr 18, 2023

I'm hoping to get some big example files up soon! I was massively impressed by the cudf read parquet + cuSpatial quadtree point-in-polygon join (I got it up to 130 million points before running out of memory).

Tangentially related, but are there some sort of standardised spatial dataset/benchmark queries? Or is this something that would be interesting to draft together? It would be great to have something like a TPC-H/S schema that could be deterministically generated for arbitrary scales where the spatial columns have properties (point density, avg line length, polygon hole-to-shell ratio) similar to those found in real/common datasets (osm, tiger, copernicus).

@paleolimbot (Contributor, Author)

It would be super nice if that existed (or maybe it does and I don't know about it). I've collected a few things with appropriate licenses here: https://github.com/paleolimbot/geoarrow-data and would like to put some things here: https://github.com/geoarrow/geoarrow-data .

The 130 million points I mentioned were all the buildings in the Microsoft US buildings dataset, and the join I mentioned was against a shapefile of all U.S. zipcodes. Nationwide road segments and rivers might be nice as well (with a single small state...maybe Vermont...to use as a small-scale trial).

@thomcom

thomcom commented Apr 19, 2023

We've done a lot of work on spatial benchmarks and would be happy to collaborate on designing a schema for them.

@kylebarron (Member)

If we want to add support for struct arrays in addition to fixed size list arrays, should they have a different Arrow extension name? I suppose you'd be able to inspect the child types to tell which encoding the array is using, but it might be easier if it's in the extension name?

@paleolimbot (Contributor, Author)

I could go either way on that: putting more information in the extension name is not really a problem but you also have to inspect the structure anyway to make sure the storage type is correct. A case that's come up where extra information in the extension name might help is for XYM geometries since not all Arrow implementations are able to create field names for a list.

In case you're curious what "inspect the child types" looks like in C, the current geoarrow-c implementation for both interleaved and struct-based schemas is here: https://github.com/geoarrow/geoarrow-cpp/blob/main/src/geoarrow/schema_view.c#L9 and tests with all the fun ways you can specify invalid schemas are here: https://github.com/geoarrow/geoarrow-cpp/blob/main/src/geoarrow/schema_view_test.cc#L159 .
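
As a rough pyarrow-level sketch of that same structural check (the function name is illustrative and extension-type registration is omitted):

```python
# Sketch: distinguishing the two coordinate encodings from the storage type.
import pyarrow as pa

def coord_encoding(storage_type: pa.DataType) -> str:
    if pa.types.is_struct(storage_type):
        return "separated"    # Struct<x: double, y: double, ...>
    if pa.types.is_fixed_size_list(storage_type):
        return "interleaved"  # FixedSizeList<double>[n_dim]
    raise ValueError(f"not a point storage type: {storage_type}")

print(coord_encoding(pa.struct([("x", pa.float64()), ("y", pa.float64())])))
print(coord_encoding(pa.list_(pa.float64(), 2)))
```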

@paleolimbot (Contributor, Author)

Ok! A bunch of data is available in all the geoarrows (wkb, interleaved, struct): https://github.com/geoarrow/geoarrow-data .

Feel free to open issues on that repo to add other data sets! (For example, whatever data set cuspatial uses or generates would be welcome!)

@jorisvandenbossche (Contributor)

Thanks a lot to everybody involved here for the interesting discussion and the many arguments, examples, and benchmarks brought up!

This has been moving slowly (and apologies on my part for not helping a lot in moving it forward lately), but let's try to change that now! I think there is clear consensus that the struct array representation has several advantages and that we want to at least enable this representation. Based on my reading, I think there is also general consensus that it is acceptable to (at least short term) do this by allowing both representations (and the PR has already been updated to reflect that). That way, we enable using the struct representation, while still keeping the interleaved representation to support those cases where it is already being used (e.g. to support passing data zero-copy using geoarrow between libraries that both use interleaved).
It does make the specification more complex (although I think it is perfectly fine that libraries that implement compute only support one of the two representations, as long as they can convert upon consumption), but I think that is unavoidable given that both representations are already widely used and each has its advantages, if we want to allow interop and make moving data between libraries as cheap as possible for most cases.

So the concrete proposal is to get the current PR merged with the proposal to describe both interleaved and struct as options for the representation of coordinates.
There is the remaining question raised by @kylebarron above at #26 (comment) about how to name the extension types (same name with different storage type vs different names). But I would propose to discuss that further in another issue/PR, and focus this PR on just describing the memory layout.

@jorisvandenbossche (Contributor) left a comment

Some small comments. Thanks for writing this up @paleolimbot!

format.md Outdated
@@ -103,6 +124,17 @@ is a list of xy vertices. The child name of the outer list should be "polygons";
the child name of the middle list should be "rings"; the child name of the
inner list should be "vertices".

**Well-known-binary (WKB)**: `Binary` or `LargeBinary` or `FixedSizeBinary`

Let's leave the addition of WKB (or WKT, or geojson?) for another PR, to focus here on the struct representation?

format.md Outdated
@@ -217,25 +265,25 @@ Logical representation:

```
[
[[(0, 0), (0, 1), (0, 2)]],
[[{x: 0, y: 0}, {x: 0, y: 1}, {x: 0, y: 2}]],

@paleolimbot would you be OK with keeping the examples as is here? I personally find the current version a bit easier to read, and the example is mostly to show the additional nesting, regardless of the exact inner representation. We could add a note that for the struct version, all buffers are the same, except the coordinates buffer would be two buffers for x and y.

@paleolimbot (Contributor, Author)

I reverted the examples! I did add enough clarification so as not to make them confusing (hopefully).

@jorisvandenbossche jorisvandenbossche changed the title Proposal: store points/coordinates as struct arrays Allow storing points/coordinates as struct arrays in addition to interleaved layout Jul 13, 2023
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@paleolimbot (Contributor, Author)

Thank you everybody for the lively debate! If there aren't any objections to this proposal as worded in the latest version, I'll merge tomorrow!
