Point representations in a nested list memory layout #16

Closed
paleolimbot opened this issue Feb 20, 2022 · 13 comments


@paleolimbot
Contributor

paleolimbot commented Feb 20, 2022

In #12 one of the open questions is whether or not we need the flexibility to store points in more than one way. As currently written, points are stored as a fixed_size_list<double> whose fixed size is the number of dimensions (e.g., 2, 3, or 4), although another item in the "open questions" is whether or not to store Z in a separate buffer (I think FlatGeoBuf does something like this).

In my mind there are a few useful storage types for points:

  • fixed_size_list<double, number_of_dimensions>: This is nice because we can memcpy() from WKB and because we can do some common transforms (e.g., affine transforming an array of geospatial coordinates to coordinates in a graphics device) using matrix multiplication (because it's exactly the same buffer as a row-major coordinate matrix). Because it's exactly like WKB, we can also defer to that specification for the reason why we did it that way.
  • struct(x <double>, y <double>, [z, m, etc.]): This is more native to how geometries are stored in some places (e.g., cuSpatial at one point, and most places in R, because our matrix objects are column-major) and has the nice property that most of the existing kernels in the Arrow compute engine operate on columns, so you could do something like (x > 5) & (y > 7) and get a lot of existing machinery for free. It's also arguably more "arrow-like" to give each dimension its own column, and it has the nice property that you can add or drop dimensions while efficiently re-using buffers. Some initial benchmarking suggests that it's also about 50% faster to read points from a Parquet file in this form. (Both this layout and the previous one are sketched in code after this list.)
  • Both of the above but with a storage type of float or decimal128. There are times when double precision is excessive and a float takes half the space; similarly, double precision arithmetic can lead to compounding errors when chaining certain operations and few formats offer the flexibility to store a fixed precision decimal (GEOS has a "precision model" and some internals that can handle exact decimal math but still uses double for input and output; CGAL has an exact arbitrary precision float type for coordinates; S2 uses openssl's BigNum to back its exact decimal math but uses double for input/output).
  • S2 or H3 cell identifiers (stored as a uint64_t)...useful because testing for containment is very fast for global (longitude, latitude) data. It also takes up half the size of the equivalent longitude/latitude pair (if stored as doubles).
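
For concreteness, here's a minimal pyarrow sketch of the first two layouts (a sketch only; the values and variable names are illustrative):

```python
import pyarrow as pa

# Interleaved buffer x0, y0, x1, y1, ... wrapped as fixed_size_list<double, 2>
coords = pa.array([30.0, 10.0, 40.0, 20.0], type=pa.float64())
interleaved = pa.FixedSizeListArray.from_arrays(coords, 2)

# One child array per dimension wrapped as struct<x: double, y: double>
xs = pa.array([30.0, 40.0], type=pa.float64())
ys = pa.array([10.0, 20.0], type=pa.float64())
separate = pa.StructArray.from_arrays([xs, ys], names=["x", "y"])
```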

Supporting more than one representation adds some complexity to the implementation because we would basically need to template any "GeoArray" class on the point representation. That said, as long as a point representation can be exported as doubles in xy, xyz, xym, or xyzm space, the templating means we can support multiple point representations without virtual method call overhead to do the conversion. Any reader implementation should be checking the storage types before accessing buffers anyway, so it's my opinion that leaving the option open for more than one storage type is worth it.

@bmschmidt

bmschmidt commented Mar 18, 2022

I've been watching this discussion closely without much to offer. Just wanted to chime in with a quick experiment I ran about the compressibility of these formats when writing to Parquet, since I'm especially interested in this format for the web and I always find the behavior of Parquet compression hard to intuit. The data here is about 400K distinct lat-long points for IP addresses from here.

The results are a little surprising to me, but I don't think they have strong implications. (I'm rooting for struct(x <double>, y <double>, [z, m, etc.]) for some of the reasons described above, especially easy filtering to an x-y bounding box using non-geo-specific functions and sane extension to other point types.)

At low compression levels (snappy and zstd compression level 1) struct-formatted points take up marginally less space, but at high levels of compression interleaved points compress down better. Not enough to be a major issue.

Parquet file size for 400K lat-long points, x/y columns ("separate") vs. FixedSizeList ("interleaved"), relative to the uncompressed interleaved file:

| Compression (level) | Interleaved | Separate |
| ------------------- | ----------- | -------- |
| uncompressed        | 1           | 0.9709   |
| snappy              | 0.815973    | 0.799348 |
| gzip (1)            | 0.636211    | 0.638659 |
| gzip (5)            | 0.560422    | 0.562666 |
| gzip (9)            | 0.555146    | 0.561872 |
| zstd (1)            | 0.857768    | 0.811873 |
| zstd (5)            | 0.512944    | 0.530301 |
| zstd (9)            | 0.507401    | 0.526024 |

ipynb creation notebook here.
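
For anyone who wants to rerun this kind of comparison, a rough pyarrow sketch (the random data, file paths, and codec list here are illustrative, not the notebook's actual setup):

```python
import os

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

rng = np.random.default_rng(0)
xy = rng.uniform(-180, 180, size=2 * 400_000)

# 'interleaved': one fixed_size_list<double, 2> column
interleaved = pa.table({"point": pa.FixedSizeListArray.from_arrays(pa.array(xy), 2)})
# 'separate': plain x and y columns
separate = pa.table({"x": pa.array(xy[0::2]), "y": pa.array(xy[1::2])})

for codec, level in [("none", None), ("snappy", None), ("gzip", 9), ("zstd", 9)]:
    for name, table in [("interleaved", interleaved), ("separate", separate)]:
        path = f"points_{name}_{codec}.parquet"
        pq.write_table(table, path, compression=codec, compression_level=level)
        print(f"{name:12s} {codec:7s} {os.path.getsize(path):>10d} bytes")
```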

@paleolimbot
Contributor Author

Thank you for this! I'm just getting to the "building point arrays" bit and am having a look at this again. An experiment like the one you've done here is a good example of what I've found when working with these elsewhere: basically, we get pretty comparable performance and file sizes whether we do interleaved or separate.

A few observations from implementing builders for these arrays:

When looping over arrays, the two things that take the most time are, I think, function call overhead (the loops are pretty tight) and memory allocation (or reallocation, since the builders are implemented as growable vectors). When we do 'interleaved', we have one buffer holding all the values; when we have 'separate', we have up to 4 buffers holding the values, which means we spend up to 4 times as much time reallocating as we would for the 'interleaved' option.

Similarly, we can, much of the time, accomplish whatever we're doing with the points with a single function call in the 'interleaved' case. In the 'separate' case we might need up to four function calls. I think this point is pretty minor because in most cases those function calls will be to inline functions that the compiler will probably optimize out.

I don't think I happen to be doing anything where I'd notice an optimization at the 'data locality' level, but I think the 'interleaved' option is better for this: pretty much every coordinate operation needs all the coordinates at the same time. If we're doing something like matrix multiplication on coordinates (affine transform) then it's almost certainly better to keep coordinate values together.

I'm new to optimizing at this level so those observations might be misguided!

That said, I think we do need to provide an easy way into and out of this format for the reasons you mentioned above (using non-geo specific functions, sane extension to other points). While copying isn't awesome, I've basically been unable to find an example where translating between 'interleaved' and 'separate' takes a measurable amount of time (on the order of 0.2 s for 29 million points).
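
To give a sense of why it's cheap: the translation is essentially a strided copy. A sketch with numpy/pyarrow (the helper names are made up, and it assumes xy coordinates with no nulls):

```python
import numpy as np
import pyarrow as pa


def interleaved_to_separate(points):
    """fixed_size_list<double, 2> -> struct<x, y> (hypothetical helper)."""
    flat = np.asarray(points.values)  # flat buffer: x0, y0, x1, y1, ...
    return pa.StructArray.from_arrays(
        [pa.array(flat[0::2]), pa.array(flat[1::2])], names=["x", "y"]
    )


def separate_to_interleaved(points):
    """struct<x, y> -> fixed_size_list<double, 2> (hypothetical helper)."""
    x = np.asarray(points.field("x"))
    y = np.asarray(points.field("y"))
    return pa.FixedSizeListArray.from_arrays(
        pa.array(np.column_stack([x, y]).ravel()), 2
    )
```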

All that is to say that the existing proposal of points as a fixed-size list of doubles is, I think, the best option. We will see the most benefit from having points represented exactly one way when passed via the C Data interface or written to files. Translating between point representations is cheap and can be done in almost every runtime before creating a 'geoarrow' object or after reading it.

@bmschmidt

I was worried about javascript performance on transformation, but a quick check indicates that a for-loop that interleaves record batches one at a time performs pretty well--10ms for 400K records, and about 120ms total when redoing the operation about 100 times. That bounds the operation on my laptop between 120ms and 800ms for 30m points, depending how much reusing the same buffers over and over again is cheating.

The other important web-applications thing is pushing these to GPU buffers. Interleaved has pretty clear advantages on WebGL, because it only uses up one vertex attribute set instead of two. WebGL is kind of irrelevant because usually you're pushing floats, not doubles; and God willing it will die in the next few years. But I'd imagine similar concerns will apply to WebGPU as well.

So yeah, I guess fixed-size list of doubles is probably better than separate x and y.

@paleolimbot
Contributor Author

First...it's so cool that you can use Arrow in Observable!

Is WebGPU/WebGL the primary use-case for separate x and y vectors? Is there a function reference somewhere that I could read to see where you were hoping to pass these arrays?

@bmschmidt

Yeah arrow-on-observable (and more recently (Parquet->Arrow)-on-observable) is really great.

Separate X and Y vectors are bad on WebGL because you need to attach two separate buffers, rather than passing just one and destructuring each pair into [x, y] on the GPU.

I have trouble thinking of a really strong case where separating [x, y] is truly helpful for programming/performance. I think my aversion to 'fixed sized list' is primarily aesthetic--latitude and longitude (or x and y in another CRS) are things that have names, so my gut says they should have names in the arrow layout. It just feels wonky to have to pluck out the first element in a list to get longitude. And in a struct, the slightly irritating confusion over whether the order is [lat, long] or [long, lat] doesn't exist. But these aren't especially strong arguments.

What exactly do you mean by "function reference?" I have a bunch of arrow->webgl stuff sitting around on observable, but it's mostly really messy code. For my own purposes a recent example is this map of Ukrainian websites that toggles between a space-filling curve layout and using [long, lat] for positions when you press a button in the lower left--it's convenient to treat both of those in terms of [x, y] coordinates and interpolate between the dimensions separately. But that's a boutique case, and it wouldn't be much of a lift to make it work with the 'interleaved' layout.

@bmschmidt

Following up on @paleolimbot's idea in #20 that struct<x, y> seems more "arrow-like" than fixed-size list. I think the reason is that the core idea of arrow vs. whatever came before is that everything is basically pulled out of row-major order into column-major order. Using fixedsizelist<2> over struct<x, y> feels icky because it means giving up that strategy at the most granular level of the data. It's maybe worth it because row-major points are how legacy systems expect data; but because column-oriented systems expect things to be laid out the other way, it may also make things more difficult with parquet indexes and other things that emerge in this area. (I had this thought reading someone on the duckdb Discord noting that vectorization optimizations in their codebase mean they're unlikely to use any WKB-based formats there.)

@paleolimbot
Contributor Author

Thanks for being persistent here, I meant to follow up on @kylebarron's work on indexing here last week.

Being able to pushdown bounding box queries into a Parquet file is the best argument yet I've heard for changing the memory layout spec. @jorisvandenbossche has almost finished the PR on the memory layout and the other point layout is currently being read/written by the GDAL driver, so we'd have to backtrack a bit to make the change. I personally think that is worth it (and would be happy to implement/make data available in that format for testing).

@jorisvandenbossche
Contributor

> struct<x, y> seems more "arrow-like" than fixed-size list

Personally I don't think either is more "arrow-like" than the other. Both use one of the built-in Arrow types for exactly what it was meant, both have their advantages and disadvantages, and both are being used in the wild in practice in applications.

So repeating from what I said in #20 (comment), I think we can probably be flexible in the specification about multiple options for the inner level point representation. Converting from one layout to the other is relatively trivial, and only touches this inner level (all other (offset) arrays stay exactly the same).

@paleolimbot
Contributor Author

That's a good "point" (as it were)...I probably won't spend much time prototyping this so that I can focus on other things, and we can circle back to allow that encoding if or when somebody does have time to demonstrate that it's better/easier/faster. I think the Parquet chunk statistics issue will be a moot point eventually, because tree-based indexing is a must at some point and does more or less the same thing.

@apps4uco

Hi, I have just come across GeoParquet after looking for an efficient storage (and processing) format for geospatial data. I am by no means an expert on Parquet, so please bear that in mind if I say anything wrong or out of place. I have read through various issues, but I haven't yet found a design document.

To give some context, I am considering writing a GeoParquet implementation for Rust, which could possibly be integrated into the Geozero crate in the future.

The idea of storing data in a way that makes use of the design of Parquet (rather than, or as well as, WKB) sounds very appealing; however, I have some questions about the current design decision of storing all the coordinates in a sequence.

My aim is not to offend anyone, but to make sure the best format for storing data is chosen.

Storing all the coordinates in sequence seems to go against the columnar design of Parquet/Arrow:

  • data compression works more efficiently when data representing the same variable are stored contiguously.
  • SIMD instructions are more efficient when the data is contiguous, useful for transformation of coordinate systems
  • Page statistics (i.e., max and min) aren't so useful when the numbers do not represent the same variable.
  • If points don't all have the same attributes, NaN can be stored in the missing attributes, but Parquet already has a mechanism for indicating null values.
  • Points with timestamps as 64-bit longs can't be supported.

I can understand that if the goal is just to store the data, and the x and y values are always stored and used together, then it may make sense to store them in a contiguous fixed-size list. However, that would mean that (as I understand Parquet) one would lose out on optimal data compression, SIMD efficiency, and bounding box queries.

My understanding of GeoParquet/GeoArrow Goals:

  • Native Parquet/Arrow encoding
  • Good Compression
  • SIMD Efficient
  • GPU Efficient (CUDA/OpenCL and Rendering)
  • Efficient Bounding Box Queries (on RowGroups, Pages and Individual Features)
  • Allow 3D and temporal data: x, y, z, t (time), m (measurement) (and also support that not all data has all attributes)
  • Allow geometry collections (different geometries to be stored in a single column, i.e., Polygon and MultiPolygon)
  • …. Feel free to add anything else

Question: Are all of these possible simultaneously?

Options:

  1. WKB with an N/E/S/W bounding box (note: not x_min/x_max, as that has problems when crossing the date line)
  2. Contiguous Array and N E S W bounding box
  3. Storing a point struct (I believe that would mean that all the data has to have the same attributes; I believe the statistics will give the bounding box for free)
  4. Storing x in one column and y in another (Bounding Box is generated for free)

From my current understanding of Parquet, I believe the last option allows all the goals except for geometry collections.
However, my guess is that there isn't a one-size-fits-all; would it be feasible to allow several different formats?
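
As an illustration of what option 4 buys: with plain x and y columns, a reader can already push a bounding-box predicate down to Parquet row-group statistics. A hedged sketch with pyarrow (the file name and column names here are hypothetical):

```python
import pyarrow.parquet as pq

# Option 4: a file with plain 'x' and 'y' double columns (hypothetical).
# Row groups whose column min/max statistics fall entirely outside the
# box can be skipped without decoding any values.
xmin, ymin, xmax, ymax = -10.0, 40.0, 5.0, 55.0
table = pq.read_table(
    "points.parquet",
    filters=[
        ("x", ">=", xmin), ("x", "<=", xmax),
        ("y", ">=", ymin), ("y", "<=", ymax),
    ],
)
```

Note that this simple min/max approach is exactly what breaks for geometries crossing the date line, hence the N/E/S/W suggestion below.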

Suggestion: for bounding boxes, use N/E/S/W instead of x_min/x_max/y_min/y_max, as the latter has problems when crossing the date line.

I would also like to suggest that points be supported by storing latitude in one column and longitude in another, as a simple table.

I'd like to get others' feedback. Could a design document be written that explains the trade-offs involved?

@paleolimbot
Contributor Author

I made #26 to officially propose this change...feel free to chime in there with any additional ideas in either direction!

@paleolimbot
Contributor Author

Just linking #32 since it's a plug for the option of dictionary-encoded points.

@paleolimbot
Contributor Author

Closing this because we now have the option for interleaved and struct coordinates + a C implementation for both!
