Point representations in a nested list memory layout #16
I've been watching this discussion closely without much to offer; just wanted to chime in with a quick experiment I ran on the compressibility of these formats when writing to Parquet, since I'm especially interested in this format for the web and I always find the behavior of Parquet compression hard to intuit. The data is about 400K distinct lat-long points for IP addresses, from here. The results are a little surprising to me, but I don't think they have strong implications: at low compression levels (snappy, and zstd compression level 1) struct-formatted points take up marginally less space, but at high levels of compression interleaved points compress down better. Not enough to be a major issue either way.

[Chart: Parquet file size for 400K lat-long points, x/y columns vs FixedSizeList.]
Thank you for this! I'm just getting to the "building point arrays" bit and am having a look at this again. An experiment like the one you've done here matches what I've found when working with these elsewhere: we get pretty comparable performance and file sizes whether we do interleaved or separate.

A few observations I've made while implementing builders for these arrays:

- When looping over arrays, the two things that take the most time are, I think, function call overhead (the loops are pretty tight) and memory allocation (or rather reallocation, since the builders are implemented as growable vectors). With 'interleaved', we have one buffer holding all the values; with 'separate', we have up to 4 buffers holding the values, which means we spend up to 4 times as much time reallocating as we would for the 'interleaved' option.
- Similarly, we can often accomplish whatever we're doing with the points in a single function call in the 'interleaved' case, whereas the 'separate' case might need up to four function calls. I think this point is pretty minor because in most cases those function calls will be to inline functions that the compiler will probably optimize out.
- I don't happen to be doing anything where I'd notice an optimization at the 'data locality' level, but I think the 'interleaved' option is better here too: pretty much every coordinate operation needs all the coordinates at the same time. If we're doing something like matrix multiplication on coordinates (an affine transform), it's almost certainly better to keep coordinate values together.

I'm new to optimizing at this level, so those observations might be misguided! That said, I think we do need to provide an easy way into and out of this format for the reasons you mentioned above (using non-geo-specific functions, sane extension to other point types).
While copying isn't awesome, I've basically been unable to find an example where translating between 'interleaved' and 'separate' takes a measurable amount of time (on the order of 0.2 s for 29 million points). All that is to say that I think the existing proposal of points as a fixed-size list of doubles is the best option. We will see the most benefit from having points represented exactly one way when passed via the C Data Interface or written to files; translating between point representations is cheap and can be done in almost every runtime before creating a 'geoarrow' object or after reading one.
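The translation described here is a strided copy in either direction. A minimal stdlib-Python sketch of the round trip (plain `array` buffers standing in for Arrow buffers; the values are invented for illustration):

```python
from array import array

# Interleaved [x0, y0, x1, y1, ...] buffer of doubles
interleaved = array("d", [1.0, 10.0, 2.0, 20.0, 3.0, 30.0])

# Split into separate x and y buffers with strided slices
xs = interleaved[0::2]
ys = interleaved[1::2]

# And back again: re-interleave the separate buffers
roundtrip = array("d", (v for pair in zip(xs, ys) for v in pair))

assert list(xs) == [1.0, 2.0, 3.0]
assert list(ys) == [10.0, 20.0, 30.0]
assert roundtrip == interleaved
```

In practice a vectorized implementation (numpy strided views, Arrow compute kernels) would do the same thing without a Python-level loop, which is why the conversion is so cheap relative to file I/O.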
I was worried about JavaScript performance on transformation, but a quick check indicates that a for-loop that interleaves record batches one at a time performs pretty well: 10 ms for 400K records, 120 ms redoing the operation about 100 times. That bounds the operation on my laptop between 120 ms and 800 ms for 30M points, depending on how much reusing the same buffers over and over again is cheating.

The other important web-application concern is pushing these to GPU buffers. Interleaved has pretty clear advantages on WebGL, because it uses up only one vertex attribute slot instead of two. WebGL is kind of irrelevant because usually you're pushing floats, not doubles, and, God willing, it will die in the next few years; but I'd imagine similar concerns will apply to WebGPU as well.

So yeah, I guess fixed-size list of doubles is probably better than separate x and y.
First: it's so cool that you can use Arrow in Observable! Is WebGPU/WebGL the primary use case for separate x and y vectors? Is there a function reference somewhere that I could read to see where you were hoping to pass these arrays?
Yeah, arrow-on-observable (and more recently (Parquet->Arrow)-on-observable) is really great. Separate X and Y vectors are bad on WebGL because you need to attach two separate buffers, rather than passing just one and destructuring each pair into [x, y] on the GPU. I have trouble thinking of a really strong case where separating [x, y] is truly helpful for programming or performance.

I think my aversion to 'fixed-size list' is primarily aesthetic: latitude and longitude (or x and y in another CRS) are things that have names, so my gut says they should have names in the Arrow layout. It just feels wonky to have to pluck out the first element in a list to get longitude. And in a struct, the slightly irritating confusion over whether the order is [lat, long] or [long, lat] doesn't exist. But these aren't especially strong arguments.

What exactly do you mean by "function reference"? I have a bunch of arrow->webgl stuff sitting around on Observable, but it's mostly really messy code. For my own purposes, a recent example is this map of Ukrainian websites that toggles between a space-filling-curve layout and using [long, lat] for positions when you press a button in the lower left: it's convenient to treat both of those in terms of [x, y] coordinates and interpolate between the dimensions separately. But that's a boutique case, and it wouldn't be much of a lift to make it work with the 'interleaved' layout.
Following up on @paleolimbot's idea in #20 that struct<x, y> seems more "arrow-like" than fixed-size list: I think the reason is that the core idea of Arrow, versus what came before, is that everything is basically pulled out of row-major order into column-major order.
Thanks for being persistent here; I meant to follow up on @kylebarron's work on indexing here last week. Being able to push down bounding-box queries into a Parquet file is the best argument I've yet heard for changing the memory layout spec. @jorisvandenbossche has almost finished the PR on the memory layout, and the other point layout is currently being read/written by the GDAL driver, so we'd have to backtrack a bit to make the change. I personally think that is worth it (and would be happy to implement it and make data available in that format for testing).
Personally, I don't think either of the two is more "arrow-like" than the other. Both use one of the built-in Arrow types for exactly what it was meant for, both have their advantages and disadvantages, and both are being used in the wild in practice in applications. So, repeating what I said in #20 (comment): I think we can probably be flexible in the specification and allow multiple options for the inner-level point representation. Converting from one layout to the other is relatively trivial, and it only touches this inner level (all other (offset) arrays stay exactly the same).
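To make the "only the inner level changes" point concrete, here is a toy sketch with plain Python lists standing in for Arrow buffers (not real pyarrow API; the linestring data is invented for illustration):

```python
# A linestring array sketched as Arrow-style buffers:
# offsets into the coordinate level, plus the coordinates themselves.
offsets = [0, 2, 5]  # two linestrings: coords [0, 2) and [2, 5)

# Interleaved ("fixed-size list") coordinate level: [x0, y0, x1, y1, ...]
interleaved = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0]

# Converting to the struct<x, y> layout rewrites only this inner level...
xs = interleaved[0::2]
ys = interleaved[1::2]

# ...while the offsets (and any outer nesting) are reused unchanged.
converted_offsets = offsets

assert converted_offsets is offsets  # same buffer, no copy
assert xs == [0.0, 1.0, 2.0, 3.0, 4.0]
```

The same holds for polygons or multi-geometries with more nesting levels: every offset buffer above the coordinate level carries over as-is, so the conversion cost scales only with the number of coordinates.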
That's a good "point" (as it were). I probably won't spend much time prototyping this, to focus on other things, and we can circle back to allow that encoding if or when somebody does have time to demonstrate that it's better/easier/faster. I think the Parquet chunk-statistics issue will be a moot point eventually, because tree-based indexing is a must at some point and does more or less the same thing.
Hi, I have just come across GeoParquet while looking for an efficient storage (and processing) format for geospatial data. I am by no means an expert on Parquet, so please bear that in mind if I say anything wrong or out of place. I have read through various issues, but I haven't yet found a design document. For context, I am considering writing a GeoParquet implementation for Rust that could possibly be integrated into the geozero crate in the future. The idea of storing data in a way that makes use of the design of Parquet sounds very appealing, rather than (or as well as) WKB; however, I have some questions about the current design decision of storing all the coordinates in a sequence. My aim is not to offend anyone, but to make sure the best format for storing data is chosen.

Storing all the coordinates in sequence seems to go against the columnar design of Parquet/Arrow. I can understand that if the goal is just to store the data, and the x and y values are stored together and always used together, then it may make sense to store them in a contiguous fixed-size list. However, that would mean (as I understand Parquet) losing out on optimal data compression, SIMD efficiency, and bounding-box queries.

My understanding of the GeoParquet/GeoArrow goals:

Question: are all of them possible simultaneously? Options:

From my current understanding of Parquet, I believe the latter option allows all the goals except for geometry collections. A suggestion for bounding boxes: use N/E/S/W instead of x_min/x_max/y_min/y_max, as the latter has problems when crossing the date line. I would also like to suggest that points be supported by storing latitude and longitude as columns in a simple table. I'd like to get others' feedback. Could a design document be written that explains the trade-offs involved?
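The bounding-box-query advantage of separate x/y columns comes from Parquet's per-row-group column statistics: a reader can skip whole row groups whose (min, max) ranges cannot intersect the query box. A toy model of that pushdown (the row groups, column names, and query box here are all invented for illustration, and it deliberately ignores the dateline-crossing caveat raised above):

```python
# Toy model of Parquet row-group statistics: with separate x and y
# columns, each row group carries a per-column (min, max), so a
# bounding-box query can skip whole row groups without decoding them.
row_groups = [
    {"x": [0.0, 1.0, 2.0], "y": [50.0, 51.0, 52.0]},
    {"x": [100.0, 101.0], "y": [10.0, 11.0]},
]

def stats(col):
    """(min, max) as a writer would record in the column chunk metadata."""
    return min(col), max(col)

def may_intersect(rg, xmin, ymin, xmax, ymax):
    """Conservative check: can this row group contain points in the box?"""
    cx, cy = stats(rg["x"]), stats(rg["y"])
    return not (cx[1] < xmin or cx[0] > xmax or cy[1] < ymin or cy[0] > ymax)

# Query a box around x in [0, 5], y in [49, 53]:
# only the first row group needs to be read.
hits = [i for i, rg in enumerate(row_groups) if may_intersect(rg, 0, 49, 5, 53)]
assert hits == [0]
```

With an interleaved fixed-size list, the single coordinate column's min/max mixes x and y values together, so this kind of per-dimension pruning is not available.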
I made #26 to officially propose this change...feel free to chime in there with any additional ideas in either direction! |
Just linking #32 since it's a plug for the option of dictionary-encoded points. |
Closing this because we now have the option for interleaved and struct coordinates + a C implementation for both! |
In #12 one of the open questions is whether or not we need the flexibility to store points in more than one way. The current way to store them as written is a `fixed_size_list<double>` with a fixed size of the number of dimensions (e.g., 2, 3, or 4), although another item in the "open questions" is whether or not to store Z in a separate buffer (I think FlatGeobuf does something like this). In my mind there are a few useful storage types for points:

- `fixed_size_list<double, number_of_dimensions>`: This is nice because we can `memcpy()` from WKB and because we can do some common transforms (e.g., affine-transforming an array of geospatial coordinates to coordinates in a graphics device) using matrix multiplication (because it's exactly the same buffer as a row-major coordinate matrix). Because it's exactly like WKB, we can also defer to that specification for the reasons why we did it that way.
- `struct(x <double>, y <double>, [z, m, etc.])`: This is more native to how geometries are stored in some places (e.g., cuSpatial at one point, and most places in R, because our matrix objects are column-major) and has the nice property that most of the existing kernels in the Arrow compute engine operate on columns, so you could do something like `(x > 5) & (y > 7)` and get a lot of existing machinery for free. It's also arguably more "arrow-like" to give each dimension its own column, and you can add or drop dimensions while efficiently re-using buffers. Some initial benchmarking suggests that it's also about 50% faster to read points from a Parquet file in this form.
- An integer encoding (stored as a `uint64_t`): useful because testing for containment is very fast for global (longitude, latitude) data. It also takes up half the size of the equivalent longitude/latitude pair (if stored as doubles).

Supporting more than one representation adds some complexity to the implementation because we would basically need to template any "GeoArray" class on the point representation. That said, as long as a point representation can be exported as doubles in xy, xyz, xym, or xyzm space, the templating makes it so that we can support multiple point types without virtual-method-call overhead for the conversion. Any reader implementation should be checking the storage types before accessing buffers anyway, so it's my opinion that leaving the option open for more than one storage type is worth it.
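The "row-major coordinate matrix" property of the fixed-size-list layout can be shown with a small sketch: an affine transform is a matrix multiply plus a translation applied directly to the interleaved buffer. Pure-stdlib Python with invented example values (a numpy or Arrow compute implementation would vectorize the loop):

```python
# The interleaved buffer [x0, y0, x1, y1, ...] is exactly a row-major
# N x 2 coordinate matrix, so an affine transform walks it in place.
coords = [1.0, 2.0, 3.0, 4.0]  # two points: (1, 2) and (3, 4)

# 2x2 transform matrix (scale x by 2, y by 3) plus translation (10, 20)
a, b, c, d = 2.0, 0.0, 0.0, 3.0
tx, ty = 10.0, 20.0

out = []
for i in range(0, len(coords), 2):
    x, y = coords[i], coords[i + 1]
    out += [a * x + b * y + tx, c * x + d * y + ty]

# (1, 2) -> (12, 26) and (3, 4) -> (16, 32)
assert out == [12.0, 26.0, 16.0, 32.0]
```

The struct layout supports the same operation, but column by column; the interleaved layout lets a single pass (or a single BLAS-style call) touch both dimensions of every point with good locality.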