
trivial fix to bboxHexRadius to guarantee containment #279

Merged 7 commits into uber:master from kwmsmith:fix-polyfill on Sep 17, 2019

Conversation

@kwmsmith (Contributor)

Fixes #136, at least in my use cases.

@nrabinowitz (Collaborator)

This looks like it should be fine, but fails the polyfill tests...

@dfellis (Collaborator) commented Sep 11, 2019

It's possible that the test itself encoded a bad fill and was not noticed.

@kwmsmith (Contributor, Author)

Looking into test failures...

@kwmsmith (Contributor, Author)

I updated the test to pass, based on @dfellis's comment.

One thing I can't figure out is that the numbers for maxPolyfillSize (7057) and the number of hexagons from polyfill (1253) seem way too big for the test polygon at resolution 9.

https://github.com/uber/h3/blob/master/src/apps/testapps/testPolyfill.c

When I visualize the polygon using h3-py and use h3-py's polyfill, I get 1 resolution 9 hexagon for the given polygon, not 1253.

[screenshot: Screen Shot 2019-09-12 at 8 18 50 AM]

What could explain the discrepancy?

@dfellis (Collaborator) commented Sep 12, 2019

> I updated the test to pass, based on @dfellis's comment.
>
> One thing I can't figure out is that the numbers for maxPolyfillSize (7057) and the number of hexagons from polyfill (1253) seem way too big for the test polygon at resolution 9.
>
> https://github.com/uber/h3/blob/master/src/apps/testapps/testPolyfill.c
>
> When I visualize the polygon using h3-py and use h3-py's polyfill, I get 1 resolution 9 hexagon for the given polygon, not 1253.
>
> [screenshot: Screen Shot 2019-09-12 at 8 18 50 AM]
>
> What could explain the discrepancy?

For the exact issue you're pointing out between h3 and h3-py, I think you accidentally didn't convert the polygon's radians to degrees. The C library takes lat, lng in radians, not degrees (and has the helper functions degsToRads and radsToDegs to convert). The Python library (and most of the bindings, tbh) decided to always use degrees for lat, lng, since that is the far more common way to store coordinates.
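For anyone hitting the same mismatch, here's a minimal sketch of feeding degree-based coordinates to the v3 C API (the vertex values below are placeholders, not the test polygon):

```c
#include <stdio.h>
#include <stdlib.h>
#include "h3api.h"  // H3 v3 C API

int main(void) {
    // Vertices in degrees, as most datasets store them (placeholder values).
    double latsDeg[] = {37.7866, 37.7198, 37.7076};
    double lngsDeg[] = {-122.3805, -122.3544, -122.5123};
    GeoCoord verts[3];
    for (int i = 0; i < 3; i++) {
        // The C library expects radians; h3-py does this conversion for you.
        verts[i].lat = degsToRads(latsDeg[i]);
        verts[i].lon = degsToRads(lngsDeg[i]);
    }
    GeoPolygon polygon = {.geofence = {.numVerts = 3, .verts = verts},
                          .numHoles = 0, .holes = NULL};
    int res = 9;
    int max = maxPolyfillSize(&polygon, res);
    H3Index* out = calloc(max, sizeof(H3Index));
    polyfill(&polygon, res, out);  // unused slots are left as 0
    int count = 0;
    for (int i = 0; i < max; i++) {
        if (out[i] != 0) count++;
    }
    printf("polyfill returned %d cells\n", count);
    free(out);
    return 0;
}
```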

As for why the algorithm implementation is apparently broken here and you had to update the number of hexagons for the test, that's basically my fault.

The current polyfill algorithm is a hack I put together a few years ago until we got line support and could implement a more efficient fill algorithm based on it (we do have h3Line now, but it has some issues crossing multiple icosahedron faces, so it's still not quite there yet).

This algorithm takes the (lat, lng) bounding box of the polygon and then operates on a circle centered on the bounding box's center, with a radius just large enough to fit the bounding box within. It then generates a kRing cluster of hexagons that could contain that circle, assuming the hexagons are all uniform and average in shape.

The issue is that near the pentagons the distortion rises dramatically, and the kRing may be oblong instead of regular. Your change effectively makes the algorithm think the hexagons are much smaller than they are. We should probably check the minimum edge length in H3 instead of the average to find the real constant, which I think might be more like 1.2 or so.

I don't know if the performance gain from finding that exact constant is worth it, though, since either way we'll be slowing things down and consuming more memory, especially if we can just replace this entire memory-wasting algorithm with a much smaller and faster one based on h3Line.
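For context, a simplified sketch of the estimate being discussed; this paraphrases the bbox.c logic rather than copying it, and bboxCenter, _geoDistKm, and _hexRadiusKm refer to the library's internal helpers:

```c
#include <math.h>
#include "h3api.h"
#include "bbox.h"  // internal header: BBox, bboxCenter

// Sketch of the kRing-radius estimate (paraphrased, not verbatim bbox.c).
// _geoDistKm and _hexRadiusKm are internal helpers in the library.
int bboxHexRadiusSketch(const BBox* bbox, int res) {
    // Center of the bounding box, and a corner to measure the radius against.
    GeoCoord center;
    bboxCenter(bbox, &center);
    GeoCoord vertex = {bbox->north, bbox->east};

    // Radius of a circle (in km) that just contains the bounding box.
    double bboxRadiusKm = _geoDistKm(&center, &vertex);

    // Radius of the hexagon at the circle's center.
    double centerHexRadiusKm = _hexRadiusKm(geoToH3(&center, res));

    // Pre-PR estimate: assumes ideal packing of average-shaped hexagons.
    // This PR effectively drops the 1.5 divisor, trading ~2x runtime for
    // guaranteed containment near distorted (pentagon-adjacent) regions.
    return (int)ceil(bboxRadiusKm / (1.5 * centerHexRadiusKm));
}
```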

@kwmsmith (Contributor, Author)

> For the exact issue you're pointing out between h3 and h3-py, I think you accidentally didn't convert the polygon's radians to degrees. The C library takes lat, lng in radians, not degrees (and has the helper functions degsToRads and radsToDegs to convert). The Python library (and most of the bindings, tbh) decided to always use degrees for lat, lng, since that is the far more common way to store coordinates.

Ah, yes, that makes sense, thanks.

And thanks for the explanation on polyfill.

@kwmsmith (Contributor, Author)

@nrabinowitz @dfellis tests pass, changelog updated. Anything more? If not, can this be merged?

@nrabinowitz (Collaborator) left a comment

This looks fine to me, and I think we could land it as is. I do think it would benefit from at least one test for polygons not previously filled - for example, does this fix #136, and can we add a test for that polygon to verify?

I'd also love it if you could show the results of `make benchmarks` after this change, just so we have a sense of the impact on performance. I'd rather we were correct than fast, but it would be good to understand the consequence.

@kwmsmith (Contributor, Author) commented Sep 14, 2019

@nrabinowitz output from `make benchmarks`:

With scaling 1.5:

[100%] Built target benchmarkPolyfill
        -- polyfillSF: 2889.500000 microseconds per iteration (500 iterations)
        -- polyfillAlameda: 5132.732000 microseconds per iteration (500 iterations)
        -- polyfillSouthernExpansion: 218891.200000 microseconds per iteration (10 iterations)

With scaling 1.0:

[100%] Built target benchmarkPolyfill
Scanning dependencies of target bench_benchmarkPolyfill
        -- polyfillSF: 6377.710000 microseconds per iteration (500 iterations)
        -- polyfillAlameda: 10470.600000 microseconds per iteration (500 iterations)
        -- polyfillSouthernExpansion: 479781.800000 microseconds per iteration (10 iterations)

Performance slows by a bit over 2x.

@kwmsmith
Copy link
Contributor Author

Scaling 1.5

resolution 13
num hexagons: 3945
[image: polyfill-1.5]

Scaling 1.5 detail

[image: detail-1.5]

Scaling 1.0

resolution 13
num hexagons: 4353
[image: polyfill-1.0]

Scaling 1.0 detail

[image: detail-1.0]

@kwmsmith (Contributor, Author)

I think all comments have been addressed:

  • see above for the benchmark comparison
  • regression test added: src/apps/testapps/testPolyfill_GH136.c
  • visual comparison between old and new for one case (images above)

Anything else?

@coveralls commented Sep 16, 2019

Coverage Status

Coverage remained the same at 99.253% when pulling 2f5a701 on kwmsmith:fix-polyfill into 4bdb15c on uber:master.

@coveralls

Coverage Status

Coverage remained the same at 99.253% when pulling 56d2ff5 on kwmsmith:fix-polyfill into 4bdb15c on uber:master.

@nrabinowitz (Collaborator)

@kwmsmith Thanks so much for the follow-up here. I feel like ideally we'd be able to have an actual mathematically derived constant here, since if we could choose, say, 1.25 and only get 1.5x perf cost, that would be better. I'm unfortunately not sure how we'd calculate it, though. An empirical option would be to test on a large sample of rectangular polygons, and verify that the cells at each corner of each polygon were included in the polyfill, though in this case the issue is determining a large enough sample set.

All that said, I'd rather be correct than fast, and I think we should probably land this as is and address performance later.

@kwmsmith (Contributor, Author)

> All that said, I'd rather be correct than fast, and I think we should probably land this as is and address performance later.

That's my strong preference -- I can swallow a 2x perf degradation. But I can't swallow incorrect polyfill results :)

@dfellis mentioned re-implementing polyfill using H3Line or somesuch -- when would that better and faster implementation land?

@nrabinowitz (Collaborator)

> @dfellis mentioned re-implementing polyfill using H3Line or somesuch -- when would that better and faster implementation land?

This is referring to adapting the implementation we had in a prior version of H3, when we had a simpler hex grid. Honestly, we're a bit out from it, and I suspect this is a v4 feature.

  • It depends on #195 (Improvement to h3Line algorithm: support any great arc), which is itself a bit of work
  • It may have weird cases around pentagons that need more investigation
  • It would result in a spherical polyfill, where we currently use a Cartesian polyfill. That is a potentially breaking API change, and we'd also need to implement spherical point-in-poly.

@dfellis (Collaborator) commented Sep 16, 2019

> > All that said, I'd rather be correct than fast, and I think we should probably land this as is and address performance later.
>
> That's my strong preference -- I can swallow a 2x perf degradation. But I can't swallow incorrect polyfill results :)
>
> @dfellis mentioned re-implementing polyfill using H3Line or somesuch -- when would that better and faster implementation land?

I don't really know when we'll get that done. I only have my spare time to work on this anymore, since I'm not at Uber.

The dependency graph on that is:

  • Get h3Line to draw lines across icosahedron edges (possible solution: detect icosahedron edge intersections, segment the line drawing per icosahedron face, and detect/eliminate the duplicate hexagon IDs)
  • Implement the 2D area-filling algorithm using one of the algorithms described here, after tracing the polygon with h3Line and converting the cells to 2D IJ coordinates. (The IJ coordinates may also require splitting across icosahedrons and then merging the results; I don't remember where we got with the icosahedron-unfolding concept.) A sketch of the tracing step is below.

So it's not impossible, and if I could dedicate all my time to this I could probably get it done in a week, but since I can't, I don't really know what the ETA would be.
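To make the tracing step concrete, a hedged sketch using the existing v3 h3Line API; it ignores deduplication of shared endpoints and simply bails out on the icosahedron-crossing failure described above:

```c
#include <stdlib.h>
#include "h3api.h"

// Trace a polygon loop with h3Line, one segment per pair of adjacent
// vertices. Returns the number of cells written to `out` (the caller sizes
// it generously), or -1 if any segment cannot be drawn -- e.g. when the
// segment crosses an icosahedron edge, the known h3Line limitation.
int traceOutline(const Geofence* loop, int res, H3Index* out) {
    int count = 0;
    for (int i = 0; i < loop->numVerts; i++) {
        H3Index start = geoToH3(&loop->verts[i], res);
        H3Index end = geoToH3(&loop->verts[(i + 1) % loop->numVerts], res);
        int size = h3LineSize(start, end);
        if (size < 0) return -1;
        if (h3Line(start, end, &out[count]) != 0) return -1;
        count += size;  // adjacent segments share endpoints; dedupe afterward
    }
    return count;
}
```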

@isaacbrodsky merged commit 716e57a into uber:master on Sep 17, 2019
@nrabinowitz (Collaborator)

FYI @dfellis and @kwmsmith - I'm looking into an alternate algorithm, prototyped here: https://observablehq.com/@nrabinowitz/spidering-polyfill

This doesn't depend on h3Line, considers a far smaller set of cells than the giant k-ring approach, and should have the same (or better) guarantees of correctness. Would be interested in any thoughts or comments.

@dfellis (Collaborator) commented Sep 18, 2019

> FYI @dfellis and @kwmsmith - I'm looking into an alternate algorithm, prototyped here: https://observablehq.com/@nrabinowitz/spidering-polyfill
>
> This doesn't depend on h3Line, considers a far smaller set of cells than the giant k-ring approach, and should have the same (or better) guarantees of correctness. Would be interested in any thoughts or comments.

My thoughts:

That does seem like a mostly improved approach versus what we have now, assuming we fix an issue with it that I'll go into in more depth below.

It's probably slower than the speed H3Line + IJ-based filling could do (because of the repeated k-ring generation), but I'm not sure if it would be significantly slower.

The maxPolyfillSize could be reduced to taking the area of the bounding box and dividing it by the minimum area of a hexagon (or just a pentagon area to be safe).
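Roughly like this sketch; note that v3 only exposes the average hexagon area (hexAreaKm2), so the divide-by-3 safety factor below is an assumption standing in for a true minimum-cell-area or pentagon-area helper:

```c
#include <math.h>
#include "h3api.h"

// Upper-bound estimate: bounding-box area divided by the smallest cell area
// at the target resolution.
int maxPolyfillSizeEstimate(double bboxAreaKm2, int res) {
    double avgCellAreaKm2 = hexAreaKm2(res);      // average hexagon area (v3 API)
    double minCellAreaKm2 = avgCellAreaKm2 / 3.0; // assumed safety factor
    return (int)ceil(bboxAreaKm2 / minCellAreaKm2);
}
```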

The current implementation doesn't seem to use anything but the first starting point, because it recurses depth-first from that point (though I suppose holes in the geometry could produce fully isolated regions, making it necessary to iterate over all the starting points). I think a breadth-first search would be better: in a loop, find the neighbors that are in the geofence and add them to the set, then repeat the process only for the neighbors just found to be in the polygon, adding their matching neighbors to the set, and so on. The recursion depth would be lower (as each iteration finds more of the interior hexagons), making a stack overflow less likely. Actually, it should be possible to implement without any recursive function calls at all, using just three sets: one for all matching hexagons, one for the hexagons whose neighbors are being tested, and a third for the newly found neighbors, looping until the newly-found-neighbors set is empty.

Now that I'm thinking about it more, the recursion-less breadth-first search should also be faster: the two sets for the current frontier and the newly found matches can be allocated once at the beginning, and at the end of each loop iteration you just swap them, zeroing the one that will hold the next round's matches, while the recursive version has to allocate small chunks of memory at each step.
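A minimal sketch of that loop in C, with flat arrays standing in for real hash sets and pointInPoly as a hypothetical stand-in for the library's internal point-in-polygon test:

```c
#include <stdbool.h>
#include <stdlib.h>
#include "h3api.h"

// Hypothetical: wraps the library's internal point-in-polygon logic.
bool pointInPoly(const GeoPolygon* poly, const GeoCoord* coord);

static bool contains(const H3Index* arr, int n, H3Index h) {
    for (int i = 0; i < n; i++)
        if (arr[i] == h) return true;
    return false;
}

// Breadth-first "spider" fill from one seed cell. `found` and `cap` are the
// output buffer and its capacity; returns the number of cells found.
int spiderFill(const GeoPolygon* poly, int res, H3Index seed,
               H3Index* found, int cap) {
    // Frontier sets allocated once up front, then swapped each pass.
    H3Index* frontier = calloc(cap, sizeof(H3Index));
    H3Index* next = calloc(cap, sizeof(H3Index));
    H3Index neighbors[7];  // maxKringSize(1) == 7
    int nFound = 0, nFrontier = 0;
    found[nFound++] = seed;
    frontier[nFrontier++] = seed;
    while (nFrontier > 0) {
        int nNext = 0;
        for (int i = 0; i < nFrontier; i++) {
            kRing(frontier[i], 1, neighbors);
            for (int j = 0; j < 7; j++) {
                H3Index h = neighbors[j];
                if (h == 0 || contains(found, nFound, h)) continue;
                GeoCoord center;
                h3ToGeo(h, &center);
                if (!pointInPoly(poly, &center)) continue;
                found[nFound++] = h;  // record the match...
                next[nNext++] = h;    // ...and expand from it next pass
            }
        }
        H3Index* tmp = frontier;  // swap sets; no per-step allocation
        frontier = next;
        next = tmp;
        nFrontier = nNext;
    }
    free(frontier);
    free(next);
    return nFound;
}
```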

@nrabinowitz do you mind if I try to implement that? I think this is a great 80/20-rule type of situation versus the "best" H3Line + IJ fill approach, and I think it would only take me a day, which is much more motivating to me. :) Unless you want to do it, or you disagree with my comments above?

@isaacbrodsky (Collaborator)

@dfellis I presume the sets are fixed length arrays? Are they just the size of maxPolyfillSize?

@dfellis (Collaborator) commented Sep 18, 2019

> @dfellis I presume the sets are fixed length arrays? Are they just the size of maxPolyfillSize?

I think that would be safest. If the polygon were guaranteed to be convex, you could cut it down to just the number of hexagons needed for the outer rim of the bounding box (as the guaranteed maximum), but a concavity in the polygon means that instead of an average of 3 new hexagons per k-ring you could get up to all 6, and enough back-and-forth could cause an intermediate set to exceed the bounding-box rim set.

@nrabinowitz (Collaborator)

@dfellis I'm torn here, because I'd like to implement this myself but also realize that I won't have time to do so in the next few weeks. So I think it would be great if you want to take a shot at it. A few notes:

  • There's now an iterative version at https://observablehq.com/@nrabinowitz/spidering-polyfill, which might be worth looking at. @isaacbrodsky and I had discussed using this implementation with a linked-list stack, which would allow us to allocate only the necessary memory for the to-be-evaluated hexagons (see the sketch after this list). You may be right, though, that allocating one list item at a time like this would be slower than a big array at the beginning, so up to you. Without a real stack, though, it seems like you'd need to iterate through the array each time to find the new candidates (unless I'm missing something)? The linked list would allow constant-time pop and push for the candidate stack.

  • I hadn't thought through maxPolyfillSize, but I like the bbox/average hex area approach. It's possible you'd undercount, though, e.g. if the input was a rectangle in a portion of the grid with larger-than-average hex area. So a scaling factor should probably be applied here.

  • @isaacbrodsky and I spent some time thinking through this yesterday, and you need the initial set to include the 1-rings for all vertices and all hole vertices. Without this, we were able to identify shapes with non-traversable "isthmuses" (think e.g. an hourglass) that would prevent full polyfill. This isn't a big deal, as the point-in-poly logic already guarantees that the number of vertices is a factor in the time complexity.

Let me know if you want to discuss it any further. If you don't get to this relatively soon, I might find time to pick it up, but I'd ping you first.
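For comparison, the linked-list candidate stack mentioned in the first note might look like this minimal sketch; the per-node malloc is exactly the allocation cost being debated:

```c
#include <stdlib.h>
#include "h3api.h"

// Minimal linked-list stack for to-be-evaluated cells: constant-time push
// and pop, at the cost of one small allocation per candidate.
typedef struct CandidateNode {
    H3Index cell;
    struct CandidateNode* next;
} CandidateNode;

static void push(CandidateNode** top, H3Index cell) {
    CandidateNode* node = malloc(sizeof(CandidateNode));
    node->cell = cell;
    node->next = *top;
    *top = node;
}

static H3Index pop(CandidateNode** top) {
    CandidateNode* node = *top;
    H3Index cell = node->cell;
    *top = node->next;
    free(node);
    return cell;
}
```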

@dfellis (Collaborator) commented Sep 18, 2019

> @dfellis I'm torn here, because I'd like to implement this myself but also realize that I won't have time to do so in the next few weeks. So I think it would be great if you want to take a shot at it. A few notes:

I'll have some time tomorrow to take a stab at it. I'll let you know how far along I get.

> * There's now an iterative version at https://observablehq.com/@nrabinowitz/spidering-polyfill, which might be worth looking at. @isaacbrodsky and I had discussed using this implementation with a linked-list stack, which would allow us to allocate only necessary memory for the to-be-evaluated hexagons. You may be right, though, that allocating one list item at a time like this would be slower than a big array at the beginning, so up to you. Without a real stack, though, it seems like you'd need to iterate through the array each time to find the new candidates (unless I'm missing something)? The linked-list would allow constant-time `pop` and `push` for the candidate stack.

First point, thank you for that. That looks better, but it's still not quite what I mentioned above (that one expands outward from the first point as far as it can, then goes to the next point and tries to expand, etc.; I'm imagining something that takes all the points first, expands and checks each of them, then takes that new set and expands it outward, etc.). I'm not sure if there's any performance difference between the way I was envisioning it and this one, though.

As to your question about a chunk of memory versus linked lists, that's assuming we use the same hash function as before. I think an algorithm that's more expensive to insert but much cheaper to iterate is the best idea here, since the vast majority of the time I expect the number of output cells to be roughly equal to the number of input cells, which are based on the number of points defining the polygon (in the beginning I bet output ~= 3x input, but it will eventually level off, and the last few iterations will have an output significantly smaller than the input as the k-rings keep running into already-found hexagons).

The final output set should probably be hashed normally, since it's focused on insert and has checking, but for these two intermediate sets it seems best to use an algorithm that takes the highest and lowest hexagon IDs (by numeric value) seen in the prior set and tries to insert new hexagon IDs in quasi-sorted order, moving to the next slot when it collides with an existing hexagon, using a guess at how many hexagons the next iteration will find to determine the spacing from low to high. That should produce a "mostly compact" set, and if we also keep a count of how many were inserted in each set, we can iterate until we've found them all without walking the entire allocated chunk of memory. `insert` will be slightly slower (collisions will be higher) and `has` would require a full iteration of the memory, but `iterate` would be very fast, like the linked-list approach but without the constant allocations.
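A speculative sketch of that intermediate set, matching the description above rather than any existing H3 code:

```c
#include "h3api.h"

// "Mostly compact" intermediate set: the slot choice interpolates the cell ID
// between the lowest and highest IDs seen in the prior pass, with linear
// probing on collision. `count` lets iteration stop once every occupied slot
// has been visited, without walking the whole allocation.
typedef struct {
    H3Index* slots;  // zero-initialized; 0 means empty
    int capacity;    // sized from a guess at the next pass's output
    int count;
    H3Index lo, hi;  // ID range observed in the prior pass
} QuasiSortedSet;

static void qsInsert(QuasiSortedSet* set, H3Index cell) {
    // Quasi-sorted slot guess, proportional to the cell's position in [lo, hi];
    // clamped because a new pass may see IDs outside the prior range.
    double t = 0.0;
    if (cell > set->lo && set->hi > set->lo) {
        t = (double)(cell - set->lo) / (double)(set->hi - set->lo);
    }
    if (t > 1.0) t = 1.0;
    int idx = (int)(t * (set->capacity - 1));
    while (set->slots[idx] != 0) {
        if (set->slots[idx] == cell) return;  // already present
        idx = (idx + 1) % set->capacity;      // probe the next slot
    }
    set->slots[idx] = cell;
    set->count++;
}
```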

> * I hadn't thought through `maxPolyfillSize`, but I like the bbox/average hex area approach. It's possible you'd undercount, though, e.g. if the input was a rectangle in a portion of the grid with larger-than-average hex area. So a scaling factor should probably be applied here.

That's why I proposed bbox / pentagon area -- it will overestimate, but it should never underestimate (I'm also wary of my ability to come up with a constant bigger than that, since this whole conversation started from my failure to determine a good constant here).

> * @isaacbrodsky and I spent some time thinking through this yesterday, and you need the initial set to include the 1-rings for _all_ vertices and all _hole_ vertices. Without this, we were able to identify shapes with non-traversable "isthmuses" (think e.g. an hourglass) that would prevent full polyfill. This isn't a big deal, as the point-in-poly logic already guarantees that the number of vertices is a factor in the time complexity.

Including all vertices seems perfectly fine.

> Let me know if you want to discuss it any further. If you don't get to this relatively soon, I might find time to pick it up, but I'd ping you first.

Here's my discussion. :) Going to stop here for the day, though, and I'll try to get to prototyping something out tomorrow.

@nrabinowitz (Collaborator)

@ajfriend helped identify a failure case here:

[screenshot: Screen Shot 2019-09-19 at 5 36 02 PM]

Because you can't traverse between contained cells, you wouldn't get them with this algo :(.

@dfellis (Collaborator) commented Sep 20, 2019

I'm pretty sure a k = 2 approach to neighbor traversal will solve this problem. The algorithm would find the first and last hexagons, and at k = 2 we definitely catch these (because they are X,Y grid-aligned, and since this issue only occurs when the polygon is narrow and very straight in an X,Y coordinate system, k = 2 should be all that's needed to find them).

Can you get me the polygon here so I can add a test case to my branch? I would hate to lose the 3-4x performance improvement because of this.

@nrabinowitz (Collaborator)

> I'm pretty sure a k = 2 approach to neighbor traversal will solve this problem. The algorithm would find the first and last hexagons, and at k = 2 we definitely catch these (because they are X,Y grid-aligned, and since this issue only occurs when the polygon is narrow and very straight in an X,Y coordinate system, k = 2 should be all that's needed to find them).
>
> Can you get me the polygon here so I can add a test case to my branch? I would hate to lose the 3-4x performance improvement because of this.

This one was literally sketched on top of a screenshot, but I could come up with a real polygon.

I think k=2 won't do it, sadly. I think if you take a very long, very narrow triangle, center it on a given hex so that the hex center is contained, then rotate it at different angles, you can find angles w/r/t the grid where you'd have an arbitrarily large buffer of non-traversable cells.

I'm also bummed by this, and not sure whether we should care about these edge cases or not. The challenge is not only "Are we correct for a single weird polygon?", but "Are we correct for a set of contiguous polygons?" - in the latter case, poor handling of slivers could end up with "dead" cells in the set, clearly within the boundary of the set, but not assigned to any polygon.

@dfellis (Collaborator) commented Sep 21, 2019

> > I'm pretty sure a k = 2 approach to neighbor traversal will solve this problem. The algorithm would find the first and last hexagons, and at k = 2 we definitely catch these (because they are X,Y grid-aligned, and since this issue only occurs when the polygon is narrow and very straight in an X,Y coordinate system, k = 2 should be all that's needed to find them).
> >
> > Can you get me the polygon here so I can add a test case to my branch? I would hate to lose the 3-4x performance improvement because of this.
>
> This one was literally sketched on top of a screenshot, but I could come up with a real polygon.
>
> I think k=2 won't do it, sadly. I think if you take a very long, very narrow triangle, center it on a given hex so that the hex center is contained, then rotate it at different angles, you can find angles w/r/t the grid where you'd have an arbitrarily large buffer of non-traversable cells.
>
> I'm also bummed by this, and not sure whether we should care about these edge cases or not. The challenge is not only "Are we correct for a single weird polygon?", but "Are we correct for a set of contiguous polygons?" - in the latter case, poor handling of slivers could end up with "dead" cells in the set, clearly within the boundary of the set, but not assigned to any polygon.

So I see what you're getting at now. This also indicts the H3Line approach outlined above, since this polygon can't even draw a line between the points while maintaining the contiguous-polygons-produce-contiguous-H3-sets behavior.

But I do still have an idea for a solution here, and it involves H3Line. The algo uses the polygon's points to get likely starting hexagons to "spider" from, but there's no reason that has to be the starting set. We could run H3Line between consecutive points of the polygon to get a larger set of hexagons to test from. That would let us get all of these hexagons, because while we can't traverse from the points, we can traverse from places along the line segments.

It would only fail where a line segment crosses an icosahedron edge (we can fall back to just the points for now and fix up H3Line later), and there would be problems with polygons large enough that a single segment is affected by the spherical-vs-Cartesian projection differences (but that would also require the polygon to be one of these degenerate polygons at that scale, which seems very unlikely and not worth supporting).

I would need to see how putting H3Line into the "spider" algorithm affects performance. I expect it to hurt it some, but I'm not sure by how much.
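A hedged sketch of that seeding step; addSeed is a hypothetical hook into the spider algorithm's frontier set:

```c
#include <stdlib.h>
#include "h3api.h"

// Hypothetical: inserts a cell into the spider algorithm's starting set.
void addSeed(H3Index cell);

// Seed the spider from every cell along the polygon's outline, not just the
// vertex cells. Falls back to the two endpoint cells when h3Line fails
// (e.g. across an icosahedron edge).
void seedFromOutline(const Geofence* loop, int res) {
    for (int i = 0; i < loop->numVerts; i++) {
        H3Index a = geoToH3(&loop->verts[i], res);
        H3Index b = geoToH3(&loop->verts[(i + 1) % loop->numVerts], res);
        int size = h3LineSize(a, b);
        if (size > 0) {
            H3Index* line = calloc(size, sizeof(H3Index));
            if (h3Line(a, b, line) == 0) {
                for (int j = 0; j < size; j++) addSeed(line[j]);
                free(line);
                continue;
            }
            free(line);
        }
        addSeed(a);  // fallback: vertices only, as the algorithm does today
        addSeed(b);
    }
}
```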

mrdvt92 pushed commits to mrdvt92/h3 referencing this pull request on Jun 19, 2022: trivial fix to bboxHexRadius to guarantee containment