-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: extending the basic spatial join #72
Comments
|
Hello! What's the priority on this feature request? I'm interested in performing left joins (a left outer join) with this library. |
@alxmrs we don't really have a capacity to work on dask-geopandas beyond maintenance these days. I'd be happy to review a PR if someone wants to give it a go but it is unlikely that any of us will try to implement left joins anytime soon. |
Hey Martin. That is totally fair. I might be interested in submitting a PR
later this year. Before I do, may I reach out to someone on your team for
pointers?
…On Wed, Feb 28, 2024 at 5:07 PM Martin Fleischmann ***@***.***> wrote:
@alxmrs <https://github.com/alxmrs> we don't really have a capacity to
work on dask-geopandas beyond maintenance these days. I'd be happy to
review a PR if someone wants to give it a go but it is unlikely that any of
us will try to implement left joins anytime soon.
—
Reply to this email directly, view it on GitHub
<#72 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AARXABZG2PVDQSN7QYDEPXTYV36UZAVCNFSM5AAGS7C2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCOJWHA3DIMJTG43A>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Sure, if you have anything to discuss leave it in this issue. |
#54 added a basic spatial join function. It's a naive cross product of all partitions of both left and right (if no spatial partitioning information is available) or all partition combinations that intersect (based on the spatial partitioning information).
This works fine for an inner join and when your spatial partitions are already reasonable. There are however several aspects still to consider.
Supporting a left join
So currently we only allow
how="inner"
, as for this case the naive version works nicely. However, a "left" join becomes more complicated.Suppose we have a certain partition of the left frame that intersects with two partitions of the right frame. We perform a normal spatial join (using
geopandas.sjoin
) for each of those two combinations, resulting in two output partitions. But when doing a left join,geopandas.sjoin
preserves the rows of the left dataframe that don't match with a geometry of the right dataframe. But so if we do this multiple times, it means that rows of the left dataframe can end up in multiple output partitions (instead of only in one of the output partitions, as is the case for an inner join). As a result, we would end up with a dask GeoDataFrame with duplicated rows.Improving the output (spatial) partitions
Because the current implementation combines each partition of the left frame with each (intersecting) partition of the right dataframe, the result inherently has more and smaller partitions in the output.
Suppose again we have a certain partition of the left frame that intersects with two partitions of the right frame. Each of those two combinations is currently a task in the graph that ends up as a new partition in the output dataframe.
Overall (and depending on the exact (spatial) partitioning of the input), this can lead to a "fragmented" output dask GeoDataFrame with many smaller partitions, which could negatively impact (the performance of) further computations.
It might be interesting to look into "merging" chunks again (eg all output partitions coming from a certain partition of the left input could be combined (concatted) into a single output partition).
Repartition the input before performing a spatial join?
For pre-partitioned left and right datasets, the currently implemented approach (joining of geometries by partition resulting from intersection of partitions) is probably fine. But there will also be cases where it could be interesting to repartition input before performing the join.
One example case (copying from @brendan-ward mail conversation): for a target dataset (assume this has the greater number of features) that has already been partitioned, partition the query dataset to match the target (which, I guess, is the same as above, just that query dataset is a partition size of 1) - which may involve cutting the query dataset by the bounding boxes of the intersecting partitions.
The text was updated successfully, but these errors were encountered: