Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: extending the basic spatial join #72

Open
jorisvandenbossche opened this issue Jul 8, 2021 · 5 comments
Open

ENH: extending the basic spatial join #72

jorisvandenbossche opened this issue Jul 8, 2021 · 5 comments

Comments

@jorisvandenbossche
Copy link
Member

#54 added a basic spatial join function. It's a naive cross product of all partitions of both left and right (if no spatial partitioning information is available) or all partition combinations that intersect (based on the spatial partitioning information).

This works fine for an inner join and when your spatial partitions are already reasonable. There are however several aspects still to consider.

Supporting a left join

So currently we only allow how="inner", as for this case the naive version works nicely. However, a "left" join becomes more complicated.
Suppose we have a certain partition of the left frame that intersects with two partitions of the right frame. We perform a normal spatial join (using geopandas.sjoin) for each of those two combinations, resulting in two output partitions. But when doing a left join, geopandas.sjoin preserves the rows of the left dataframe that don't match with a geometry of the right dataframe. But so if we do this multiple times, it means that rows of the left dataframe can end up in multiple output partitions (instead of only in one of the output partitions, as is the case for an inner join). As a result, we would end up with a dask GeoDataFrame with duplicated rows.

Improving the output (spatial) partitions

Because the current implementation combines each partition of the left frame with each (intersecting) partition of the right dataframe, the result inherently has more and smaller partitions in the output.
Suppose again we have a certain partition of the left frame that intersects with two partitions of the right frame. Each of those two combinations is currently a task in the graph that ends up as a new partition in the output dataframe.

Overall (and depending on the exact (spatial) partitioning of the input), this can lead to a "fragmented" output dask GeoDataFrame with many smaller partitions, which could negatively impact (the performance of) further computations.
It might be interesting to look into "merging" chunks again (eg all output partitions coming from a certain partition of the left input could be combined (concatted) into a single output partition).

Repartition the input before performing a spatial join?

For pre-partitioned left and right datasets, the currently implemented approach (joining of geometries by partition resulting from intersection of partitions) is probably fine. But there will also be cases where it could be interesting to repartition input before performing the join.

One example case (copying from @brendan-ward mail conversation): for a target dataset (assume this has the greater number of features) that has already been partitioned, partition the query dataset to match the target (which, I guess, is the same as above, just that query dataset is a partition size of 1) - which may involve cutting the query dataset by the bounding boxes of the intersecting partitions.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Sep 19, 2021

Supporting a left join
... when doing a left join, geopandas.sjoin preserves the rows of the left dataframe that don't match with a geometry of the right dataframe. But so if we do this multiple times, it means that rows of the left dataframe can end up in multiple output partitions

One way to "relatively easy" do this, is to always do at least one "left join" for each partition of the left dataframe, and then inner joins for any additional sjoin calls for that same partition. (unfortunately, this would still give potentially duplicated rows)

@alxmrs
Copy link

alxmrs commented Feb 20, 2024

Hello! What's the priority on this feature request? I'm interested in performing left joins (a left outer join) with this library.

@martinfleis
Copy link
Member

@alxmrs we don't really have a capacity to work on dask-geopandas beyond maintenance these days. I'd be happy to review a PR if someone wants to give it a go but it is unlikely that any of us will try to implement left joins anytime soon.

@alxmrs
Copy link

alxmrs commented Feb 28, 2024 via email

@martinfleis
Copy link
Member

Sure, if you have anything to discuss leave it in this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants