ENH: extending the basic spatial join #72

jorisvandenbossche · 2021-07-08T08:06:19Z

#54 added a basic spatial join function. It's a naive cross product of all partitions of both left and right (if no spatial partitioning information is available) or all partition combinations that intersect (based on the spatial partitioning information).

This works fine for an inner join and when your spatial partitions are already reasonable. There are however several aspects still to consider.

Supporting a left join

So currently we only allow how="inner", as for this case the naive version works nicely. However, a "left" join becomes more complicated.
Suppose we have a certain partition of the left frame that intersects with two partitions of the right frame. We perform a normal spatial join (using geopandas.sjoin) for each of those two combinations, resulting in two output partitions. But when doing a left join, geopandas.sjoin preserves the rows of the left dataframe that don't match with a geometry of the right dataframe. But so if we do this multiple times, it means that rows of the left dataframe can end up in multiple output partitions (instead of only in one of the output partitions, as is the case for an inner join). As a result, we would end up with a dask GeoDataFrame with duplicated rows.

Improving the output (spatial) partitions

Because the current implementation combines each partition of the left frame with each (intersecting) partition of the right dataframe, the result inherently has more and smaller partitions in the output.
Suppose again we have a certain partition of the left frame that intersects with two partitions of the right frame. Each of those two combinations is currently a task in the graph that ends up as a new partition in the output dataframe.

Overall (and depending on the exact (spatial) partitioning of the input), this can lead to a "fragmented" output dask GeoDataFrame with many smaller partitions, which could negatively impact (the performance of) further computations.
It might be interesting to look into "merging" chunks again (eg all output partitions coming from a certain partition of the left input could be combined (concatted) into a single output partition).

Repartition the input before performing a spatial join?

For pre-partitioned left and right datasets, the currently implemented approach (joining of geometries by partition resulting from intersection of partitions) is probably fine. But there will also be cases where it could be interesting to repartition input before performing the join.

One example case (copying from @brendan-ward mail conversation): for a target dataset (assume this has the greater number of features) that has already been partitioned, partition the query dataset to match the target (which, I guess, is the same as above, just that query dataset is a partition size of 1) - which may involve cutting the query dataset by the bounding boxes of the intersecting partitions.

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2021-09-19T07:12:08Z

Supporting a left join
... when doing a left join, geopandas.sjoin preserves the rows of the left dataframe that don't match with a geometry of the right dataframe. But so if we do this multiple times, it means that rows of the left dataframe can end up in multiple output partitions

~~One way to "relatively easy" do this, is to always do at least one "left join" for each partition of the left dataframe, and then inner joins for any additional sjoin calls for that same partition.~~ (unfortunately, this would still give potentially duplicated rows)

alxmrs · 2024-02-20T07:56:31Z

Hello! What's the priority on this feature request? I'm interested in performing left joins (a left outer join) with this library.

martinfleis · 2024-02-28T10:06:56Z

@alxmrs we don't really have a capacity to work on dask-geopandas beyond maintenance these days. I'd be happy to review a PR if someone wants to give it a go but it is unlikely that any of us will try to implement left joins anytime soon.

alxmrs · 2024-02-28T10:21:02Z

Hey Martin. That is totally fair. I might be interested in submitting a PR later this year. Before I do, may I reach out to someone on your team for pointers?

…

On Wed, Feb 28, 2024 at 5:07 PM Martin Fleischmann ***@***.***> wrote: @alxmrs <https://github.com/alxmrs> we don't really have a capacity to work on dask-geopandas beyond maintenance these days. I'd be happy to review a PR if someone wants to give it a go but it is unlikely that any of us will try to implement left joins anytime soon. — Reply to this email directly, view it on GitHub <#72 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AARXABZG2PVDQSN7QYDEPXTYV36UZAVCNFSM5AAGS7C2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCOJWHA3DIMJTG43A> . You are receiving this because you were mentioned.Message ID: ***@***.***>

martinfleis · 2024-02-28T10:29:42Z

Sure, if you have anything to discuss leave it in this issue.

jorisvandenbossche mentioned this issue Jul 8, 2021

ENH: basic spatial join #54

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: extending the basic spatial join #72

ENH: extending the basic spatial join #72

jorisvandenbossche commented Jul 8, 2021

jorisvandenbossche commented Sep 19, 2021 •

edited

Loading

alxmrs commented Feb 20, 2024

martinfleis commented Feb 28, 2024

alxmrs commented Feb 28, 2024 via email

martinfleis commented Feb 28, 2024

ENH: extending the basic spatial join #72

ENH: extending the basic spatial join #72

Comments

jorisvandenbossche commented Jul 8, 2021

jorisvandenbossche commented Sep 19, 2021 • edited Loading

alxmrs commented Feb 20, 2024

martinfleis commented Feb 28, 2024

alxmrs commented Feb 28, 2024 via email

martinfleis commented Feb 28, 2024

jorisvandenbossche commented Sep 19, 2021 •

edited

Loading