docs(blog): ibis, duckdb and lonboard for overture maps #10143

Closed
wants to merge 107 commits into from
de4c7eb
docs(blog): ibis-duckdb and lonboard for overture maps
ncclementi Sep 13, 2024
c9a749b
chore: add visualizations as pngs
ncclementi Sep 16, 2024
a943dac
chore: add freeze false
ncclementi Sep 16, 2024
e941a3b
chore: apply code review comments
ncclementi Sep 16, 2024
f9b905b
fix(datafusion): raise when attempting to create temp table (#10072)
ncclementi Sep 10, 2024
f64a7c6
depr(selectors): deprecate `c` and `r` selectors in favor of `cols` a…
jcrist Sep 10, 2024
f329636
depr(api): deprecate `bool_val.negate()`/`-bool_val` in favor of `~bo…
jcrist Sep 10, 2024
bd06b3f
test(backends): use slightly more useful assertions for call count
cpcloud Sep 10, 2024
d8b2510
test(bigquery): make test non-strict xfail (#10081)
cpcloud Sep 10, 2024
9050b5c
feat(mssql): add lpad and rpad ops (#10060)
IndexSeek Sep 10, 2024
bcee24f
refactor(mssql): simplify lpad and rpad ops (#10085)
IndexSeek Sep 10, 2024
4ec6cac
doc: add docstring example for `topk`
jcrist Sep 10, 2024
abf0863
feat(api): add `name` argument to `topk`
jcrist Sep 10, 2024
521958b
feat(api): add `name` argument to `value_counts`
jcrist Sep 10, 2024
0ae44aa
test(polars): replace register with create_table
IndexSeek Sep 11, 2024
2b22433
fix(docs): update invalid read_parquet link
IndexSeek Sep 11, 2024
20e0bb7
perf(backends): speed up most memtable existence checks (#10067)
cpcloud Sep 11, 2024
11fb399
ci(backends): run backend doctests in CI (#9970)
cpcloud Sep 11, 2024
c3a7481
revert: fix(datafusion): raise when attempting to create temp table (…
gforsyth Sep 11, 2024
39ebbb5
ci: test examples (#10098)
cpcloud Sep 11, 2024
a507754
chore(release): 9.5.0
semantic-release-bot Sep 11, 2024
c1fa0b5
refactor(sql): simplify paren handling for binary ops
jcrist Sep 11, 2024
d0e72e9
chore(deps): update actions/create-github-app-token action to v1.11.0…
renovate[bot] Sep 12, 2024
f15b1e6
docs(dropdowns): make dropdowns scrollable and easier to see in navig…
ramlakhanmadheshiya Sep 12, 2024
6b7ceb8
test(polars): unxfail polars timestamp truncation tests by casting th…
cpcloud Sep 12, 2024
bbf4294
chore(deps): bump rich lower bound for docs/dev work (#10105)
gforsyth Sep 12, 2024
5a6f29f
docs: avoid needing to render API docs for any preview/render invocat…
cpcloud Sep 12, 2024
a88f384
refactor(duckdb): replace register usage with read
IndexSeek Sep 13, 2024
21877db
test(datafusion): replace register with create_table or read
IndexSeek Sep 13, 2024
d203eb4
refactor(dask): remove the dask backend
cpcloud Aug 4, 2024
21a858c
ci: disable verification of removed deprecations
cpcloud Sep 12, 2024
6c07598
chore: remove unused snapshots (#10107)
cpcloud Sep 13, 2024
a22527f
test(sigcheck): check function signature parity across backends (#10008)
gforsyth Sep 13, 2024
e3a02b8
test: remove dask backend marker remnants (#10114)
cpcloud Sep 13, 2024
d2a9060
fix(mysql): add dtype mapping for `mediumint`
gforsyth Sep 13, 2024
198b886
docs(how-to): fix the `ffill`/`bfill` how-to guide
deepyaman Sep 13, 2024
87ee637
docs(security): update security report address to point to private Zu…
gforsyth Sep 13, 2024
207f334
docs(code_of_conduct): update committee members and reporting email (…
gforsyth Sep 13, 2024
9f9f953
refactor(pandas): remove the pandas backend
cpcloud Aug 5, 2024
d58c619
ci: remove pandas backend jobs
cpcloud Sep 13, 2024
6d566d1
test: skip generic operation test that uses pandas if pandas is not i…
cpcloud Sep 13, 2024
b0603a2
test: remove dead fixture
cpcloud Sep 13, 2024
83606f5
test(dot-columns): add benchmark for columns property access
cpcloud Aug 26, 2024
e0e79ae
perf(api): return `tuple` from `Table.columns` instead of `list`
cpcloud Aug 26, 2024
becc6ea
test(backends): fix backend tests that assume a list
cpcloud Aug 26, 2024
3df1e32
chore(deps): update apache/impala docker tag to v4.4.1 (#10126)
renovate[bot] Sep 14, 2024
79c246e
refactor(bigquery): remove unnecessary and misspelled bigquery string…
tswast Sep 14, 2024
d8debde
fix(joins): allow chaining positional and cross joins (#10122)
gforsyth Sep 14, 2024
d49767b
chore(deps): bump poetry2nix and nixpkgs (#10127)
cpcloud Sep 14, 2024
ba36c77
ci: use jupyter cache in docs ci builds to speed up docs builds (#10128)
cpcloud Sep 14, 2024
019aed4
ci: `before` is a top level field (#10129)
cpcloud Sep 14, 2024
4d8a3b3
fix(repr): remove expression printing from exception message (#10130)
cpcloud Sep 15, 2024
01bfb6f
chore(deps): remove the `pandas` extra (#10132)
cpcloud Sep 15, 2024
3d7d530
fix(deps): update dependency sqlglot to >=23.4,<25.22 (#10109)
renovate[bot] Sep 15, 2024
1dc97a4
refactor(table-api): unify exception type for all backends to `TableN…
ncclementi Sep 16, 2024
b54f528
test(bigquery): check the correct exception for missing tables (#10137)
cpcloud Sep 16, 2024
dc16506
feat(api): add `distinct` option to `collect`
jcrist Sep 13, 2024
2b95c18
docs(api): avoid quartodoc warning about missing parameter
cpcloud Sep 16, 2024
317b055
chore: remove unused `Dispatched` utility
jcrist Sep 16, 2024
25ef934
docs(datafusion): add datafusion nyc presentation (#10141)
gforsyth Sep 16, 2024
7055525
chore(deps): update bitnami/minio docker tag to v2024.9.13 (#10146)
renovate[bot] Sep 17, 2024
3ad6e6d
chore(deps): update apache/druid docker tag to v30.0.1 (#10145)
renovate[bot] Sep 17, 2024
f910cef
chore(deps): lock file maintenance (#10134)
renovate[bot] Sep 17, 2024
3a83aa4
refactor(padding): follow python string padding conventions (#10096)
gforsyth Sep 17, 2024
75dac5c
fix(deps): update dependency datafusion to v41 (#10147)
renovate[bot] Sep 17, 2024
bde00c3
docs(datafusion): assorted edits to datafusion meetup talk (#10144)
gforsyth Sep 17, 2024
14af3f4
docs(datafusion): update talk title (#10150)
gforsyth Sep 17, 2024
6aa7c44
chore(deps): update ghcr.io/risingwavelabs/risingwave docker tag to v…
renovate[bot] Sep 18, 2024
e0d3dde
chore(deps): update trinodb/trino docker tag to v458 (#10155)
renovate[bot] Sep 18, 2024
64e97c9
chore: show deprecation warning at caller level (#10154)
NickCrews Sep 18, 2024
147f3f1
ci(pyspark): name the output path when downloading jar (#10156)
cpcloud Sep 18, 2024
59ae216
ci: run datafusion tests in series to avoid high memory usage (#10158)
cpcloud Sep 18, 2024
c3a7b43
ci(bigquery): avoid race condition in create table by using a dataset…
cpcloud Sep 18, 2024
c33c263
chore(mysql): port to MySQLdb instead of pymysql (#10077)
cpcloud Sep 18, 2024
9402725
docs(datafusion): add imdb live demo reference to end of presentation…
gforsyth Sep 18, 2024
c5ccb99
docs(table-expr): include inherited methods (all `to_*` methods) (#10…
gforsyth Sep 18, 2024
1ae4bd1
docs(synonyms): add synonym list to redirect searches with no results…
gforsyth Sep 18, 2024
2aa51ae
test(bigquery): avoid trying to clobber existing tables by generating…
cpcloud Sep 18, 2024
f34106f
fix(datatype-parsing): ensure that geospatial types are round trippab…
cpcloud Sep 19, 2024
3e9396c
docs(bigquery): add update-adc flag to gcloud auth login (#10172)
cpcloud Sep 19, 2024
12023e6
fix(mssql): ensure `ibis.random()` generates a new value per call (#1…
jcrist Sep 19, 2024
e5c11af
docs(clickhouse): entry into the accursed (#10174)
cpcloud Sep 19, 2024
fcd7564
chore: add governance paragraph and link to governance doc to README …
ncclementi Sep 20, 2024
7ca2fdc
fix(deps): update dependency sqlglot to >=23.4,<25.23 (#10176)
renovate[bot] Sep 20, 2024
7c6a176
test(bigquery): ensure that quoting test generates unique table name …
cpcloud Sep 20, 2024
2cb5771
chore(docs): clean up current pyodide build (#10180)
cpcloud Sep 20, 2024
fc626cc
ci: add automatic pull request labels (#10181)
cpcloud Sep 20, 2024
4957854
refactor(api): remove schema (#10149)
ncclementi Sep 20, 2024
19c3845
docs(build): fetch all commits to enable proper dynamic versioning in…
cpcloud Sep 20, 2024
a563504
feat(bigquery): non-nullable schema support for embedded fields in st…
ssabdb Sep 21, 2024
cf39ea0
chore: remove `poetry.lock` from nix label patterns (#10192)
cpcloud Sep 23, 2024
21d748c
chore(deps): lock file maintenance (#10191)
renovate[bot] Sep 23, 2024
7cdd2ee
chore(deps): update dependency itables to >=1.6.3,<2.3 (#10190)
renovate[bot] Sep 23, 2024
b0060e4
feat(polars): allow user to specify "engine" kwarg (#10151)
deepyaman Sep 23, 2024
0284736
fix(polars): use elementwise flatten to flatten nested arrays (#10168)
cpcloud Sep 23, 2024
39362cd
test(bigquery): xfail nested flatten test (#10194)
cpcloud Sep 23, 2024
13f9291
fix(api): use `to_pyarrow()` instead of `execute()` when pretty print…
cpcloud Sep 23, 2024
cd9ee1b
refactor(joins): require explicit abstract table as RHS of joins (#9661)
gforsyth Sep 23, 2024
b7135e7
feat(pyspark): add official support and ci testing with spark connect…
cpcloud Sep 23, 2024
3c223de
fix(snowflake): apply casting logic for json output to scalars (#10202)
cpcloud Sep 24, 2024
8191a05
chore(value): remove deprecated `greatest` and `least` methods
gforsyth Sep 23, 2024
7689b6d
chore(api): remove deprecated `where` methodism
gforsyth Sep 23, 2024
5d0e807
chore(api): remove top-level geo functions
gforsyth Sep 23, 2024
67ee2b4
chore(api): remove top-level `negate` function
gforsyth Sep 23, 2024
d16d0c7
refactor(pyarrow-format): avoid constructing unnecessary array to pro…
cpcloud Sep 24, 2024
8f66695
chore(deps): bump nix flake dependencies (#10203)
cpcloud Sep 24, 2024
92f1a89
chore: udpate after duckdb 1.1.1 release
ncclementi Sep 24, 2024
Binary file added docs/posts/ibis-overturemaps/ca-power-plants.png
211 changes: 211 additions & 0 deletions docs/posts/ibis-overturemaps/index.qmd
@@ -0,0 +1,211 @@
---
title: "From query to plot: Exploring GeoParquet Overture Maps with Ibis, DuckDB, and Lonboard"
author: Naty Clementi and Kyle Barron
date: 2024-09-13
categories:
- blog
- duckdb
- overturemaps
- lonboard
- geospatial
---

With the release of `DuckDB 1.1.0`, we now have support for reading GeoParquet
files! With this exciting update we can query rich datasets from Overture Maps using Python via Ibis with the performance of `DuckDB`.

But the good news doesn't stop there: since `Ibis 9.2`, `lonboard` can plot data directly from an `Ibis` table, adding more simplicity and speed to your geospatial analysis.

Let’s dive into how these tools come together.

## Installation

Install Ibis with the dependencies needed to work with geospatial data using DuckDB. To read GeoParquet files, the DuckDB version must be `>=1.1.0`.

::: {.callout-note}
At the moment, DuckDB 1.1.0 has a bug that prevents us from querying the data and writing it to Parquet using Ibis and DuckDB, so we are installing the `overturemaps` CLI to get the data.
:::

```bash
$ pip install 'ibis-framework[duckdb,geospatial]' lonboard overturemaps
```

## Motivation

I'd start with a sentence intro about what Overture Maps is. Like

Overture Maps is a project to build map data products on top of a variety of sources, like OpenStreetMap.

(not sure that's the best one-liner description of what Overture Maps is, any zingers @jwass ?)

Overture Maps offers a variety of datasets to query, but we thought that it would
be interesting to see some plots related to the power infrastructure. We'll look
into power plants, and power lines of most of the USA (excluding territories and
Alaska for simplicity of the bounding box).

## Download data

**NOTE: not sure if I should have this here or avoid it because it's broken**

With Ibis and DuckDB we could download only the columns we need, but at
the moment this fails due to a bug.

For future reference when this gets fixed, the expression would look like this:

```python
import ibis
from ibis import _

con = ibis.get_backend()

# look into type infrastructure
url = "s3://overturemaps-us-west-2/release/2024-07-22.0/theme=base/type=infrastructure/*"
t = con.read_parquet(url, table_name="infra-usa")

# filter for the USA bounding box and subtype="power", then select only a few columns
expr = t.filter(
    _.bbox.xmin > -125.0,
    _.bbox.ymin > 24.8,
    _.bbox.xmax < -65.8,
    _.bbox.ymax < 49.2,

This is fine for this blog post, but it would also be nice if there was some way to put this into a read_geoparquet query so that the user doesn't have to reach into these .bbox.xmin columns themselves.

Contributor Author

I'm not sure I follow, is this like a feature request for Ibis to implement a GeoParquet reader for Overture Maps?

DuckDB does the same thing see example https://docs.overturemaps.org/getting-data/duckdb/ , and for the record these filters get push down so you only download what's needed instead of downloading and then filtering.

    _.subtype == "power",
).select(["names", "geometry", "class", "sources", "source_tags"])

```
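
To make the bounding-box condition concrete, here is a minimal pure-Python sketch of the same four comparisons. `BBox`, `bbox_within`, and the sample feature boxes are hypothetical names and made-up values used only for illustration; they are not part of Ibis or the Overture Maps schema.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned box in lon/lat degrees (hypothetical helper, not an Ibis type)."""

    xmin: float
    ymin: float
    xmax: float
    ymax: float


# Same numbers as the filter above.
USA = BBox(xmin=-125.0, ymin=24.8, xmax=-65.8, ymax=49.2)


def bbox_within(feature: BBox, query: BBox) -> bool:
    """True when the feature's box lies strictly inside the query box."""
    return (
        feature.xmin > query.xmin
        and feature.ymin > query.ymin
        and feature.xmax < query.xmax
        and feature.ymax < query.ymax
    )


# Made-up feature boxes, for illustration only.
near_lancaster_ca = BBox(-118.2, 34.6, -118.1, 34.7)
in_alaska = BBox(-150.0, 61.0, -149.8, 61.3)
straddles_edge = BBox(-125.5, 40.0, -124.5, 41.0)

print(bbox_within(near_lancaster_ca, USA))  # True
print(bbox_within(in_alaska, USA))  # False
print(bbox_within(straddles_edge, USA))  # False
```

Note that a feature whose box crosses the edge of the query box is dropped, since all four comparisons must hold.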

::: {.callout-note}
If you inspect `expr`, you can see that the filters and projections get pushed down,
meaning you only download the data that you asked for.
:::

```python
# to write it to a parquet file we would do (currently broken)
con.to_parquet(expr, "power-infra-usa.parquet")
```

But, no worries! Overture Maps has a nice [Python CLI](https://docs.overturemaps.org/getting-data/overturemaps-py/) that allows us to download

You can also do this from Python in case you wanted to avoid reaching for the command line, but this is fine.

Contributor Author

I just learned that the DuckDB bug fix will go out next week (see Max's comment below), so I'll rewrite this part to read download the data that way.

some of the data. We can filter by bounding box and type, but any other filters will have to happen afterwards.

```bash
$ overturemaps download --bbox=-125.0,24.8,-65.8,49.2 -f geoparquet --type=infrastructure -o usa-infra.geoparquet
```
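
If you would rather assemble that command from Python, the sketch below builds the same argument list from a bounding-box tuple. The helper name `overturemaps_download_args` is hypothetical, and the west, south, east, north ordering of `--bbox` is an assumption based on the values used above; check the CLI help before relying on it.

```python
import shlex


def overturemaps_download_args(bbox, feature_type, output, fmt="geoparquet"):
    """Build the argv list for `overturemaps download` (hypothetical helper).

    `bbox` is assumed to be (west, south, east, north) in degrees.
    """
    west, south, east, north = bbox
    if not (west < east and south < north):
        raise ValueError("bbox must be (west, south, east, north)")
    return [
        "overturemaps",
        "download",
        f"--bbox={west},{south},{east},{north}",
        "-f",
        fmt,
        f"--type={feature_type}",
        "-o",
        output,
    ]


args = overturemaps_download_args(
    (-125.0, 24.8, -65.8, 49.2), "infrastructure", "usa-infra.geoparquet"
)
print(shlex.join(args))

# To actually run the download (requires the overturemaps CLI to be installed):
# import subprocess
# subprocess.run(args, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues with the comma-separated bounding box.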

Now that we have the data, let's explore it in Ibis interactive mode and make some beautiful maps.

## Data exploration

```{python}
import ibis
from ibis import _

ibis.options.interactive = True
con = ibis.get_backend() # default duckdb backend
```

```{python}
usa_infra = con.read_parquet("usa-infra.geoparquet")
usa_infra
```

We take a look at the subtypes of infrastructure:

```{python}
usa_infra.subtype.value_counts().preview(max_rows=15)
```

If we want to look at power lines we need to look into `subtype=="power"`
and see what kind of `class` we have in there.

```{python}
(
    usa_infra.filter(_.subtype == "power")["class"]
    .value_counts()
    .order_by(ibis.desc("class_count"))
)
```

Looks like we have `plant`, `power_line`, and `minor_line` classes, so we can make some nice maps.

```{python}
plants = usa_infra.filter(_.subtype == "power", usa_infra["class"] == "plant")
power_lines = usa_infra.filter(_.subtype == "power", usa_infra["class"] == "power_line")
minor_lines = usa_infra.filter(_.subtype == "power", usa_infra["class"] == "minor_line")
```


## Plotting

**Note: maybe here explain why lonboard ?**

::: {.callout-note}
You can try this on your machine. To keep the blog's file size down, we show
screenshots of the visualizations here.
:::

```python
import lonboard
from lonboard.basemap import CartoBasemap # to choose color of basemap
```

Let's visualize the power plants:

```python
lonboard.viz(
    plants,
    scatterplot_kwargs={"get_fill_color": "red"},
    polygon_kwargs={"get_fill_color": "red"},
    map_kwargs={
        "basemap_style": CartoBasemap.Positron,
        "view_state": {"longitude": -100, "latitude": 36, "zoom": 3},
    },
)
```

![Power plants in the USA](usa-power-plants.png)

If you are running this on your machine, you can zoom in and see the geometry of
the sites where the plants are located. As an example, we can plot a small
area of California:

```python
plants_CA = plants.filter(
    _.bbox.xmin.between(-118.6, -117.9),
    _.bbox.ymin.between(34.5, 35.3),
).select(_.names.primary, _.geometry)

Nit: I'd suggest using standard black/ruff code formatting for all Python code blocks

Contributor Author

Good catch, I'll fix this.

```

```python
lonboard.viz(
    plants_CA,
    scatterplot_kwargs={"get_fill_color": "red"},
    polygon_kwargs={"get_fill_color": "red"},
    map_kwargs={"basemap_style": CartoBasemap.Positron},
)
```

![Power plants near Lancaster CA](ca-power-plants.png)

We can also visualize the `power_lines` and the `minor_lines` together:


```python
lonboard.viz([minor_lines, power_lines])
```

![Minor and Power lines of USA](usa-power-and-minor-lines.png)

and that's how you can visualize ~7 million points from the comfort of
your laptop.

Note: I got the ~7M by adding the number of points in power_lines and minor_lines. I'm not sure if plotting the lines that connect these points adds points to this.

It is more effort for the GPU to render lines than to render points, but a line isn't rendered as a collection of points.

It might be clearer to say 7 million coordinates rather than points.


```python
>>> power_lines.geometry.n_points().sum()
5329836
>>> minor_lines.geometry.n_points().sum()
1430042
```
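
As a quick check of the "~7 million" figure, the two sums reported above can be added directly. This is plain arithmetic on the numbers shown, not a new measurement; the variable names are just for illustration.

```python
# Sums reported by n_points() above.
power_line_points = 5_329_836
minor_line_points = 1_430_042

total = power_line_points + minor_line_points
print(total)  # 6759878
print(f"{total / 1e6:.1f} million")  # 6.8 million
```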

With Ibis and DuckDB, working with geospatial data has never been easier or faster.
We saw how to query a dataset from Overture Maps with the simplicity of Python and
the performance of DuckDB. Last but not least, we saw how simply and quickly lonboard
got us from query to plot. Together, these libraries make exploring and handling
geospatial data a breeze.


## Resources
- [Ibis Docs](https://ibis-project.org/)
- [Lonboard Docs](https://developmentseed.org/lonboard/latest/)
- [DuckDB spatial extension](https://duckdb.org/docs/extensions/spatial.html)
- [DuckDB spatial functions docs](https://github.com/duckdb/duckdb_spatial/blob/main/docs/functions.md)

Chat with us on Zulip:

- [Ibis Zulip Chat](https://ibis-project.zulipchat.com/)