docs(arrays): update blog post to include unnest examples
cpcloud committed Sep 24, 2023
1 parent 167c3bd commit e765712
Showing 2 changed files with 85 additions and 26 deletions.

107 changes: 83 additions & 24 deletions docs/posts/bigquery-arrays/index.qmd
@@ -26,20 +26,22 @@ First we'll connect to BigQuery and pluck out a table to work with.
We'll start with `from ibis.interactive import *` for maximum convenience.

```{python}
from ibis.interactive import *
from ibis.interactive import * # <1>
con = ibis.connect("bigquery://ibis-gbq") # <1>
con.set_database("bigquery-public-data.imdb") # <2>
con = ibis.connect("bigquery://ibis-gbq") # <2>
con.set_database("bigquery-public-data.imdb") # <3>
```

1. Connect to the **billing** project. Compute (but not storage) is billed to
this project.
2. Set the database to the project and dataset that we will use for analysis.
1. `from ibis.interactive import *` imports Ibis APIs into the global namespace
and enables [interactive mode](../../how-to/configure/basics.qmd#interactive-mode).
2. Connect to Google BigQuery. Compute (but not storage) is billed to the
project you connect to--`ibis-gbq` in this case.
3. Set the database to the project and dataset that we will use for analysis.

Let's look at the tables in this dataset:

```{python}
con.list_tables()
con.tables
```

Let's pull out the `name_basics` table, which contains names and metadata about
@@ -136,7 +138,7 @@ non_actors
We can remove elements from arrays too.

::: {.callout-note}
## `remove()` does not mutate the underlying data
## [`remove()`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.remove) does not mutate the underlying data
:::
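
As a rough illustration of the callout above, here is a plain-Python analogy (not Ibis code). It assumes `remove()` drops every occurrence of the given value and returns a new array; the sample list is invented.

```python
# Plain-Python analogy for a non-mutating "remove" on an array value.
# Assumption: every occurrence of the value is dropped.
professions = ["actor", "producer", "actor"]

# Build a new list instead of modifying `professions` in place.
without_actor = [p for p in professions if p != "actor"]

print(without_actor)  # the new list
print(professions)    # the original list, unchanged
```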

Let's see who only has "actor" in the list of their primary professions:
@@ -169,7 +171,7 @@ and
[`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect)
APIs.

### Union
Let's take a look at `intersect`:

### Intersection

@@ -197,35 +199,92 @@ shared_titles

## Advanced operations

### `unnest`
### Flatten arrays into rows

As of version 7.0.0 Ibis does not support its native `unnest` API for BigQuery,
but we plan to add it in the future.
Thanks to the [tireless
efforts](https://github.com/tobymao/sqlglot/commit/06e0869e7aa5714d77e6ec763da38d6a422965fa)
of the [folks](https://github.com/tobymao/sqlglot/graphs/contributors) working
on [`sqlglot`](https://github.com/tobymao/sqlglot), as of version 7.0.0 Ibis
supports
[`unnest`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.unnest)
for BigQuery!

For now, you can use `con.sql` to construct an Ibis expression from a BigQuery
SQL string that contains `UNNEST` calls:
You can use it standalone on a column expression:

Despite lack of native `UNNEST` support, many use cases for `UNNEST` are met by
the
[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter)
and
[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map)
operations on array expressions.
```{python}
ents.primary_profession.unnest()
```

You can also use it in `select`/`mutate` calls to expand the table accordingly:

```{python}
ents.mutate(primary_profession=_.primary_profession.unnest())
```
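
As a mental model for what unnesting does to a table, here is a minimal plain-Python sketch. It is not Ibis code: the rows are invented, and `unnest_rows` is a hypothetical helper. It also assumes that a row whose array is empty produces no output rows, which is how a SQL cross join against `UNNEST` typically behaves.

```python
# Plain-Python sketch of unnesting an array column (not Ibis code).
# The rows are invented for illustration; they are not IMDB data.
rows = [
    {"primary_name": "A", "primary_profession": ["actor", "producer"]},
    {"primary_name": "B", "primary_profession": ["editor"]},
    {"primary_name": "C", "primary_profession": []},
]


def unnest_rows(rows, column):
    """Yield one output row per element of each row's array in `column`."""
    for row in rows:
        for element in row[column]:
            # Copy the row, replacing the array with a single element.
            yield {**row, column: element}


unnested = list(unnest_rows(rows, "primary_profession"))
for row in unnested:
    print(row)
```

In this sketch, "A" appears twice (once per profession), "B" once, and "C" not at all.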

Unnesting can be useful when joining nested data.

Here we use unnest to find people known for any of the godfather movies:

```{python}
basics = con.tables.title_basics.filter( # <1>
[
_.title_type == "movie",
_.original_title.lower().startswith("the godfather"),
_.genres.lower().contains("crime"),
]
) # <1>
known_for_the_godfather = (
ents.mutate(tconst=_.known_for_titles.unnest()) # <2>
.join(basics, "tconst") # <3>
.select("primary_title", "primary_name") # <4>
.distinct()
.order_by(["primary_title", "primary_name"]) # <4>
)
known_for_the_godfather
```

1. Filter the `title_basics` data set to only the Godfather movies
2. Unnest the `known_for_titles` array column
3. Join with `basics` to get movie titles
4. Ensure that each entity is only listed once and sort the results
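
The four steps above can be sketched in plain Python to show the shape of the computation. This is an analogy under stated assumptions, not the Ibis implementation: the people, title IDs, and titles are all invented, and the join is modeled as an inner join on `tconst`.

```python
# Plain-Python sketch of unnest + join + distinct + sort (not Ibis code).
# All data below is invented for illustration.
people = [
    {"primary_name": "P1", "known_for_titles": ["tt1", "tt2"]},
    {"primary_name": "P2", "known_for_titles": ["tt2", "tt3"]},
    {"primary_name": "P3", "known_for_titles": ["tt3"]},
]
titles = [  # stands in for the already-filtered `basics` table
    {"tconst": "tt1", "primary_title": "Movie One"},
    {"tconst": "tt2", "primary_title": "Movie Two"},
]

# Index the filtered titles by their join key.
title_by_id = {t["tconst"]: t["primary_title"] for t in titles}

pairs = sorted(  # sort by title, then name
    {  # a set keeps each (title, name) pair once, like distinct()
        (title_by_id[tconst], person["primary_name"])
        for person in people
        for tconst in person["known_for_titles"]  # unnest the array
        if tconst in title_by_id  # inner join: keep only matching titles
    }
)
print(pairs)
```

Because the join is modeled as an inner join, "P3" drops out of the result: none of their titles survive the filter.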

Let's summarize by showing how many people are known for each Godfather movie:

```{python}
known_for_the_godfather.primary_title.value_counts()
```

### Filtering array elements

Show all people who are neither editors nor actors:
Filtering array elements can be done with the
[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter)
method, which applies a predicate to each array element and returns an array of
elements for which the predicate returns `True`.

This method is similar to Python's
[`filter`](https://docs.python.org/3.7/library/functions.html#filter) function.

Let's show all people who are neither editors nor actors:

```{python}
ents.mutate(
primary_profession=_.primary_profession.filter(
lambda pp: pp.isin(("actor", "editor"))
primary_profession=_.primary_profession.filter( # <1>
lambda pp: ~pp.isin(("actor", "editor"))
)
).filter(_.primary_profession.length() > 0)
).filter(_.primary_profession.length() > 0) # <2>
```

1. This `filter` call is applied to each array element
2. This `filter` call is applied to the table
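
The difference between the two `filter` calls can be sketched in plain Python. This is an analogy, not Ibis code, and the rows are invented.

```python
# Plain-Python sketch of element-level vs. table-level filtering.
# The rows are invented for illustration.
rows = [
    {"primary_name": "A", "primary_profession": ["actor", "producer"]},
    {"primary_name": "B", "primary_profession": ["actor", "editor"]},
    {"primary_name": "C", "primary_profession": ["writer"]},
]
excluded = ("actor", "editor")

# Step 1: filter inside each array, keeping elements that are not excluded.
filtered = [
    {
        **row,
        "primary_profession": [
            pp for pp in row["primary_profession"] if pp not in excluded
        ],
    }
    for row in rows
]

# Step 2: filter the table, keeping rows whose array is still non-empty.
result = [row for row in filtered if len(row["primary_profession"]) > 0]
print(result)
```

In this sketch, "A" is kept with `["producer"]` even though "actor" was originally among their professions. The two-step approach keeps anyone with at least one profession outside the excluded set, and drops only rows like "B" whose professions are all excluded.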

### Applying a function to array elements

You can apply a function to run an ibis expression on each element of an array
using the
[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map)
method.
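
In plain-Python terms, mapping over an array column applies a function to each element while keeping one output array per row. The sketch below is an analogy with invented rows, not Ibis code. It uses `len` only to show that the element type may change.

```python
# Plain-Python sketch of mapping a function over each array element.
# The rows are invented for illustration.
rows = [
    {"primary_name": "A", "primary_profession": ["actor", "producer"]},
    {"primary_name": "B", "primary_profession": []},
]

# Apply `len` to every element; each row keeps an array of the same length.
mapped = [
    {**row, "profession_lengths": [len(pp) for pp in row["primary_profession"]]}
    for row in rows
]
print(mapped)
```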

Let's normalize the case of primary_profession to upper case:

```{python}