diff --git a/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json b/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json index 288341b461fa..e504c118fe65 100644 --- a/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json +++ b/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "47440a350432e93f85e6ed6553cc40f0", + "hash": "01dfd143b9a173aa5325737aa228bd82", "result": { - "markdown": "---\ntitle: Working with arrays in Google BigQuery\nauthor: \"Phillip Cloud\"\ndate: \"2023-09-12\"\ncategories:\n - blog\n - bigquery\n - arrays\n - cloud\n---\n\n## Introduction\n\nIbis and BigQuery have [worked well together for years](https://cloud.google.com/blog/products/data-analytics/ibis-and-bigquery-scalable-analytics-comfort-python).\n\nIn Ibis 7.0.0, they work even better together with the addition of array\nfunctionality for BigQuery.\n\nLet's look at some examples using BigQuery's [IMDB sample\ndata](https://developer.imdb.com/non-commercial-datasets/).\n\n## Basics\n\nFirst we'll connect to BigQuery and pluck out a table to work with.\n\nWe'll start with `from ibis.interactive import *` for maximum convenience.\n\n::: {#75a9d26f .cell execution_count=1}\n``` {.python .cell-code}\nfrom ibis.interactive import *\n\ncon = ibis.connect(\"bigquery://ibis-gbq\") # <1>\ncon.set_database(\"bigquery-public-data.imdb\") # <2>\n```\n:::\n\n\n1. Connect to the **billing** project. Compute (but not storage) is billed to\n this project.\n2. 
Set the database to the project and dataset that we will use for analysis.\n\nLet's look at the tables in this dataset:\n\n::: {#203b6b28 .cell execution_count=2}\n``` {.python .cell-code}\ncon.list_tables()\n```\n\n::: {.cell-output .cell-output-display execution_count=2}\n```\n['name_basics',\n 'reviews',\n 'title_akas',\n 'title_basics',\n 'title_crew',\n 'title_episode',\n 'title_principals',\n 'title_ratings']\n```\n:::\n:::\n\n\nLet's pull out the `name_basics` table, which contains names and metadata about\npeople listed on IMDB. We'll call this `ents` (short for `entities`), and remove some\ncolumns we won't need:\n\n::: {#6229c913 .cell execution_count=3}\n``` {.python .cell-code}\nents = con.tables.name_basics.drop(\"birth_year\", \"death_year\")\nents\n```\n\n::: {.cell-output .cell-output-display execution_count=3}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name          primary_profession  known_for_titles    ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringstring              │\n├───────────┼──────────────────────┼────────────────────┼─────────────────────┤\n│ nm7200466Sam Townsend        NULLNULL                │\n│ nm7222639Marc Goula          NULLtt3185588           │\n│ nm7236451Charlie Furusho     NULLtt4548374           │\n│ nm7245943Cynthia Llanes      NULLNULL                │\n│ nm7252258Lance Hamner        NULLtt0247882           │\n│ nm7254706Paloma White        NULLNULL                │\n│ nm7256968Bart den Hartigh    NULLtt3947934           │\n│ nm7268314Don Cummings        NULLtt4613692,tt0042078 │\n│ nm7286675Svitlana BanschukovaNULLtt4636896           │\n│ nm7287050Glenn McCready      NULLtt4637318           │\n│                    │\n└───────────┴──────────────────────┴────────────────────┴─────────────────────┘\n
\n```\n:::\n:::\n\n\n### Splitting strings into arrays\n\nWe can see that `known_for_titles` looks sort of like an array, so let's call\nthe\n[`split`](../../reference/expression-strings.qmd#ibis.expr.types.strings.StringValue.split)\nmethod on that column and replace the existing column:\n\n::: {#1763a10e .cell execution_count=4}\n``` {.python .cell-code}\nents = ents.mutate(known_for_titles=_.known_for_titles.split(\",\"))\nents\n```\n\n::: {.cell-output .cell-output-display execution_count=4}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name          primary_profession  known_for_titles           ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringarray<string>              │\n├───────────┼──────────────────────┼────────────────────┼────────────────────────────┤\n│ nm7200466Sam Townsend        NULL[]                         │\n│ nm7222639Marc Goula          NULL['tt3185588']              │\n│ nm7236451Charlie Furusho     NULL['tt4548374']              │\n│ nm7245943Cynthia Llanes      NULL[]                         │\n│ nm7252258Lance Hamner        NULL['tt0247882']              │\n│ nm7254706Paloma White        NULL[]                         │\n│ nm7256968Bart den Hartigh    NULL['tt3947934']              │\n│ nm7268314Don Cummings        NULL['tt4613692', 'tt0042078'] │\n│ nm7286675Svitlana BanschukovaNULL['tt4636896']              │\n│ nm7287050Glenn McCready      NULL['tt4637318']              │\n│                           │\n└───────────┴──────────────────────┴────────────────────┴────────────────────────────┘\n
\n```\n:::\n:::\n\n\nSimilarly for `primary_profession`, since people involved in show business often\nhave more than one responsibility on a project:\n\n::: {#398f73c9 .cell execution_count=5}\n``` {.python .cell-code}\nents = ents.mutate(primary_profession=_.primary_profession.split(\",\"))\n```\n:::\n\n\n### Array length\n\nLet's see how many titles each entity is known for, and then show the five\npeople with the largest number of titles they're known for:\n\nThis is computed using the\n[`length`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.length)\nAPI on array expressions:\n\n::: {#315e2b5c .cell execution_count=6}\n``` {.python .cell-code}\n(\n ents.select(\"primary_name\", num_titles=_.known_for_titles.length())\n .order_by(_.num_titles.desc())\n .limit(5)\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=6}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ primary_name      num_titles ┃\n┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ stringint64      │\n├──────────────────┼────────────┤\n│ Marc Mayer      5 │\n│ Alex Koenigsmark5 │\n│ Sally Sun       5 │\n│ Carrie Schnelker5 │\n│ Henry Townsend  5 │\n└──────────────────┴────────────┘\n
\n```\n:::\n:::\n\n\nIt seems like the length of the `known_for_titles` might be capped at five!\n\n### Index\n\nWe can see the position of `\"actor\"` in `primary_profession`s:\n\n::: {#8f915d17 .cell execution_count=7}\n``` {.python .cell-code}\nents.primary_profession.index(\"actor\")\n```\n\n::: {.cell-output .cell-output-display execution_count=7}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actor') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64                                      │\n├────────────────────────────────────────────┤\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                           │\n└────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\nA return value of `-1` indicates that `\"actor\"` is not present in the value:\n\nLet's look for entities that are not primarily actors:\n\nWe can do this using the\n[`index`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.index)\nmethod by checking whether the position of the string `\"actor\"` is greater than\nzero:\n\n::: {#4335351c .cell execution_count=8}\n``` {.python .cell-code}\nactor_index = ents.primary_profession.index(\"actor\")\nnot_primarily_actors = actor_index > 0\nnot_primarily_actors.mean() # <1>\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=8}\n\n::: {.ansi-escaped-output}\n```{=html}\n
0.01947497010287973
\n```\n:::\n\n:::\n:::\n\n\n1. The average of a `bool` column gives the percentage of `True` values\n\nWho are they?\n\n::: {#4bf604e5 .cell execution_count=9}\n``` {.python .cell-code}\nents[not_primarily_actors]\n```\n\n::: {.cell-output .cell-output-display execution_count=9}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst      primary_name             primary_profession   known_for_titles                    ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                       │\n├────────────┼─────────────────────────┼─────────────────────┼─────────────────────────────────────┤\n│ nm14670573Jamie Young            ['legal', 'actor']['tt27216887']                      │\n│ nm2320563 Miguel Eraso           ['editor', 'actor']['tt1820556', 'tt0823256']          │\n│ nm6050288 Kyle Springford        ['editor', 'actor']['tt3260540', 'tt4353988', ... +1]  │\n│ nm8606771 Edward Wu              ['editor', 'actor']['tt0259354', 'tt4219258']          │\n│ nm8159690 Arash Maleki           ['editor', 'actor']['tt14888266', 'tt5783616', ... +1] │\n│ nm3700713 Wendell Holland        ['editor', 'actor']['tt11546754', 'tt1554553', ... +1] │\n│ nm6531583 Tomás Díez-Kith Atienza['editor', 'actor']['tt3171042', 'tt3749248']          │\n│ nm2456342 Ed Cheesman            ['editor', 'actor']['tt13918214', 'tt9598592', ... +1] │\n│ nm0396397 Thomas Houg            ['editor', 'actor']['tt0093176', 'tt13339954', ... +1] │\n│ nm2171019 Larry Pena             ['editor', 'actor']['tt0831320', 'tt0800017', ... +2]  │\n│                                    │\n└────────────┴─────────────────────────┴─────────────────────┴─────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\nIt's not 100% clear whether the order of elements in `primary_profession` matters here.\n\n### Containment\n\nWe can get people who are **not** actors using `contains`:\n\n::: {#510dc366 .cell execution_count=10}\n``` {.python .cell-code}\nnon_actors = ents[~ents.primary_profession.contains(\"actor\")]\nnon_actors\n```\n\n::: {.cell-output .cell-output-display execution_count=10}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst      primary_name       primary_profession  known_for_titles ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>    │\n├────────────┼───────────────────┼────────────────────┼──────────────────┤\n│ nm13331027Robert Allen     ['legal'][]               │\n│ nm11516366Barney Given     ['legal'][]               │\n│ nm7841847 Natalia Utrera   ['legal'][]               │\n│ nm14658368Amber Payne      ['legal'][]               │\n│ nm15199944Melanie Tomanov  ['legal'][]               │\n│ nm11529563David Lazarus    ['legal'][]               │\n│ nm12224896Andrew Winston   ['legal'][]               │\n│ nm7591008 Miles Metcoff    ['legal'][]               │\n│ nm11355058Sameer Oberoi    ['legal'][]               │\n│ nm15069831Skyler R. Peacock['legal'][]               │\n│                 │\n└────────────┴───────────────────┴────────────────────┴──────────────────┘\n
\n```\n:::\n:::\n\n\n### Element removal\n\nWe can remove elements from arrays too.\n\n::: {.callout-note}\n## `remove()` does not mutate the underlying data\n:::\n\nLet's see who only has \"actor\" in the list of their primary professions:\n\n::: {#261a5744 .cell execution_count=11}\n``` {.python .cell-code}\nents.filter(\n [\n _.primary_profession.length() > 0,\n _.primary_profession.remove(\"actor\").length() == 0,\n ]\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name           primary_profession  known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                      │\n├───────────┼───────────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm7198867Danny Brown          ['actor']['tt4532788']                      │\n│ nm7199329James Wyatt Fairbanks['actor']['tt4494580']                      │\n│ nm7201687Tony Jelen           ['actor']['tt2043887']                      │\n│ nm7203397Christian Petrucci   ['actor']['tt4537722']                      │\n│ nm7205107Pablo Schollaert     ['actor']['tt4539222']                      │\n│ nm7207724Shigeru Jerry Endo   ['actor']['tt0043590']                      │\n│ nm7209610Paul Tugwell         ['actor']['tt6185666', 'tt4544182']         │\n│ nm7213017Pancho               ['actor']['tt2333598']                      │\n│ nm7228531Phillip Shinn        ['actor']['tt1442462', 'tt2741602', ... +2] │\n│ nm7236342José María Martínez  ['actor']['tt2244891']                      │\n│                                   │\n└───────────┴───────────────────────┴────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n### Slicing with square-bracket syntax\n\nLet's remove everyone's first profession from the list, but only if they have\nmore than one profession listed:\n\n::: {#c1137788 .cell execution_count=12}\n``` {.python .cell-code}\nents[_.primary_profession.length() > 1].mutate(\n primary_profession=_.primary_profession[1:],\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=12}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst      primary_name     primary_profession  known_for_titles                     ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                        │\n├────────────┼─────────────────┼────────────────────┼──────────────────────────────────────┤\n│ nm0520425 Doug Lord      ['legal']['tt0086461']                        │\n│ nm7198767 Martin Harry   ['legal']['tt5554916']                        │\n│ nm2232471 Lee Thomas     ['legal']['tt0236124']                        │\n│ nm5500775 Stewart Hayes  ['legal']['tt2671192']                        │\n│ nm2653478 Aaron Rosenberg['actor']['tt4218260']                        │\n│ nm0701436 Dominic Pye    ['editor']['tt27329996', 'tt0195619']          │\n│ nm12705514Okpata Henry   ['editor']['tt28450328', 'tt15170142', ... +1] │\n│ nm8313644 Jeff Landers   ['editor']['tt0488302']                        │\n│ nm0438282 Joshua Kaplan  ['editor']['tt0110687', 'tt0329600', ... +2]   │\n│ nm2803821 Glen Ring      ['editor']['tt1579300', 'tt1126489', ... +2]   │\n│                                     │\n└────────────┴─────────────────┴────────────────────┴──────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## Set operations and sorting\n\nTreating arrays as sets is possible with the\n[`union`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.union)\nand\n[`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect)\nAPIs.\n\n### Union\n\n### Intersection\n\nLet's see if we can use array intersection to figure which actors share\nknown-for titles and sort the result:\n\n::: {#b4e2a96d .cell execution_count=13}\n``` {.python .cell-code}\nleft = ents.filter(_.known_for_titles.length() > 0).limit(10_000)\nright = left.view()\nshared_titles = (\n left\n .join(right, left.nconst != right.nconst)\n .select(\n s.startswith(\"known_for_titles\"),\n left_name=\"primary_name\",\n right_name=\"primary_name_right\",\n )\n .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)\n .group_by(name=\"left_name\")\n .agg(together_with=_.right_name.collect())\n .mutate(together_with=_.together_with.unique().sort())\n)\nshared_titles\n```\n\n::: {.cell-output .cell-output-display execution_count=13}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ name               together_with                                     ┃\n┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringarray<string>                                     │\n├───────────────────┼───────────────────────────────────────────────────┤\n│ Sandra Murphy    ['Andrew Brisk', 'Angus McLaren', ... +41]        │\n│ Espera           ['Avena Campbell', 'Brenda James', ... +11]       │\n│ Mamta Jajoo      ['Barbara Buls', 'Bill Smoler', ... +72]          │\n│ Charles Ellis    ['Cherri Moore', 'Dennis Montano', ... +11]       │\n│ Chris Nicholus   ['Catherine Harrell', 'George Pounders', ... +11] │\n│ Paul Dembling    ['Barbara Buls', 'Bill Smoler', ... +72]          │\n│ Dnyaneshwar Mulay['Avena Campbell', 'Brenda James', ... +11]       │\n│ Daisy Boria      ['Barbara Buls', 'Bill Smoler', ... +72]          │\n│ Brandon Staley   ['Beacon Light', 'Ben Emanuel', ... +48]          │\n│ Dwayne Carter Jr.['Bill Collis', 'Charlie Jones', ... +21]         │\n│                                                  │\n└───────────────────┴───────────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## Advanced operations\n\n### `unnest`\n\nAs of version 7.0.0 Ibis does not support its native `unnest` API for BigQuery,\nbut we plan to add it in the future.\n\nFor now, you can use `con.sql` to construct an Ibis expression from a BigQuery\nSQL string that contains `UNNEST` calls:\n\nDespite lack of native `UNNEST` support, many use cases for `UNNEST` are met by\nthe\n[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter)\nand\n[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map)\noperations on array expressions.\n\n### Filtering array elements\n\nShow all people who are neither editors nor actors:\n\n::: {#331a9699 .cell execution_count=14}\n``` {.python .cell-code}\nents.mutate(\n primary_profession=_.primary_profession.filter(\n lambda pp: pp.isin((\"actor\", \"editor\"))\n )\n).filter(_.primary_profession.length() > 0)\n```\n\n::: {.cell-output .cell-output-display execution_count=14}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name           primary_profession  known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                      │\n├───────────┼───────────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm7198867Danny Brown          ['actor']['tt4532788']                      │\n│ nm7199329James Wyatt Fairbanks['actor']['tt4494580']                      │\n│ nm7201687Tony Jelen           ['actor']['tt2043887']                      │\n│ nm7203397Christian Petrucci   ['actor']['tt4537722']                      │\n│ nm7205107Pablo Schollaert     ['actor']['tt4539222']                      │\n│ nm7207724Shigeru Jerry Endo   ['actor']['tt0043590']                      │\n│ nm7209610Paul Tugwell         ['actor']['tt6185666', 'tt4544182']         │\n│ nm7213017Pancho               ['actor']['tt2333598']                      │\n│ nm7228531Phillip Shinn        ['actor']['tt1442462', 'tt2741602', ... +2] │\n│ nm7236342José María Martínez  ['actor']['tt2244891']                      │\n│                                   │\n└───────────┴───────────────────────┴────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n### Applying a function to array elements\n\nLet's normalize the case of primary_profession to upper case:\n\n::: {#1dd5c0b8 .cell execution_count=15}\n``` {.python .cell-code}\nents.mutate(\n primary_profession=_.primary_profession.map(lambda pp: pp.upper())\n).filter(_.primary_profession.length() > 0)\n```\n\n::: {.cell-output .cell-output-display execution_count=15}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name           primary_profession  known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                      │\n├───────────┼───────────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm7198867Danny Brown          ['ACTOR']['tt4532788']                      │\n│ nm7199329James Wyatt Fairbanks['ACTOR']['tt4494580']                      │\n│ nm7201687Tony Jelen           ['ACTOR']['tt2043887']                      │\n│ nm7203397Christian Petrucci   ['ACTOR']['tt4537722']                      │\n│ nm7205107Pablo Schollaert     ['ACTOR']['tt4539222']                      │\n│ nm7207724Shigeru Jerry Endo   ['ACTOR']['tt0043590']                      │\n│ nm7209610Paul Tugwell         ['ACTOR']['tt6185666', 'tt4544182']         │\n│ nm7213017Pancho               ['ACTOR']['tt2333598']                      │\n│ nm7228531Phillip Shinn        ['ACTOR']['tt1442462', 'tt2741602', ... +2] │\n│ nm7236342José María Martínez  ['ACTOR']['tt2244891']                      │\n│                                   │\n└───────────┴───────────────────────┴────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## Conclusion\n\nIbis has a sizable collection of array APIs that work with many different\nbackends and as of version 7.0.0, Ibis supports a much larger set of those APIs\nfor BigQuery!\n\nCheck out [the API\ndocumentation](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue)\nfor the full set of available methods.\n\nTry it out, and let us know what you think.\n\n", + "markdown": "---\ntitle: Working with arrays in Google BigQuery\nauthor: \"Phillip Cloud\"\ndate: \"2023-09-12\"\ncategories:\n - blog\n - bigquery\n - arrays\n - cloud\n---\n\n## Introduction\n\nIbis and BigQuery have [worked well together for years](https://cloud.google.com/blog/products/data-analytics/ibis-and-bigquery-scalable-analytics-comfort-python).\n\nIn Ibis 7.0.0, they work even better together with the addition of array\nfunctionality for BigQuery.\n\nLet's look at some examples using BigQuery's [IMDB sample\ndata](https://developer.imdb.com/non-commercial-datasets/).\n\n## Basics\n\nFirst we'll connect to BigQuery and pluck out a table to work with.\n\nWe'll start with `from ibis.interactive import *` for maximum convenience.\n\n::: {#b789e16f .cell execution_count=1}\n``` {.python .cell-code}\nfrom ibis.interactive import * # <1>\n\ncon = ibis.connect(\"bigquery://ibis-gbq\") # <2>\ncon.set_database(\"bigquery-public-data.imdb\") # <3>\n```\n:::\n\n\n1. `from ibis.interactive import *` imports Ibis APIs into the global namespace\n and enables [interactive mode](../../how-to/configure/basics.qmd#interactive-mode).\n2. Connect to Google BigQuery. Compute (but not storage) is billed to the\n project you connect to--`ibis-gbq` in this case.\n3. 
Set the database to the project and dataset that we will use for analysis.\n\nLet's look at the tables in this dataset:\n\n::: {#df3b7c1a .cell execution_count=2}\n``` {.python .cell-code}\ncon.tables\n```\n\n::: {.cell-output .cell-output-display execution_count=157}\n```\nTables\n------\n- name_basics\n- reviews\n- title_akas\n- title_basics\n- title_crew\n- title_episode\n- title_principals\n- title_ratings\n```\n:::\n:::\n\n\nLet's pull out the `name_basics` table, which contains names and metadata about\npeople listed on IMDB. We'll call this `ents` (short for `entities`), and remove some\ncolumns we won't need:\n\n::: {#52171a11 .cell execution_count=3}\n``` {.python .cell-code}\nents = con.tables.name_basics.drop(\"birth_year\", \"death_year\")\nents\n```\n\n::: {.cell-output .cell-output-display execution_count=158}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name       primary_profession  known_for_titles    ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringstring              │\n├───────────┼───────────────────┼────────────────────┼─────────────────────┤\n│ nm7195872Amanda Goetz     NULLtt4529508           │\n│ nm7204100Overload         NULLtt4828308,tt4538296 │\n│ nm7206569Carl Winter      NULLNULL                │\n│ nm7208626Doug Goodin      NULLNULL                │\n│ nm7222505Rickard Finndahl NULLtt4519546           │\n│ nm7226759Kenneth Bell     NULLtt3545908           │\n│ nm7227158Savannah Gardner NULLtt4028790           │\n│ nm7246216Elisabeth HofmannNULLtt4586074           │\n│ nm7253303Wisda Febriyanti NULLtt4594232           │\n│ nm7255948Charles Myers    NULLtt2396758           │\n│                    │\n└───────────┴───────────────────┴────────────────────┴─────────────────────┘\n
\n```\n:::\n:::\n\n\n### Splitting strings into arrays\n\nWe can see that `known_for_titles` looks sort of like an array, so let's call\nthe\n[`split`](../../reference/expression-strings.qmd#ibis.expr.types.strings.StringValue.split)\nmethod on that column and replace the existing column:\n\n::: {#269b9de2 .cell execution_count=4}\n``` {.python .cell-code}\nents = ents.mutate(known_for_titles=_.known_for_titles.split(\",\"))\nents\n```\n\n::: {.cell-output .cell-output-display execution_count=159}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name       primary_profession  known_for_titles           ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringarray<string>              │\n├───────────┼───────────────────┼────────────────────┼────────────────────────────┤\n│ nm7195872Amanda Goetz     NULL['tt4529508']              │\n│ nm7204100Overload         NULL['tt4828308', 'tt4538296'] │\n│ nm7206569Carl Winter      NULL[]                         │\n│ nm7208626Doug Goodin      NULL[]                         │\n│ nm7222505Rickard Finndahl NULL['tt4519546']              │\n│ nm7226759Kenneth Bell     NULL['tt3545908']              │\n│ nm7227158Savannah Gardner NULL['tt4028790']              │\n│ nm7246216Elisabeth HofmannNULL['tt4586074']              │\n│ nm7253303Wisda Febriyanti NULL['tt4594232']              │\n│ nm7255948Charles Myers    NULL['tt2396758']              │\n│                           │\n└───────────┴───────────────────┴────────────────────┴────────────────────────────┘\n
\n```\n:::\n:::\n\n\nSimilarly for `primary_profession`, since people involved in show business often\nhave more than one responsibility on a project:\n\n::: {#4afaa049 .cell execution_count=5}\n``` {.python .cell-code}\nents = ents.mutate(primary_profession=_.primary_profession.split(\",\"))\n```\n:::\n\n\n### Array length\n\nLet's see how many titles each entity is known for, and then show the five\npeople with the largest number of titles they're known for:\n\nThis is computed using the\n[`length`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.length)\nAPI on array expressions:\n\n::: {#045dab82 .cell execution_count=6}\n``` {.python .cell-code}\n(\n ents.select(\"primary_name\", num_titles=_.known_for_titles.length())\n .order_by(_.num_titles.desc())\n .limit(5)\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=161}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ primary_name      num_titles ┃\n┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ stringint64      │\n├──────────────────┼────────────┤\n│ Sally Sun       5 │\n│ Matthew Kavuma  5 │\n│ Henry Townsend  5 │\n│ Alex Koenigsmark5 │\n│ Carrie Schnelker5 │\n└──────────────────┴────────────┘\n
\n```\n:::\n:::\n\n\nIt seems like the length of the `known_for_titles` might be capped at five!\n\n### Index\n\nWe can see the position of `\"actor\"` in `primary_profession`s:\n\n::: {#e33090eb .cell execution_count=7}\n``` {.python .cell-code}\nents.primary_profession.index(\"actor\")\n```\n\n::: {.cell-output .cell-output-display execution_count=162}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actor') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64                                      │\n├────────────────────────────────────────────┤\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                           │\n└────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\nA return value of `-1` indicates that `\"actor\"` is not present in the array.\n\nLet's look for entities that are not primarily actors, that is, people who list `\"actor\"` but not as their first profession.\n\nWe can do this using the\n[`index`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.index)\nmethod. Positions are zero-based, so we check whether the position of the string `\"actor\"` is greater than\nzero:\n\n::: {#d1804a34 .cell execution_count=8}\n``` {.python .cell-code}\nactor_index = ents.primary_profession.index(\"actor\")\nnot_primarily_actors = actor_index > 0\nnot_primarily_actors.mean() # <1>\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=163}\n\n::: {.ansi-escaped-output}\n```{=html}\n
0.019474437073314168
\n```\n:::\n\n:::\n:::\n\n\n1. The average of a `bool` column gives the fraction of `True` values (about 1.9% here).\n\nWho are they?\n\n::: {#4d48ad50 .cell execution_count=9}\n``` {.python .cell-code}\nents[not_primarily_actors]\n```\n\n::: {.cell-output .cell-output-display execution_count=164}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst      primary_name        primary_profession   known_for_titles                    ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                       │\n├────────────┼────────────────────┼─────────────────────┼─────────────────────────────────────┤\n│ nm2231782 Rene Tovar        ['legal', 'actor']['tt21996928']                      │\n│ nm2250015 Stephen Clark     ['legal', 'actor']['tt1452628', 'tt14372154', ... +2] │\n│ nm0352162 Brett Haber       ['legal', 'actor']['tt1720280', 'tt10928526', ... +2] │\n│ nm12169237Endi Ndini        ['editor', 'actor']['tt4557810', 'tt14137514', ... +2] │\n│ nm14475156Colby White       ['editor', 'actor']['tt26313337']                      │\n│ nm8979480 Bartosz Strusewicz['editor', 'actor']['tt6685946', 'tt7334964', ... +2]  │\n│ nm3116354 Robert Marquis    ['editor', 'actor']['tt8376014', 'tt1283513']          │\n│ nm9962305 Vino Domi         ['editor', 'actor']['tt8680966', 'tt8669176']          │\n│ nm10346617Lucas Oliveira    ['editor', 'actor']['tt9483226', 'tt7216954', ... +1]  │\n│ nm7206820 Prince Sethi      ['editor', 'actor']['tt14396686', 'tt4219300']         │\n│                                    │\n└────────────┴────────────────────┴─────────────────────┴─────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\nIt's not 100% clear whether the order of elements in `primary_profession` matters here.\n\n### Containment\n\nWe can get people who are **not** actors using `contains`:\n\n::: {#327da073 .cell execution_count=10}\n``` {.python .cell-code}\nnon_actors = ents[~ents.primary_profession.contains(\"actor\")]\nnon_actors\n```\n\n::: {.cell-output .cell-output-display execution_count=165}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst      primary_name           primary_profession  known_for_titles ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>    │\n├────────────┼───────────────────────┼────────────────────┼──────────────────┤\n│ nm13613518Silvia Vannini       ['legal'][]               │\n│ nm11482673Umit Yildirim        ['legal'][]               │\n│ nm14796117Kendall Jackson      ['legal'][]               │\n│ nm3922637 Michael J. Douglas   ['legal'][]               │\n│ nm5249145 Christopher Addy     ['legal'][]               │\n│ nm9235293 Baolu Lan            ['legal'][]               │\n│ nm14560328Jean Paul S Voilleque['legal'][]               │\n│ nm11250663Kelly D. Shapiro     ['legal'][]               │\n│ nm11355058Sameer Oberoi        ['legal'][]               │\n│ nm8655635 James Madison        ['legal'][]               │\n│                 │\n└────────────┴───────────────────────┴────────────────────┴──────────────────┘\n
\n```\n:::\n:::\n\n\n### Element removal\n\nWe can remove elements from arrays too.\n\n::: {.callout-note}\n## [`remove()`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.remove) does not mutate the underlying data\n:::\n\nLet's see who only has \"actor\" in the list of their primary professions:\n\n::: {#11189ac7 .cell execution_count=11}\n``` {.python .cell-code}\nents.filter(\n [\n _.primary_profession.length() > 0,\n _.primary_profession.remove(\"actor\").length() == 0,\n ]\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=166}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name        primary_profession  known_for_titles                    ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                       │\n├───────────┼────────────────────┼────────────────────┼─────────────────────────────────────┤\n│ nm7217990Jay Isbell        ['actor']['tt4450682']                       │\n│ nm7218053Eric Crowell      ['actor']['tt4500196']                       │\n│ nm7218081John Wyman        ['actor']['tt4500196']                       │\n│ nm7223556Daniel Hope       ['actor']['tt4558584', 'tt9089514']          │\n│ nm7223623Marcus Troy       ['actor']['tt0120660', 'tt10914400', ... +2] │\n│ nm7241836Havár Csongor     ['actor']['tt4580414']                       │\n│ nm7242608Seigô Uetaki      ['actor']['tt4581192']                       │\n│ nm7245253Mahmoud El Faituri['actor']['tt2849138']                       │\n│ nm7254729Tom Keesey        ['actor']['tt0924155', 'tt0924156', ... +2]  │\n│ nm7280985Gabriel Garcia    ['actor']['tt4629714']                       │\n│                                    │\n└───────────┴────────────────────┴────────────────────┴─────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n### Slicing with square-bracket syntax\n\nLet's remove everyone's first profession from the list, but only if they have\nmore than one profession listed:\n\n::: {#363b1821 .cell execution_count=12}\n``` {.python .cell-code}\nents[_.primary_profession.length() > 1].mutate(\n primary_profession=_.primary_profession[1:],\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=167}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst      primary_name            primary_profession  known_for_titles           ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>              │\n├────────────┼────────────────────────┼────────────────────┼────────────────────────────┤\n│ nm3146692 Keith Sutton          ['legal']['tt0472984']              │\n│ nm2974992 Dávid Farkas          ['legal']['tt0114301']              │\n│ nm6819046 Walter Batt           ['legal']['tt8592196']              │\n│ nm2231544 Don Steele            ['actor']['tt0818746']              │\n│ nm7267327 Navarro Gray          ['actor']['tt1718437', 'tt7945012'] │\n│ nm7783929 Christopher T. Connell['editor']['tt7510258', 'tt5262988'] │\n│ nm6894982 Jason Robert Moore    ['editor']['tt4177962']              │\n│ nm10526935Damien Mota           ['editor']['tt9889740']              │\n│ nm7151380 Chris Villa           ['editor']['tt4479468']              │\n│ nm7641135 Curt Champagne        ['editor']['tt5097098']              │\n│                           │\n└────────────┴────────────────────────┴────────────────────┴────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## Set operations and sorting\n\nTreating arrays as sets is possible with the\n[`union`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.union)\nand\n[`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect)\nAPIs.\n\nLet's take a look at `intersect`:\n\n### Intersection\n\nLet's see if we can use array intersection to figure out which actors share\nknown-for titles and sort the result:\n\n::: {#11583fdc .cell execution_count=13}\n``` {.python .cell-code}\nleft = ents.filter(_.known_for_titles.length() > 0).limit(10_000)\nright = left.view()\nshared_titles = (\n left\n .join(right, left.nconst != right.nconst)\n .select(\n s.startswith(\"known_for_titles\"),\n left_name=\"primary_name\",\n right_name=\"primary_name_right\",\n )\n .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)\n .group_by(name=\"left_name\")\n .agg(together_with=_.right_name.collect())\n .mutate(together_with=_.together_with.unique().sort())\n)\nshared_titles\n```\n\n::: {.cell-output .cell-output-display execution_count=168}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ name                   together_with                                       ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringarray<string>                                       │\n├───────────────────────┼─────────────────────────────────────────────────────┤\n│ Chief Willie Sellars ['Amy Tan', 'Bruce Williams', ... +13]              │\n│ Rainer Spanagel      ['Andra Arnicane', 'Aziz Sheikh', ... +9]           │\n│ Sam Cooper           ['Amy Tan', 'Bruce Williams', ... +10]              │\n│ Michelle Porter      ['Amy Tan', 'Bruce Williams', ... +8]               │\n│ Bryan Carter         ['Andrew Bower', 'Austin Murtha', ... +11]          │\n│ Timothy Johnson      ['Betsy Blaney', 'Brendan Halko', ... +19]          │\n│ Colin McLean         ['Alison Garnham', 'Ashley Fox', ... +20]           │\n│ Jessica Williams     ['Austin Williams']                                 │\n│ Leighann Falcon      ['Aaron Green', 'Alejandro Garza y Garza', ... +71] │\n│ Oldham Tuneless Choir['Alex Matvienko', 'Andy McDonald', ... +38]        │\n│                                                    │\n└───────────────────────┴─────────────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## Advanced operations\n\n### Flatten arrays into rows\n\nThanks to the [tireless\nefforts](https://github.com/tobymao/sqlglot/commit/06e0869e7aa5714d77e6ec763da38d6a422965fa)\nof the [folks](https://github.com/tobymao/sqlglot/graphs/contributors) working\non [`sqlglot`](https://github.com/tobymao/sqlglot), as of version 7.0.0 Ibis\nsupports\n[`unnest`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.unnest)\nfor BigQuery!\n\nYou can use it standalone on a column expression:\n\n::: {#e28c4519 .cell execution_count=14}\n``` {.python .cell-code}\nents.primary_profession.unnest()\n```\n\n::: {.cell-output .cell-output-display execution_count=169}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_profession ┃\n┡━━━━━━━━━━━━━━━━━━━━┩\n│ string             │\n├────────────────────┤\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│                   │\n└────────────────────┘\n
\n```\n:::\n:::\n\n\nYou can also use it in `select`/`mutate` calls to expand the table accordingly:\n\n::: {#99e2daa3 .cell execution_count=15}\n``` {.python .cell-code}\nents.mutate(primary_profession=_.primary_profession.unnest())\n```\n\n::: {.cell-output .cell-output-display execution_count=170}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name        primary_profession  known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringarray<string>    │\n├───────────┼────────────────────┼────────────────────┼──────────────────┤\n│ nm7211030Josh Berry        actor             ['tt4046896']    │\n│ nm7211205Alan Douglas      actor             ['tt0038449']    │\n│ nm7213536Wilson Recalde    actor             ['tt2333598']    │\n│ nm7214355Julian Owen       actor             ['tt3488298']    │\n│ nm7215983Zach Ladouceur    actor             ['tt4546288']    │\n│ nm7221941Alain Milani      actor             ['tt4548654']    │\n│ nm7225536Gary Flanzer      actor             ['tt4521030']    │\n│ nm7236543Bastiaan Schreuderactor             ['tt4506254']    │\n│ nm7241255Jared Young       actor             ['tt4579992']    │\n│ nm7241835Fagyas Alex       actor             ['tt4580414']    │\n│                 │\n└───────────┴────────────────────┴────────────────────┴──────────────────┘\n
\n```\n:::\n:::\n\n\nUnnesting can be useful when joining nested data.\n\nHere we use unnest to find people known for any of the Godfather movies:\n\n::: {#d3d9d571 .cell execution_count=16}\n``` {.python .cell-code}\nbasics = con.tables.title_basics.filter( # <1>\n [\n _.title_type == \"movie\",\n _.original_title.lower().startswith(\"the godfather\"),\n _.genres.lower().contains(\"crime\"),\n ]\n) # <1>\n\nknown_for_the_godfather = (\n ents.mutate(tconst=_.known_for_titles.unnest()) # <2>\n .join(basics, \"tconst\") # <3>\n .select(\"primary_title\", \"primary_name\") # <4>\n .distinct()\n .order_by([\"primary_title\", \"primary_name\"]) # <4>\n)\nknown_for_the_godfather\n```\n\n::: {.cell-output .cell-output-display execution_count=171}\n```{=html}\n
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title  primary_name        ┃\n┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstring              │\n├───────────────┼─────────────────────┤\n│ The GodfatherA. Emmett Adams     │\n│ The GodfatherAbe Vigoda          │\n│ The GodfatherAl Lettieri         │\n│ The GodfatherAl Martino          │\n│ The GodfatherAl Pacino           │\n│ The GodfatherAlbert S. Ruddy     │\n│ The GodfatherAlex Rocco          │\n│ The GodfatherAndrea Eastman      │\n│ The GodfatherAngelo Infanti      │\n│ The GodfatherAnna Hill Johnstone │\n│                    │\n└───────────────┴─────────────────────┘\n
\n```\n:::\n:::\n\n\n1. Filter the `title_basics` data set to only the Godfather movies\n2. Unnest the `known_for_titles` array column\n3. Join with `basics` to get movie titles\n4. Ensure that each entity is only listed once and sort the results\n\nLet's summarize by showing how many people are known for each Godfather movie:\n\n::: {#7a6153fd .cell execution_count=17}\n``` {.python .cell-code}\nknown_for_the_godfather.primary_title.value_counts()\n```\n\n::: {.cell-output .cell-output-display execution_count=172}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title           primary_title_count ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ stringint64               │\n├────────────────────────┼─────────────────────┤\n│ The Godfather Part III196 │\n│ The Godfather Part II 117 │\n│ The Godfather         96 │\n└────────────────────────┴─────────────────────┘\n
\n```\n:::\n:::\n\n\n### Filtering array elements\n\nFiltering array elements can be done with the\n[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter)\nmethod, which applies a predicate to each array element and returns an array of\nelements for which the predicate returns `True`.\n\nThis method is similar to Python's\n[`filter`](https://docs.python.org/3.7/library/functions.html#filter) function.\n\nLet's show all people who are neither editors nor actors:\n\n::: {#888e227f .cell execution_count=18}\n``` {.python .cell-code}\nents.mutate(\n primary_profession=_.primary_profession.filter( # <1>\n lambda pp: ~pp.isin((\"actor\", \"editor\"))\n )\n).filter(_.primary_profession.length() > 0) # <2>\n```\n\n::: {.cell-output .cell-output-display execution_count=173}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst      primary_name        primary_profession  known_for_titles ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>    │\n├────────────┼────────────────────┼────────────────────┼──────────────────┤\n│ nm14701100Karin Roach       ['legal'][]               │\n│ nm3709802 Kristin L. Holland['legal'][]               │\n│ nm13336378Rok Salazar       ['legal'][]               │\n│ nm7514782 Christopher Spicer['legal'][]               │\n│ nm11531194J Manuel          ['legal'][]               │\n│ nm9114713 Huy Nguyen        ['legal'][]               │\n│ nm2230248 Jeffrey Galen     ['legal'][]               │\n│ nm8479496 Fatima Amgane     ['legal'][]               │\n│ nm2229345 Harold Brown      ['legal'][]               │\n│ nm7383201 Ashley Silver     ['legal'][]               │\n│                 │\n└────────────┴────────────────────┴────────────────────┴──────────────────┘\n
\n```\n:::\n:::\n\n\n1. This `filter` call is applied to each array element\n2. This `filter` call is applied to the table\n\n### Applying a function to array elements\n\nYou can apply a function to run an Ibis expression on each element of an array\nusing the\n[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map)\nmethod.\n\nLet's normalize the case of primary_profession to upper case:\n\n::: {#b979917c .cell execution_count=19}\n``` {.python .cell-code}\nents.mutate(\n primary_profession=_.primary_profession.map(lambda pp: pp.upper())\n).filter(_.primary_profession.length() > 0)\n```\n\n::: {.cell-output .cell-output-display execution_count=174}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name       primary_profession  known_for_titles                    ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                       │\n├───────────┼───────────────────┼────────────────────┼─────────────────────────────────────┤\n│ nm7199328Renzo Castro     ['ACTOR']['tt4623856', 'tt4494580', ... +1]  │\n│ nm7199362Pankaj           ['ACTOR']['tt4367318']                       │\n│ nm7200119Thibault Péan    ['ACTOR']['tt4534250']                       │\n│ nm7203213Tim Goodman      ['ACTOR']['tt2234701']                       │\n│ nm7207130Ruupertti Arponen['ACTOR']['tt0185819', 'tt10628202']         │\n│ nm7223822Federico Carghini['ACTOR']['tt1556087']                       │\n│ nm7232704Ned Jackson      ['ACTOR']['tt0407361', 'tt20453990']         │\n│ nm7238980Rasmus Cassanelli['ACTOR']['tt0782510']                       │\n│ nm7240979Changyuan Zhou   ['ACTOR']['tt0311913']                       │\n│ nm7242332James Dolbeare   ['ACTOR']['tt10151048', 'tt5769738', ... +2] │\n│                                    │\n└───────────┴───────────────────┴────────────────────┴─────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## Conclusion\n\nIbis has a sizable collection of array APIs that work with many different\nbackends and as of version 7.0.0, Ibis supports a much larger set of those APIs\nfor BigQuery!\n\nCheck out [the API\ndocumentation](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue)\nfor the full set of available methods.\n\nTry it out, and let us know what you think.\n\n", "supporting": [ "index_files" ], diff --git a/docs/posts/bigquery-arrays/index.qmd b/docs/posts/bigquery-arrays/index.qmd index 919989cc27aa..005e1346bebc 100644 --- a/docs/posts/bigquery-arrays/index.qmd +++ b/docs/posts/bigquery-arrays/index.qmd @@ -26,20 +26,22 @@ First we'll connect to BigQuery and pluck out a table to work with. We'll start with `from ibis.interactive import *` for maximum convenience. ```{python} -from ibis.interactive import * +from ibis.interactive import * # <1> -con = ibis.connect("bigquery://ibis-gbq") # <1> -con.set_database("bigquery-public-data.imdb") # <2> +con = ibis.connect("bigquery://ibis-gbq") # <2> +con.set_database("bigquery-public-data.imdb") # <3> ``` -1. Connect to the **billing** project. Compute (but not storage) is billed to - this project. -2. Set the database to the project and dataset that we will use for analysis. +1. `from ibis.interactive import *` imports Ibis APIs into the global namespace + and enables [interactive mode](../../how-to/configure/basics.qmd#interactive-mode). +2. Connect to Google BigQuery. Compute (but not storage) is billed to the + project you connect to--`ibis-gbq` in this case. +3. Set the database to the project and dataset that we will use for analysis. Let's look at the tables in this dataset: ```{python} -con.list_tables() +con.tables ``` Let's pull out the `name_basics` table, which contains names and metadata about @@ -136,7 +138,7 @@ non_actors We can remove elements from arrays too. 
::: {.callout-note} -## `remove()` does not mutate the underlying data +## [`remove()`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.remove) does not mutate the underlying data ::: Let's see who only has "actor" in the list of their primary professions: @@ -169,7 +171,7 @@ and [`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect) APIs. -### Union +Let's take a look at `intersect`: ### Intersection @@ -197,35 +199,92 @@ shared_titles ## Advanced operations -### `unnest` +### Flatten arrays into rows -As of version 7.0.0 Ibis does not support its native `unnest` API for BigQuery, -but we plan to add it in the future. +Thanks to the [tireless +efforts](https://github.com/tobymao/sqlglot/commit/06e0869e7aa5714d77e6ec763da38d6a422965fa) +of the [folks](https://github.com/tobymao/sqlglot/graphs/contributors) working +on [`sqlglot`](https://github.com/tobymao/sqlglot), as of version 7.0.0 Ibis +supports +[`unnest`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.unnest) +for BigQuery! -For now, you can use `con.sql` to construct an Ibis expression from a BigQuery -SQL string that contains `UNNEST` calls: +You can use it standalone on a column expression: -Despite lack of native `UNNEST` support, many use cases for `UNNEST` are met by -the -[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter) -and -[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map) -operations on array expressions. +```{python} +ents.primary_profession.unnest() +``` + +You can also use it in `select`/`mutate` calls to expand the table accordingly: + +```{python} +ents.mutate(primary_profession=_.primary_profession.unnest()) +``` + +Unnesting can be useful when joining nested data. 
+ +Here we use unnest to find people known for any of the Godfather movies: + +```{python} +basics = con.tables.title_basics.filter( # <1> + [ + _.title_type == "movie", + _.original_title.lower().startswith("the godfather"), + _.genres.lower().contains("crime"), + ] +) # <1> + +known_for_the_godfather = ( + ents.mutate(tconst=_.known_for_titles.unnest()) # <2> + .join(basics, "tconst") # <3> + .select("primary_title", "primary_name") # <4> + .distinct() + .order_by(["primary_title", "primary_name"]) # <4> +) +known_for_the_godfather +``` + +1. Filter the `title_basics` data set to only the Godfather movies +2. Unnest the `known_for_titles` array column +3. Join with `basics` to get movie titles +4. Ensure that each entity is only listed once and sort the results + +Let's summarize by showing how many people are known for each Godfather movie: + +```{python} +known_for_the_godfather.primary_title.value_counts() +``` ### Filtering array elements -Show all people who are neither editors nor actors: +Filtering array elements can be done with the +[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter) +method, which applies a predicate to each array element and returns an array of +elements for which the predicate returns `True`. + +This method is similar to Python's +[`filter`](https://docs.python.org/3.7/library/functions.html#filter) function. + +Let's show all people who are neither editors nor actors: ```{python} ents.mutate( - primary_profession=_.primary_profession.filter( - lambda pp: pp.isin(("actor", "editor")) + primary_profession=_.primary_profession.filter( # <1> + lambda pp: ~pp.isin(("actor", "editor")) ) -).filter(_.primary_profession.length() > 0) +).filter(_.primary_profession.length() > 0) # <2> ``` +1. This `filter` call is applied to each array element +2. 
This `filter` call is applied to the table + ### Applying a function to array elements +You can apply a function to run an Ibis expression on each element of an array +using the +[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map) +method. + Let's normalize the case of primary_profession to upper case: ```{python}