diff --git a/docs/_freeze/posts/unix-backend/index/execute-results/html.json b/docs/_freeze/posts/unix-backend/index/execute-results/html.json new file mode 100644 index 000000000000..2ff9000a5f02 --- /dev/null +++ b/docs/_freeze/posts/unix-backend/index/execute-results/html.json @@ -0,0 +1,16 @@ +{ + "hash": "39940ef53ab2088e51d68a4b09f6509c", + "result": { + "engine": "jupyter", + "markdown": "---\ntitle: \"Scaling to infinity and beyond: the Unix backend\"\nauthor: \"Phillip Cloud\"\ndate: \"2024-04-01\"\ncategories:\n - blog\n - serious\n - web-scale\n - unix\n---\n\n## The Unix backend for Ibis\n\nWe're happy to announce a new Ibis backend built on the world's best known web\nscale technology: Unix pipes.\n\n## Why?\n\nWhy not? Pipes rock and they automatically stream data between operators and\nscale to your hard drive.\n\nWhat's not to love?\n\n## Demo\n\nAll production ready backends ship with amazing demos.\n\nThe Unix backend is no different. Let's see it in action.\n\nFirst we'll install the Unix backend.\n\n```bash\npip install ibish\n```\n\nLike all production-ready libraries `ibish` depends on the latest commit of `ibis-framework`.\n\nNext we'll download some data.\n\n::: {#3d238b30 .cell execution_count=1}\n``` {.python .cell-code}\n!curl -LsS 'https://storage.googleapis.com/ibis-examples/penguins/20240322T125036Z-9aae2/penguins.csv.gz' | zcat > penguins.csv\n```\n:::\n\n\n::: {#4cabee38 .cell execution_count=2}\n``` {.python .cell-code}\nimport ibis\nimport ibish\n\n\nibis.options.interactive = True\n\nunix = ibish.connect({\"p\": \"penguins.csv\"})\n\nt = unix.table(\"p\")\nt\n```\n\n::: {.cell-output .cell-output-display execution_count=2}\n```{=html}\n
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓\n┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃\n┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩\n│ stringstringfloat64float64float64float64stringint64 │\n├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤\n│ Adelie Torgersen39.118.7181.03750.0male  2007 │\n│ Adelie Torgersen39.517.4186.03800.0female2007 │\n│ Adelie Torgersen40.318.0195.03250.0female2007 │\n│ Adelie TorgersenNULLNULLNULLNULLNULL2007 │\n│ Adelie Torgersen36.719.3193.03450.0female2007 │\n│ Adelie Torgersen39.320.6190.03650.0male  2007 │\n│ Adelie Torgersen38.917.8181.03625.0female2007 │\n│ Adelie Torgersen39.219.6195.04675.0male  2007 │\n│ Adelie Torgersen34.118.1193.03475.0NULL2007 │\n│ Adelie Torgersen42.020.2190.04250.0NULL2007 │\n│  │\n└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘\n
\n```\n:::\n:::\n\n\nSweet, huh?\n\nLet's filter the data and look at only the year 2009.\n\n::: {#321eac96 .cell execution_count=3}\n``` {.python .cell-code}\nexpr = t.filter(t.year == 2009)\nexpr\n```\n\n::: {.cell-output .cell-output-display execution_count=3}\n```{=html}\n
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓\n┃ species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃\n┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩\n│ stringstringfloat64float64float64float64stringint64 │\n├─────────┼────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤\n│ Adelie Biscoe35.017.9192.03725.0female2009 │\n│ Adelie Biscoe41.020.0203.04725.0male  2009 │\n│ Adelie Biscoe37.716.0183.03075.0female2009 │\n│ Adelie Biscoe37.820.0190.04250.0male  2009 │\n│ Adelie Biscoe37.918.6193.02925.0female2009 │\n│ Adelie Biscoe39.718.9184.03550.0male  2009 │\n│ Adelie Biscoe38.617.2199.03750.0female2009 │\n│ Adelie Biscoe38.220.0190.03900.0male  2009 │\n│ Adelie Biscoe38.117.0181.03175.0female2009 │\n│ Adelie Biscoe43.219.0197.04775.0male  2009 │\n│  │\n└─────────┴────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘\n
\n```\n:::\n:::\n\n\nWe can sort the result of that too, and filter again.\n\n::: {#efe8d4ef .cell execution_count=4}\n``` {.python .cell-code}\nexpr = (\n expr.order_by(\"species\", ibis.desc(\"bill_length_mm\"))\n .filter(lambda t: t.island == \"Biscoe\")\n)\nexpr\n```\n\n::: {.cell-output .cell-output-display execution_count=4}\n```{=html}\n
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓\n┃ species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃\n┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩\n│ stringstringfloat64float64float64float64stringint64 │\n├─────────┼────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤\n│ Adelie Biscoe45.620.3191.04600.0male  2009 │\n│ Adelie Biscoe43.219.0197.04775.0male  2009 │\n│ Adelie Biscoe42.718.3196.04075.0male  2009 │\n│ Adelie Biscoe42.219.5197.04275.0male  2009 │\n│ Adelie Biscoe41.020.0203.04725.0male  2009 │\n│ Adelie Biscoe39.717.7193.03200.0female2009 │\n│ Adelie Biscoe39.718.9184.03550.0male  2009 │\n│ Adelie Biscoe39.620.7191.03900.0female2009 │\n│ Adelie Biscoe38.617.2199.03750.0female2009 │\n│ Adelie Biscoe38.220.0190.03900.0male  2009 │\n│  │\n└─────────┴────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘\n
\n```\n:::\n:::\n\n\nThere's even support for joins and aggregations!\n\nLet's count the number of island, species pairs and sort descending by the count.\n\n::: {#b67fd46d .cell execution_count=5}\n``` {.python .cell-code}\nexpr = (\n t.group_by(\"island\", \"species\")\n .agg(n=lambda t: t.count())\n .order_by(ibis.desc(\"n\"))\n)\nexpr\n```\n\n::: {.cell-output .cell-output-display execution_count=5}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓\n┃ island     species    n     ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩\n│ stringstringint64 │\n├───────────┼───────────┼───────┤\n│ Biscoe   Gentoo   124 │\n│ Dream    Chinstrap68 │\n│ Dream    Adelie   56 │\n│ TorgersenAdelie   52 │\n│ Biscoe   Adelie   44 │\n└───────────┴───────────┴───────┘\n
\n```\n:::\n:::\n\n\nFor kicks, let's compare that to the DuckDB backend to make sure we're able to count stuff.\n\nTo be extra awesome, we'll *reuse the same expression to do the computation*.\n\n::: {#1412a8fa .cell execution_count=6}\n``` {.python .cell-code}\nddb = ibis.duckdb.connect()\nddb.read_csv(\"penguins.csv\", table_name=\"p\") # <1>\nibis.memtable(ddb.to_pyarrow(expr.unbind()))\n```\n\n::: {.cell-output .cell-output-display execution_count=6}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓\n┃ island     species    n     ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩\n│ stringstringint64 │\n├───────────┼───────────┼───────┤\n│ Biscoe   Gentoo   124 │\n│ Dream    Chinstrap68 │\n│ Dream    Adelie   56 │\n│ TorgersenAdelie   52 │\n│ Biscoe   Adelie   44 │\n└───────────┴───────────┴───────┘\n
\n```\n:::\n:::\n\n\n1. The `read_csv` is necessary so that the expression's table\n name--`p`--matches one inside the DuckDB database.\n\n## How does it work?\n\nGlad you asked!\n\nThe Unix backend for Ibis was built over the course of a few hours, which is\nabout the time it takes to make a production ready Ibis backend.\n\nBroadly speaking, the Unix backend:\n\n1. Produces a shell command for each Ibis _table_ operation.\n1. Produces a nominal output location for the output of that command, in the form of a [named pipe](https://en.wikipedia.org/wiki/Named_pipe) opened in write mode.\n1. Reads output from the named pipe output location of the root of the expression tree.\n1. Calls `pandas.read_csv` on that output.\n\n::: {.callout-note collapse=\"true\"}\n# Why named pipes?\n\nShell commands only allow a single input from `stdin`.\n\nHowever, joins accept > 1 input so we need a way to stream more than one input to a join operation.\n\nNamed pipes support the semantics of \"unnamed\" pipes (FIFO queue behavior) but\ncan be used in pipelines with nodes that have more a single input since they\nexist as paths on the file system.\n:::\n\n### Expressions\n\nIbis expressions are an abstract representation of an analytics computation\nover tabular data.\n\nIbis ships a public API, whose instances we call *expressions*.\n\nExpressions have an associated type--accessible via their\n[`type()`](../../reference/expression-generic.qmd#ibis.expr.types.generic.Value.type)\nmethod--that determines what methods are available on them.\n\nExpressions are ignorant of their underlying implementation: their\ncomposability is determined solely by their type.\n\nThis type is determined by the expression's underlying *operation*.\n\nThe two-layer model makes it easy to describe operations in terms of the data\ntypes produced by an expression, rather than as instances of a specific class\nin a hierarchy.\n\nThis allows Ibis maintainers to alter expression API implementations without\nchanging those APIs making it easier to maintain and easier to keep stable than\nif we had a complex (but not necessarily deep!) class hierarchy.\n\nOperations, though, are really where the nitty gritty implementation details\nstart.\n\n### Operations\n\nIbis _operations_ are lightweight classes that model the tree structure of a computation.\n\nThey have zero or more inputs, whose types and values are constrained by Ibis's _type system_.\n\nNotably operations are *not* part of Ibis's public API.\n\nWhen we talk about \"compilation\" in Ibis, we're talking about the process of\nconverting an _operation_ into something that the backend knows how to execute.\n\nIn the case of this 1̵0̸0̵%̵ p̶̺̑r̴̛ͅo̵̒ͅḍ̴̌u̷͇͒c̵̠̈t̷͍̿i̶̪͐o̸̳̾n̷͓̄-r̵̡̫̞͓͆̂̏ẽ̸̪̱̽ͅā̸̤̹̘̅̓͝d̵͇̞̏̂̔̽y̴̝͎̫̬͋̇̒̅ Unix backend, each operation\nis compiled into a list of strings that represent the shell command to run to\nexecute the operation.\n\nIn other backends, like DuckDB, these compilation rules produce a sqlglot object.\n\nThe `compile` method is also the place where the backend has a chance to invoke\ncustom rewrite rules over operations.\n\nRewrites are a very useful tool for the Unix backend. For example, the `join`\ncommand (yep, it's in coreutils!) that we use to execute inner joins with this\nbackend requires that the inputs be sorted, otherwise the results won't be\ncorrect. So, I added a rewrite rule that replaces the left and right relations\nin a join operation with equivalent relations sorted on the join keys.\n\nOnce you obtain the output of compile, it's up to the backend what to do next.\n\n### Backend implementation\n\nAt this point we've got our shell commands and some output locations created as\nnamed pipes.\n\nWhat next?\n\nWell, we need to execute the commands and write their output to the corresponding named pipe.\n\nYou might think\n\n> I'll just loop over the operations, open the pipe in write mode and call\n> `subprocess.Popen(cmd, stdout=named_pipe)`.\n\nNot a bad thought, but the semantics of named pipes do not abide such thoughts :)\n\nNamed pipes, when opened in write mode, will block until a corresponding handle\nis opened in *read* mode.\n\nFutures using a scoped thread pool are a decent way to handle this.\n\nThe idea is to launch every node concurrently and then read from the last\nnode's output. This initial read of the root node's output pipe kicks off the\ncascade of other reads necessary to move data through the pipeline.\n\nThe Unix backend thus constructs a scoped `ThreadPoolExecutor()` using\na context manager and submits a task for each operation to the executor.\nImportantly, opening the named pipe in write mode happens **inside** the task,\nto avoid blocking the main thread while waiting for a reader to be opened.\n\nThe final output task's path is then passed directly to `read_csv`, and we've\nnow got the result of our computation.\n\n#### Show me the commands already!\n\nRoger that.\n\n::: {#9cba0561 .cell execution_count=7}\n``` {.python .cell-code}\nexpr = (\n t.filter([t.year == 2009])\n .select(\n \"year\", \"species\", \"flipper_length_mm\", island=lambda t: t.island.lower()\n )\n .group_by(\"island\", \"species\")\n .agg(n=lambda t: t.count(), avg=lambda t: t.island.upper().length().mean())\n .order_by(\"n\")\n .mutate(ilength=lambda t: t.island.length())\n .limit(5)\n)\nprint(unix.explain(expr)) # <1>\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntail --lines +2 /home/cloud/src/ibis/docs/posts/unix-backend/penguins.csv > t0\nawk -F , '{ if (($8 == 2009)) { print }}' t0 > t1\nawk -F , '{ print $8 \",\" $1 \",\" $5 \",\" tolower($2) }' t1 > t2\nawk -F , '{\n agg0[$4\",\"$2]++\n agg1[$4\",\"$2] += length(toupper($4))\n}\nEND { for (key in agg0) print key \",\" agg0[key] \",\" agg1[key]/NR }' t2 > t3\nsort -t , -k 3,3n t3 > t4\nawk -F , '{ print $1 \",\" $2 \",\" $3 \",\" $4 \",\" length($1) }' t4 > t5\nhead --lines 5 t5 > t6\n```\n:::\n:::\n\n\n1. `explain` isn't a public method and not likely to become one any time soon.\n\n## Conclusion\n\nIf you've gotten this far hopefully you've had a good laugh.\n\nLet's wrap up with some final thoughts.\n\n### Things to do\n\n- Join our [Zulip](https://ibis-project.zulipchat.com/)!\n- Open a GitHub [issue](https://github.com/ibis-project/ibis/issues/new/choose)\n or [discussion](https://github.com/ibis-project/ibis/discussions/new/choose)!\n\n### Things to avoid doing\n\n- Putting this into production\n\n", + "supporting": [ + "index_files" + ], + "filters": [], + "includes": { + "include-in-header": [ + "\n\n\n" + ] + } + } +} \ No newline at end of file diff --git a/docs/posts/unix-backend/index.qmd b/docs/posts/unix-backend/index.qmd new file mode 100644 index 000000000000..83f4ccb938f1 --- /dev/null +++ b/docs/posts/unix-backend/index.qmd @@ -0,0 +1,251 @@ +--- +title: "Scaling to infinity and beyond: the Unix backend" +author: "Phillip Cloud" +date: "2024-04-01" +categories: + - blog + - serious + - web-scale + - unix +--- + +## The Unix backend for Ibis + +We're happy to announce a new Ibis backend built on the world's best known web +scale technology: Unix pipes. + +## Why? + +Why not? Pipes rock and they automatically stream data between operators and +scale to your hard drive. + +What's not to love? + +## Demo + +All production ready backends ship with amazing demos. + +The Unix backend is no different. Let's see it in action. + +First we'll install the Unix backend. + +```bash +pip install ibish +``` + +Like all production-ready libraries `ibish` depends on the latest commit of `ibis-framework`. + +Next we'll download some data. + +```{python} +!curl -LsS 'https://storage.googleapis.com/ibis-examples/penguins/20240322T125036Z-9aae2/penguins.csv.gz' | zcat > penguins.csv +``` + +```{python} +import ibis +import ibish + + +ibis.options.interactive = True + +unix = ibish.connect({"p": "penguins.csv"}) + +t = unix.table("p") +t +``` + +Sweet, huh? + +Let's filter the data and look at only the year 2009. + +```{python} +expr = t.filter(t.year == 2009) +expr +``` + +We can sort the result of that too, and filter again. + +```{python} +expr = ( + expr.order_by("species", ibis.desc("bill_length_mm")) + .filter(lambda t: t.island == "Biscoe") +) +expr +``` + +There's even support for joins and aggregations! + +Let's count the number of island, species pairs and sort descending by the count. + +```{python} +expr = ( + t.group_by("island", "species") + .agg(n=lambda t: t.count()) + .order_by(ibis.desc("n")) +) +expr +``` + +For kicks, let's compare that to the DuckDB backend to make sure we're able to count stuff. + +To be extra awesome, we'll *reuse the same expression to do the computation*. + +```{python} +ddb = ibis.duckdb.connect() +ddb.read_csv("penguins.csv", table_name="p") # <1> +ibis.memtable(ddb.to_pyarrow(expr.unbind())) +``` + +1. The `read_csv` is necessary so that the expression's table + name--`p`--matches one inside the DuckDB database. + +## How does it work? + +Glad you asked! + +The Unix backend for Ibis was built over the course of a few hours, which is +about the time it takes to make a production ready Ibis backend. + +Broadly speaking, the Unix backend: + +1. Produces a shell command for each Ibis _table_ operation. +1. Produces a nominal output location for the output of that command, in the form of a [named pipe](https://en.wikipedia.org/wiki/Named_pipe) opened in write mode. +1. Reads output from the named pipe output location of the root of the expression tree. +1. Calls `pandas.read_csv` on that output. + +::: {.callout-note collapse="true"} +# Why named pipes? + +Shell commands only allow a single input from `stdin`. + +However, joins accept > 1 input so we need a way to stream more than one input to a join operation. + +Named pipes support the semantics of "unnamed" pipes (FIFO queue behavior) but +can be used in pipelines with nodes that have more a single input since they +exist as paths on the file system. +::: + +### Expressions + +Ibis expressions are an abstract representation of an analytics computation +over tabular data. + +Ibis ships a public API, whose instances we call *expressions*. + +Expressions have an associated type--accessible via their +[`type()`](../../reference/expression-generic.qmd#ibis.expr.types.generic.Value.type) +method--that determines what methods are available on them. + +Expressions are ignorant of their underlying implementation: their +composability is determined solely by their type. + +This type is determined by the expression's underlying *operation*. + +The two-layer model makes it easy to describe operations in terms of the data +types produced by an expression, rather than as instances of a specific class +in a hierarchy. + +This allows Ibis maintainers to alter expression API implementations without +changing those APIs making it easier to maintain and easier to keep stable than +if we had a complex (but not necessarily deep!) class hierarchy. + +Operations, though, are really where the nitty gritty implementation details +start. + +### Operations + +Ibis _operations_ are lightweight classes that model the tree structure of a computation. + +They have zero or more inputs, whose types and values are constrained by Ibis's _type system_. + +Notably operations are *not* part of Ibis's public API. + +When we talk about "compilation" in Ibis, we're talking about the process of +converting an _operation_ into something that the backend knows how to execute. + +In the case of this 1̵0̸0̵%̵ p̶̺̑r̴̛ͅo̵̒ͅḍ̴̌u̷͇͒c̵̠̈t̷͍̿i̶̪͐o̸̳̾n̷͓̄-r̵̡̫̞͓͆̂̏ẽ̸̪̱̽ͅā̸̤̹̘̅̓͝d̵͇̞̏̂̔̽y̴̝͎̫̬͋̇̒̅ Unix backend, each operation +is compiled into a list of strings that represent the shell command to run to +execute the operation. + +In other backends, like DuckDB, these compilation rules produce a sqlglot object. + +The `compile` method is also the place where the backend has a chance to invoke +custom rewrite rules over operations. + +Rewrites are a very useful tool for the Unix backend. For example, the `join` +command (yep, it's in coreutils!) that we use to execute inner joins with this +backend requires that the inputs be sorted, otherwise the results won't be +correct. So, I added a rewrite rule that replaces the left and right relations +in a join operation with equivalent relations sorted on the join keys. + +Once you obtain the output of compile, it's up to the backend what to do next. + +### Backend implementation + +At this point we've got our shell commands and some output locations created as +named pipes. + +What next? + +Well, we need to execute the commands and write their output to the corresponding named pipe. + +You might think + +> I'll just loop over the operations, open the pipe in write mode and call +> `subprocess.Popen(cmd, stdout=named_pipe)`. + +Not a bad thought, but the semantics of named pipes do not abide such thoughts :) + +Named pipes, when opened in write mode, will block until a corresponding handle +is opened in *read* mode. + +Futures using a scoped thread pool are a decent way to handle this. + +The idea is to launch every node concurrently and then read from the last +node's output. This initial read of the root node's output pipe kicks off the +cascade of other reads necessary to move data through the pipeline. + +The Unix backend thus constructs a scoped `ThreadPoolExecutor()` using +a context manager and submits a task for each operation to the executor. +Importantly, opening the named pipe in write mode happens **inside** the task, +to avoid blocking the main thread while waiting for a reader to be opened. + +The final output task's path is then passed directly to `read_csv`, and we've +now got the result of our computation. + +#### Show me the commands already! + +Roger that. + +```{python} +expr = ( + t.filter([t.year == 2009]) + .select( + "year", "species", "flipper_length_mm", island=lambda t: t.island.lower() + ) + .group_by("island", "species") + .agg(n=lambda t: t.count(), avg=lambda t: t.island.upper().length().mean()) + .order_by("n") + .mutate(ilength=lambda t: t.island.length()) + .limit(5) +) +print(unix.explain(expr)) # <1> +``` + +1. `explain` isn't a public method and not likely to become one any time soon. + +## Conclusion + +If you've gotten this far hopefully you've had a good laugh. + +Let's wrap up with some final thoughts. + +### Things to do + +- Join our [Zulip](https://ibis-project.zulipchat.com/)! +- Open a GitHub [issue](https://github.com/ibis-project/ibis/issues/new/choose) + or [discussion](https://github.com/ibis-project/ibis/discussions/new/choose)! + +### Things to avoid doing + +- Putting this into production