diff --git a/docs/_freeze/posts/1tbc/index/execute-results/html.json b/docs/_freeze/posts/1tbc/index/execute-results/html.json new file mode 100644 index 000000000000..6c87b8cae8ac --- /dev/null +++ b/docs/_freeze/posts/1tbc/index/execute-results/html.json @@ -0,0 +1,16 @@ +{ + "hash": "dd199dc50ff61d602ee3b185a476aad8", + "result": { + "engine": "jupyter", + "markdown": "---\ntitle: \"Querying 1TB on a laptop with Python dataframes\"\nauthor: \"Cody Peterson\"\ndate: \"2024-07-08\"\nimage: ibis-duckdb-sort.gif\ncategories:\n - benchmark\n - duckdb\n - datafusion\n - polars\n---\n\n***TPC-H benchmark at `sf=1024` via DuckDB, DataFusion, and Polars on a MacBook\nPro with 96GiB of RAM.***\n\n---\n\npandas requires your dataframe to fit in memory. Out-of-memory (OOM) errors are\ncommon when working on larger datasets, though the corresponding size of data on\ndisk can be surprising. The creator of pandas and Ibis noted in [\"Apache\nArrow and the '10 Things I Hate About\npandas'\"](https://wesmckinney.com/blog/apache-arrow-pandas-internals):\n\n> To put it simply, **we weren’t thinking about analyzing 100 GB or 1 TB datasets\n> in 2011**. [In 2017], my rule of thumb for pandas is that **you should have 5 to\n> 10 times as much RAM as the size of your dataset**. So if you have a 10 GB\n> dataset, you should really have about 64, preferably 128 GB of RAM if you want\n> to avoid memory management problems. This comes as a shock to users who expect\n> to be able to analyze datasets that are within a factor of 2 or 3 the size of\n> their computer’s RAM.\n\nToday with Ibis you can reliably and efficiently process a 1TB dataset on a\nlaptop with <1/10th the RAM.\n\n:::{.callout-important}\nThis represents **a 50-100X improvement** in RAM requirements for Python\ndataframes in just 7 years thanks to [composable data\nsystems](https://wesmckinney.com/blog/looking-back-15-years) and [hard work by\nthe DuckDB team](https://duckdb.org/2024/06/26/benchmarks-over-time).\n:::\n\n## Exploring the data with Python dataframes\n\nI've generated ~1TB (`sf=1024`) of [TPC-H data](https://www.tpc.org/tpch) on my\nMacBook Pro with 96 GiB of RAM. We'll start exploring it with pandas, Polars,\nand Ibis and discuss where and why they start to struggle.\n\n:::{.callout-tip title=\"Generating the data\" collapse=\"true\"}\nSee [the previous post](../ibis-bench/index.qmd#reproducing-the-benchmark) for\ninstructions on generating the data. 
I used `bench gen-data -s 1024 -n 128`,\npartitioning the data to avoid OOM errors while it generated.\n\nI'd recommend instead generating a smaller scale factor and copying it as many\ntimes as needed, as generating the data at `sf=1024` can take a long time.\n:::\n\nTo follow along, install the required packages:\n\n```bash\npip install pandas 'ibis-framework[duckdb,datafusion]' polars-u64-idx plotly\n```\n\n:::{.callout-note title=\"Why polars-u64-idx?\" collapse=\"true\"}\nWe need to use `polars-u64-idx` instead of `polars` [to work with >4.2 billion\nrows](https://docs.pola.rs/user-guide/installation/#big-index).\n:::\n\nImports and setup:\n\n::: {#10f98bb4 .cell execution_count=2}\n``` {.python .cell-code}\nimport os\nimport glob\nimport ibis\nimport pandas as pd\nimport polars as pl\nimport plotly.express as px\n\npx.defaults.template = \"plotly_dark\"\nibis.options.interactive = True\n```\n:::\n\n\n\n\nLet's check the number of rows across all tables in the TPC-H data:\n\n::: {#4582c6e3 .cell execution_count=4}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to get number of rows in TPC-H data\"}\nsf = 1024\nn = 128\ndata_dir = f\"tpch_data/parquet/sf={sf}/n={n}\"\ntables = glob.glob(f\"{data_dir}/*\")\n\ntotal_rows = 0\n\nfor table in tables:\n t = ibis.read_parquet(f\"{table}/*.parquet\")\n total_rows += t.count().to_pyarrow().as_py()\n\nprint(f\"total rows: {total_rows:,}\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntotal rows: 8,867,848,906\n```\n:::\n:::\n\n\nOver 8.8 billion rows!\n\nWe can compute and visualize the sizes of the tables in the TPC-H data (as\ncompressed Parquet files on disk):\n\n::: {#93ee34d9 .cell execution_count=5}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to get sizes of tables in TPC-H data\"}\ndef get_dir_size(path):\n from pathlib import Path\n\n return sum(p.stat().st_size for p in Path(path).rglob(\"*\") if p.is_file())\n\n\nsizes = [get_dir_size(table) for table in tables]\nnames = [os.path.basename(table) for table in tables]\n\ntmp = ibis.memtable({\"name\": names, \"size\": sizes})\ntmp = tmp.mutate(size_gb=tmp[\"size\"] / (1024**3))\ntmp = tmp.mutate(size_gb_mem=tmp[\"size_gb\"] * 11 / 5)\ntmp = tmp.order_by(ibis.desc(\"size_gb\"))\n\nc = px.bar(\n tmp,\n x=\"name\",\n y=\"size_gb\",\n title=\"table sizes in TPC-H data\",\n hover_data=[\"size_gb_mem\"],\n labels={\n \"name\": \"table name\",\n \"size_gb\": \"size (GB on-disk in compressed Parquet files)\",\n \"size_gb_mem\": \"size (approximate GB in memory)\",\n },\n)\n\nprint(\n f\"total size: {tmp['size_gb'].sum().to_pyarrow().as_py():,.2f}GBs (compressed Parquet files)\"\n)\nc\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntotal size: 407.40GBs (compressed Parquet files)\n```\n:::\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nIn-memory this would be about 1TB. Uncompressed CSV files would be >1TB on disk.\n\nLet's explore the largest table, `lineitem`. This table in memory is ~6X larger\nthan RAM.\n\n::: {#def68271 .cell execution_count=6}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to explore the lineitem table\"}\ntable_name = \"lineitem\"\ndata = f\"{data_dir}/{table_name}/*.parquet\"\n\nt = ibis.read_parquet(data)\nprint(f\"rows: {t.count().to_pyarrow().as_py():,} | columns: {len(t.columns)}\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nrows: 6,144,008,876 | columns: 18\n```\n:::\n:::\n\n\nOver 6 billion rows!\n\nLet's try to display the first few rows with Ibis, pandas, and Polars:\n\n::: {.panel-tabset}\n\n## Ibis\n\n::: {#01bda6fa .cell execution_count=7}\n``` {.python .cell-code}\nt = ibis.read_parquet(data)\nt.head(3)\n```\n\n::: {.cell-output .cell-output-display execution_count=6}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┓\n┃ l_orderkey  l_partkey  l_suppkey  l_linenumber  l_quantity      l_extendedprice  l_discount      l_tax           l_returnflag  l_linestatus  l_shipdate  l_commitdate  l_receiptdate  l_shipinstruct     l_shipmode  l_comment                           n      sf    ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━┩\n│ int64int64int64int64decimal(15, 2)decimal(15, 2)decimal(15, 2)decimal(15, 2)stringstringdatedatedatestringstringstringint64int64 │\n├────────────┼───────────┼───────────┼──────────────┼────────────────┼─────────────────┼────────────────┼────────────────┼──────────────┼──────────────┼────────────┼──────────────┼───────────────┼───────────────────┼────────────┼────────────────────────────────────┼───────┼───────┤\n│          11589138897873905117.0032213.980.040.02N           O           1996-03-131996-02-121996-03-22DELIVER IN PERSONTRUCK     to beans x-ray carefull           1281024 │\n│          1689244987484499236.0054685.800.090.06N           O           1996-04-121996-02-281996-04-20TAKE BACK RETURN MAIL       according to the final foxes. qui1281024 │\n│          165228571378857238.0011970.480.100.02N           O           1996-01-291996-03-051996-01-31TAKE BACK RETURN REG AIR   ourts cajole above the furiou     1281024 │\n└────────────┴───────────┴───────────┴──────────────┴────────────────┴─────────────────┴────────────────┴────────────────┴──────────────┴──────────────┴────────────┴──────────────┴───────────────┴───────────────────┴────────────┴────────────────────────────────────┴───────┴───────┘\n
\n```\n:::\n:::\n\n\n## pandas\n\n```{.python}\ndf = pd.concat([pd.read_parquet(f) for f in glob.glob(data)], ignore_index=True) # <1>\ndf.head(3)\n```\n\n1. Work around lack of reading multiple parquet files in pandas\n\n```html\nThe Kernel crashed while executing code in the current cell or a previous cell.\nPlease review the code in the cell(s) to identify a possible cause of the failure.\nClick here for more info.\nView Jupyter log for further details.\n```\n\n## Polars (eager)\n\n```{.python}\ndf = pl.read_parquet(data)\ndf.head(3)\n```\n\n```html\nThe Kernel crashed while executing code in the current cell or a previous cell.\nPlease review the code in the cell(s) to identify a possible cause of the failure.\nClick here for more info.\nView Jupyter log for further details.\n```\n\n## Polars (lazy)\n\n::: {#37880636 .cell execution_count=8}\n``` {.python .cell-code}\ndf = pl.scan_parquet(data)\ndf.head(3).collect()\n```\n\n::: {.cell-output .cell-output-display execution_count=7}\n```{=html}\n
[rendered Polars DataFrame output: shape (3, 16) — the same first three `lineitem` rows shown in the Ibis tab above, without the `n` and `sf` partition columns]
\n```\n:::\n:::\n\n\n## Polars (lazy, streaming)\n\n::: {#e85a3292 .cell execution_count=9}\n``` {.python .cell-code}\ndf = pl.scan_parquet(data)\ndf.head(3).collect(streaming=True)\n```\n\n::: {.cell-output .cell-output-display execution_count=8}\n```{=html}\n
[rendered Polars DataFrame output (streaming): shape (3, 16) — the same first three `lineitem` rows shown in the Ibis tab above]
\n```\n:::\n:::\n\n\n:::\n\nIbis, with the default backend of DuckDB, can display the first few rows. Polars\n(lazy) can too in regular and streaming mode. For lazily computation, an\nunderlying query engine has the opportunity to determine a subset of data to be\nread into memory that satisfies a given query. For example, to display any three\nrows from the `lineitem` table it can just read the first three rows from the\nfirst Parquet file in the dataset.\n\nBoth pandas and Polars (eager) crash Python as they must load all the data into\nmemory to construct their dataframes. This is expected because the table in\nmemory ~6X larger than our 96GiB of RAM.\n\n:::{.callout-tip title=\"Visualize the Ibis expression tree\" collapse=\"true\"}\n\n::: {#84082a56 .cell execution_count=10}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to visualize the Ibis expression tree\"}\nfrom ibis.expr.visualize import to_graph\n\nto_graph(t.head(3))\n```\n\n::: {.cell-output .cell-output-display execution_count=9}\n![](index_files/figure-html/cell-10-output-1.svg){}\n:::\n:::\n\n\n:::\n\nLet's try something more challenging: [partially\nsorting](https://en.wikipedia.org/wiki/Partial_sorting) the `lineitem` table.\nThis forces at least some columns from all rows of data to pass through the\nquery engine to determine the top 3 rows per the specified ordering. Since the\ndata is larger than RAM, only \"streaming\" engines can handle this. We'll try\nwith the methods that worked on the previous query and add in the DataFusion\nbackend for Ibis.\n\n::: {.panel-tabset}\n\n## Ibis (DuckDB)\n\n```{.python}\nibis.set_backend(\"duckdb\")\nt = ibis.read_parquet(data)\nt.order_by(t[\"l_orderkey\"], t[\"l_partkey\"], t[\"l_suppkey\"]).head(3)\n```\n\n::: {#2370b7da .cell execution_count=11}\n\n::: {.cell-output .cell-output-display execution_count=10}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┓\n┃ l_orderkey  l_partkey  l_suppkey  l_linenumber  l_quantity      l_extendedprice  l_discount      l_tax           l_returnflag  l_linestatus  l_shipdate  l_commitdate  l_receiptdate  l_shipinstruct     l_shipmode  l_comment                 n      sf    ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━┩\n│ int64int64int64int64decimal(15, 2)decimal(15, 2)decimal(15, 2)decimal(15, 2)stringstringdatedatedatestringstringstringint64int64 │\n├────────────┼───────────┼───────────┼──────────────┼────────────────┼─────────────────┼────────────────┼────────────────┼──────────────┼──────────────┼────────────┼──────────────┼───────────────┼───────────────────┼────────────┼──────────────────────────┼───────┼───────┤\n│          121826514742652428.0048539.400.090.06N           O           1996-04-211996-03-301996-05-16NONE             AIR       s cajole busily above t 1281024 │\n│          116009676649679632.0050715.840.070.02N           O           1996-01-301996-02-071996-02-03DELIVER IN PERSONMAIL      rouches. special        1281024 │\n│          1246032741563281524.0028224.960.100.04N           O           1996-03-301996-03-141996-04-01NONE             FOB        the regular, regular pa1281024 │\n└────────────┴───────────┴───────────┴──────────────┴────────────────┴─────────────────┴────────────────┴────────────────┴──────────────┴──────────────┴────────────┴──────────────┴───────────────┴───────────────────┴────────────┴──────────────────────────┴───────┴───────┘\n
\n```\n:::\n:::\n\n\n![CPU/RAM while Ibis with the DuckDB backend sorting](ibis-duckdb-sort.gif)\n\n## Ibis (DataFusion)\n\n::: {#b07e6926 .cell execution_count=12}\n``` {.python .cell-code}\nibis.set_backend(\"datafusion\")\nt = ibis.read_parquet(data)\nt.order_by(t[\"l_orderkey\"], t[\"l_partkey\"], t[\"l_suppkey\"]).head(3)\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ l_orderkey  l_partkey  l_suppkey  l_linenumber  l_quantity      l_extendedprice  l_discount      l_tax           l_returnflag  l_linestatus  l_shipdate  l_commitdate  l_receiptdate  l_shipinstruct     l_shipmode  l_comment                ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64int64int64int64decimal(15, 2)decimal(15, 2)decimal(15, 2)decimal(15, 2)stringstringdatedatedatestringstringstring                   │\n├────────────┼───────────┼───────────┼──────────────┼────────────────┼─────────────────┼────────────────┼────────────────┼──────────────┼──────────────┼────────────┼──────────────┼───────────────┼───────────────────┼────────────┼──────────────────────────┤\n│          121826514742652428.0048539.400.090.06N           O           1996-04-211996-03-301996-05-16NONE             AIR       s cajole busily above t  │\n│          116009676649679632.0050715.840.070.02N           O           1996-01-301996-02-071996-02-03DELIVER IN PERSONMAIL      rouches. special         │\n│          1246032741563281524.0028224.960.100.04N           O           1996-03-301996-03-141996-04-01NONE             FOB        the regular, regular pa │\n└────────────┴───────────┴───────────┴──────────────┴────────────────┴─────────────────┴────────────────┴────────────────┴──────────────┴──────────────┴────────────┴──────────────┴───────────────┴───────────────────┴────────────┴──────────────────────────┘\n
\n```\n:::\n:::\n\n\n![CPU/RAM while Ibis with the DataFusion backend sorting](ibis-datafusion-sort.gif)\n\n## Polars (lazy)\n\n```{.python}\ndf = pl.scan_parquet(data)\n(\n df.sort(pl.col(\"l_orderkey\"), pl.col(\"l_partkey\"), pl.col(\"l_suppkey\"))\n .head(3)\n .collect()\n)\n```\n\n```html\nThe Kernel crashed while executing code in the current cell or a previous cell.\nPlease review the code in the cell(s) to identify a possible cause of the failure.\nClick here for more info.\nView Jupyter log for further details.\n```\n\n![CPU/RAM while Polars with the lazy API sorting](polars-lazy-sort.gif)\n\n## Polars (lazy, streaming)\n\n```{.python}\ndf = pl.scan_parquet(data)\n(\n df.sort(pl.col(\"l_orderkey\"), pl.col(\"l_partkey\"), pl.col(\"l_suppkey\"))\n .head(3)\n .collect(streaming=True)\n)\n```\n\n```html\nPanicException: called `Result::unwrap()` on an `Err` value: \"SendError(..)\"\n```\n\nSee [GitHub\nissue](https://github.com/pola-rs/polars/issues/17289#issuecomment-2200469528).\n\n![CPU/RAM while Polars with the lazy API, streaming engine sorting](polars-lazy-streaming-sort.gif)\n\n:::\n\n:::{.callout-tip title=\"Visualize the Ibis expression tree\" collapse=\"true\"}\n\n::: {#83b9f9ff .cell execution_count=13}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to visualize the Ibis expression tree\"}\nfrom ibis.expr.visualize import to_graph\n\nto_graph(t.order_by(t[\"l_orderkey\"], t[\"l_partkey\"], t[\"l_suppkey\"]).head(3))\n```\n\n::: {.cell-output .cell-output-display execution_count=12}\n![](index_files/figure-html/cell-13-output-1.svg){}\n:::\n:::\n\n\n:::\n\nIbis with the DuckDB and DataFusion backends complete this in about 2 minutes\neach. Polars (lazy) crashes the kernel after about 2 minutes with its default\nmode and panics in streaming mode.\n\n**Streaming is an overloaded term here**. In the context of Ibis, a streaming\nbackend refers to a near real-time data processing engine like [Apache\nFlink](https://ibis-project.org/backends/flink) or\n[RisingWave](https://ibis-project.org/backends/risingwave). In the context of\nPolars, streaming is a separate engine from the default that can handle\nlarger-than-memory data. This general paradigm is already used by DuckDB and\nDataFusion, hence their ability to complete the above query. [The Polars team\ndoes not recommend using their current streaming engine for\nbenchmarking](https://github.com/pola-rs/polars/issues/16694#issuecomment-2146668559)\nand has [announced a new version of their streaming\nengine](https://pola.rs/posts/announcing-polars-1/#new-engine-design).\n\nAs we'll see in the benchmark result, some queries will fail to complete with\nPolars and DataFusion. These queries are killed by the operating system due to a\nlack of memory.\n\n:::{.callout-tip title=\"Sampling large datasets with Ibis\" collapse=\"true\"}\nIf we want to work with pandas or Polars dataframes at larger scales, we can use\nIbis to sample or filter the data (and perform any other operations) with\ncomputation pushed to a more scalable backend. Then just output the Ibis\ndataframe to pandas or Polars for downstream use:\n\n\n\n:::{.panel-tabset}\n\n## pandas\n\n::: {#66fb5bcd .cell execution_count=15}\n``` {.python .cell-code}\nt = ibis.read_parquet(data)\n\ndf = (\n t.sample(fraction=0.0001)\n .order_by(t[\"l_orderkey\"], t[\"l_partkey\"], t[\"l_suppkey\"])\n .to_pandas()\n)\ndf.head(3)\n```\n\n::: {.cell-output .cell-output-display execution_count=14}\n```{=html}\n
[rendered pandas DataFrame output: three sampled `lineitem` rows, 18 columns including the `n` and `sf` partition columns]
\n```\n:::\n:::\n\n\n## Polars\n\n::: {#74b1e54c .cell execution_count=16}\n``` {.python .cell-code}\nt = ibis.read_parquet(data)\n\ndf = (\n t.sample(fraction=0.0001)\n .order_by(t[\"l_orderkey\"], t[\"l_partkey\"], t[\"l_suppkey\"])\n .to_polars()\n)\ndf.head(3)\n```\n\n::: {.cell-output .cell-output-display execution_count=15}\n```{=html}\n
[rendered Polars DataFrame output: shape (3, 18) — three sampled `lineitem` rows]
\n```\n:::\n:::\n\n\n:::\n\nWe can also use this to iterate more quickly on a subset of data with Ibis to\nconstruct our queries. Once we're happy with them, we can change one line of\ncode to run them on the full data.\n\n:::\n\n## 1TB TPC-H benchmark results\n\nLet's delve into the results of benchmarking ~1TB (`sf=1024`) TPC-H queries on a\nlaptop.\n\n:::{.callout-important title=\"Not an official TPC-H benchmark\"}\nThis is not an [official TPC-H benchmark](https://www.tpc.org/tpch). We ran a\nderivate of the TPC-H benchmark.\n:::\n\n:::{.callout-warning title=\"Key differences from previous benchmarking\"}\nSee [the prior benchmark post](../ibis-bench/index.qmd) for more details and key\nconsiderations. Key differences in this iteration include:\n\n1. `polars-u64-idx` was used instead of `polars`\n2. [Some Polars queries were\n updated](https://github.com/lostmygithubaccount/ibis-bench/pull/5)\n3. Parquet files were generated with `n=128` partitions\n - this was done to avoid OOM errors when generating the data\n - this should have little impact on the query execution time\n4. Queries 18 and 21 for Polars, 9 and 18 for DataFusion were skipped\n - they ran for a very long time without completing or failing\n - the prior benchmark indicates these queries would likely eventually fail\n\nThe Python package versions used were:\n\n- `ibis-framework==9.1.0`\n- `datafusion==38.0.1`\n- `duckdb==1.0.0`\n- `polars-u64-idx==1.0.0`\n\nThe three systems tested were:\n\n- `ibis-duckdb`: Ibis dataframe code on the DuckDB backend\n- `ibis-datafusion`: Ibis dataframe code on the DataFusion backend\n- `polars-lazy`: Polars (lazy API) dataframe code\n:::\n\nTo follow along, install the required packages:\n\n```bash\npip install 'ibis-framework[duckdb]' gcsfs plotly great-tables\n```\n\nThe code for reading and analyzing the data is collapsed below.\n\n::: {#5cf05c32 .cell execution_count=17}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to read and analyze the benchmark data\"}\nimport ibis\nimport gcsfs\nimport plotly.express as px\n\nfrom great_tables import GT, md\n\npx.defaults.template = \"plotly_dark\"\n\nibis.set_backend(\"duckdb\")\nibis.options.interactive = True\nibis.options.repr.interactive.max_rows = 3\n\nfs = gcsfs.GCSFileSystem()\nibis.get_backend().register_filesystem(fs)\n\nt = (\n ibis.read_parquet(\n \"gs://ibis-bench/1tbc/cache/file_id=*.parquet\",\n )\n .select(\n \"system\",\n \"sf\",\n \"n_partitions\",\n \"query_number\",\n \"execution_seconds\",\n \"timestamp\",\n )\n .mutate(timestamp=ibis._[\"timestamp\"].cast(\"timestamp\"))\n .order_by(\"system\", \"query_number\")\n .cache()\n)\n\nsystems = sorted(t.distinct(on=\"system\")[\"system\"].collect().to_pyarrow().as_py())\n\nagg = (\n t.mutate(\n run_num=ibis.row_number().over(\n group_by=[\"system\", \"sf\", \"n_partitions\", \"query_number\"],\n order_by=[\"timestamp\"],\n )\n )\n .relocate(t.columns[:4], \"run_num\")\n .group_by(\"system\", \"query_number\", \"run_num\")\n .agg(execution_seconds=ibis._[\"execution_seconds\"].mean())\n .order_by(\"system\", \"query_number\", \"run_num\")\n)\nagg2 = (\n agg.group_by(\"system\", \"query_number\")\n .agg(avg_execution_seconds=agg.execution_seconds.mean().round(2))\n .order_by(\"system\", \"query_number\")\n)\npiv = agg2.pivot_wider(\n names_from=\"system\", values_from=[\"avg_execution_seconds\"]\n).order_by(\"query_number\")\n\n\ndef x_vs_y(piv, x, y):\n return ibis.ifelse(\n piv[x] < piv[y],\n -1,\n 1,\n ) * (\n (\n (piv[x] - piv[y])\n / ibis.ifelse(\n 
piv[y] > piv[x],\n piv[x],\n piv[y],\n )\n ).abs()\n ).round(4)\n\n\ncomparisons = [\n (\"ibis-datafusion\", \"ibis-duckdb\"),\n (\"polars-lazy\", \"ibis-datafusion\"),\n (\"polars-lazy\", \"ibis-duckdb\"),\n]\n\ncomparisons = {f\"{x}_v_{y}\": x_vs_y(piv, x, y) for x, y in comparisons}\n\npiv2 = piv.mutate(**comparisons)\npiv2 = piv2.order_by(\"query_number\").relocate(\"query_number\", systems)\n\nagg3 = (\n agg2.group_by(\"system\")\n .agg(\n queries_completed=agg2[\"avg_execution_seconds\"].count(),\n execution_seconds=agg2[\"avg_execution_seconds\"].sum().round(2),\n seconds_per_query=agg2[\"avg_execution_seconds\"].mean().round(2),\n )\n .order_by(ibis.desc(\"queries_completed\"))\n)\nagg3\n```\n\n::: {.cell-output .cell-output-display execution_count=16}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓\n┃ system           queries_completed  execution_seconds  seconds_per_query ┃\n┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩\n│ stringint64float64float64           │\n├─────────────────┼───────────────────┼───────────────────┼───────────────────┤\n│ ibis-duckdb    221448.4265.84 │\n│ ibis-datafusion171182.2369.54 │\n│ polars-lazy    131995.16153.47 │\n└─────────────────┴───────────────────┴───────────────────┴───────────────────┘\n
\n```\n:::\n:::\n\n\n`ibis-duckdb` completed all 22/22 queries **in under 30 minutes**. If you need\nto run batch data jobs on a similar amount of data, a laptop might be all you\nneed!\n\n`ibis-datafusion` only completed 17/22 queries, though recall [3 are failing due\nto a bug that's already been\nfixed](../ibis-bench/index.qmd#failing-datafusion-queries). A new Python release\nfor DataFusion hasn't been made yet, so we ran with the old version. Assuming\nthose queries would complete, only 2 queries would be failing due to lack of\nmemory. More investigation would be needed to determine the work needed for all\n22 queries to pass under these conditions.\n\n`polars-lazy` only completed 13/22 queries, with 8 failing due lack of memory.\nThe [new streaming\nengine](https://pola.rs/posts/announcing-polars-1/#new-engine-design) will\nlikely help with this.\n\nLet's plot execution time for each query and system:\n\n:::{.callout-tip title=\"You can de-select systems in the legend\"}\nIt might be easier to look at 2 systems at a time. You can click on a system in\nthe legend of the plot to de-select it.\n:::\n\n::: {#f03b08fb .cell execution_count=18}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to plot execution time by query and system\"}\nc = px.bar(\n agg2,\n x=\"query_number\",\n y=\"avg_execution_seconds\",\n title=\"Average execution time by query\",\n color=\"system\",\n barmode=\"group\",\n log_y=True,\n)\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nLet's show a [Great Tables](https://github.com/posit-dev/great-tables) table of\npivoted data including relative speed differences between the systems:\n\n::: {#5ce12f7f .cell execution_count=19}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to create Great Table table from pivoted aggregated benchmark data\"}\ncolor_palette = \"plasma\"\nna_color = \"black\"\nstyle_color = \"cyan\"\n\ntbl = (\n GT(\n piv2.mutate(**{\" \": ibis.literal(\"\")})\n .select(\n \"query_number\",\n *systems,\n \" \",\n *list(comparisons.keys()),\n )\n .to_polars()\n )\n .opt_stylize(\n style=1,\n color=style_color,\n )\n .tab_header(\n title=md(\"1TB (`sf=1024`) TPC-H queries\"),\n subtitle=md(\"*on a laptop* (MacBook Pro | Apple M2 Max | 96GiB RAM)\"),\n )\n .tab_spanner(label=\"execution time (seconds)\", columns=systems)\n .tab_spanner(label=\" \", columns=\" \")\n .tab_spanner(label=\"relative speed difference†\", columns=list(comparisons))\n .tab_source_note(\n source_note=md(\n \"†[Relative speed difference formula](https://docs.coiled.io/blog/tpch#measurements), with negative values indicating A was faster than B for A_v_B\"\n )\n )\n .tab_source_note(\n source_note=md(\n \"Benchmark results source data (public bucket): `gs://ibis-bench/1tbc/cache/file_id=*.parquet`\"\n )\n )\n .fmt_percent(list(comparisons), decimals=2, scale_values=True)\n .data_color(\n columns=systems,\n domain=[0, agg2[\"avg_execution_seconds\"].max().to_pyarrow().as_py()],\n palette=color_palette,\n na_color=na_color,\n )\n .data_color(\n columns=\" \",\n palette=[\"#333333\", \"#333333\"],\n )\n .data_color(\n columns=list(comparisons),\n domain=[\n min(\n [piv2[c].min().to_pyarrow().as_py() for c in list(comparisons)],\n ),\n max(\n [piv2[c].max().to_pyarrow().as_py() for c in list(comparisons)],\n ),\n ],\n palette=color_palette,\n na_color=na_color,\n )\n)\ntbl\n```\n\n::: {.cell-output .cell-output-display execution_count=18}\n```{=html}\n
1TB (`sf=1024`) TPC-H queries
*on a laptop* (MacBook Pro | Apple M2 Max | 96GiB RAM)

Execution time (seconds) per system, followed by the relative speed difference†:

| query_number | ibis-datafusion | ibis-duckdb | polars-lazy | ibis-datafusion_v_ibis-duckdb | polars-lazy_v_ibis-datafusion | polars-lazy_v_ibis-duckdb |
|---|---|---|---|---|---|---|
| 1 | 87.67 | 77.45 | None | 13.20% | None | None |
| 2 | 8.68 | 7.38 | 9.44 | 17.62% | 8.76% | 27.91% |
| 3 | 50.43 | 48.72 | None | 3.51% | None | None |
| 4 | 34.81 | 33.29 | 73.72 | 4.57% | 111.78% | 121.45% |
| 5 | 61.43 | 51.99 | None | 18.16% | None | None |
| 6 | 33.05 | 33.8 | 79.69 | −2.27% | 141.12% | 135.77% |
| 7 | 90.75 | 102.99 | 305.02 | −13.49% | 236.11% | 196.16% |
| 8 | 72.98 | 62.05 | None | 17.61% | None | None |
| 9 | None | 116.41 | None | None | None | None |
| 10 | 75.63 | 58.34 | 262.09 | 29.64% | 246.54% | 349.25% |
| 11 | 22.25 | 10.22 | 25.31 | 117.71% | 13.75% | 147.65% |
| 12 | 54.09 | 46.86 | 126.77 | 15.43% | 134.37% | 170.53% |
| 13 | 60.72 | 48.57 | None | 25.02% | None | None |
| 14 | 40.06 | 38.79 | 101.46 | 3.27% | 153.27% | 161.56% |
| 15 | 73.67 | 69.39 | 63.32 | 6.17% | −16.35% | −9.59% |
| 16 | None | 9.48 | 10.93 | None | None | 15.30% |
| 17 | 252.67 | 54.44 | None | 364.13% | None | None |
| 18 | None | 350.98 | None | None | None | None |
| 19 | 79.78 | 64.85 | 422.94 | 23.02% | 430.13% | 552.18% |
| 20 | 83.56 | 41.18 | 498.25 | 102.91% | 496.28% | 1,109.93% |
| 21 | None | 110.43 | None | None | None | None |
| 22 | None | 10.81 | 16.22 | None | None | 50.05% |

†[Relative speed difference formula](https://docs.coiled.io/blog/tpch#measurements), with negative values indicating A was faster than B for A_v_B

Benchmark results source data (public bucket): `gs://ibis-bench/1tbc/cache/file_id=*.parquet`
\n \n```\n:::\n:::\n\n\nYou can use the code above to further explore and visualize the results.\n\n## Why does this matter?\n\nThe ability to run all 1TB TPC-H queries on a relatively standard laptop with\nminimal setup represents a significant shift in the Python data ecosystem that\nbenefits individual data practitioners and organizations.\n\n### Scale up, then scale out\n\nDistributed systems are hard and introduce complexity for data workloads. While\ndistributed OLAP query engines have their place, the threshold for considering\nthem against a single-node OLAP query engine has been raised drastically over\nthe last few years. You can [see how much DuckDB has improved over the\nyears](https://duckdb.org/2024/06/26/benchmarks-over-time) and it shows in this\nbenchmark.\n\nIt's a good idea to start with a single node and see how far you can get. You'll\nneed to consider the tradeoffs for your own situation to make a decision. With\nIbis, you can write your queries once and try them on different engines to see\nwhich is best for your workload.\n\n### Composable data systems are here\n\nIbis separates the query from the engine. It translates dataframe code into an\nintermediate representation (IR) in the backend's native language -- often SQL,\nsometimes other Python dataframe code. This separation allows you **to use a\nsingle dataframe API for the best engine(s) across your workload(s)**.\n\nIf you need to analyze data in\n[Postgres](https://ibis-project.org/backends/postgres), you can use Ibis. If you\nneed to [speed that up with\nDuckDB](https://duckdb.org/2022/09/30/postgres-scanner.html), you can [use\nIbis](https://ibis-project.org/backends/duckdb#ibis.backends.duckdb.Backend.read_postgres).\nIf you need to scale out with [Dask](https://ibis-project.org/backends/dask) or\n[PySpark](https://ibis-project.org/backends/pyspark) or\n[Trino](https://ibis-project.org/backends/trino), you can use Ibis. If you need\nto [scale out on distributed GPUs you can use\nIbis](../why-voda-supports-ibis/index.qmd). If another query engine comes along\nand is best for your workload, you can probably use Ibis. New backends are\nfairly easy to add!\n\n### It's efficient\n\nHow much money does your organization spend on data transformation per terabyte?\nUsing [the GCP pricing calculator](https://cloud.google.com/products/calculator)\nwe'll sample the monthly cost of some cloud instances including a few TBs of\nsolid-state hard drive space. 
Hover over to see the vCPUs and RAM for each\ninstance.\n\n::: {#30e1f6dc .cell execution_count=20}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to plot monthly cost of various GCP instances\"}\ndata = {\n \"m1-megamem-40\": {\"vCPUs\": 40, \"RAM\": 961, \"cost\": 6200},\n \"m1-ultramem-80\": {\"vCPUs\": 80, \"RAM\": 1922, \"cost\": 10900},\n \"m1-ultramem-160\": {\"vCPUs\": 160, \"RAM\": 3844, \"cost\": 20100},\n \"h3-standard-88\": {\"vCPUs\": 88, \"RAM\": 352, \"cost\": 4600},\n \"c2-standard-30\": {\"vCPUs\": 30, \"RAM\": 120, \"cost\": 1600},\n \"c2-standard-60\": {\"vCPUs\": 60, \"RAM\": 240, \"cost\": 2700},\n}\n\nt = ibis.memtable(\n {\n \"name\": list(data.keys()),\n \"vCPUs\": [v[\"vCPUs\"] for v in data.values()],\n \"RAM (GBs)\": [v[\"RAM\"] for v in data.values()],\n \"cost\": [v[\"cost\"] for v in data.values()],\n }\n).order_by(\"cost\")\n\nc = px.bar(\n t,\n x=\"name\",\n y=\"cost\",\n title=\"Monthly cost (USD) of various GCP instances\",\n hover_data=[\"vCPUs\", \"RAM (GBs)\"],\n)\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nFor ~$1,600/month we can get a machine with more CPU cores and RAM than the\nlaptop benchmarked in this post. This cost assumes you're running the machine\n24/7 -- if you only needed to run a workload similar to the benchmark here,\nyou'd only need to run the machine <1 hour per day using Ibis with the default\nDuckDB backend. This can serve as a good anchor when evaluating your cost of\ncompute for data.\n\nA composable data system with Python dataframe and SQL user experiences can\nscale vertically to handle workloads into 10TB+ range with modern single-node\nOLAP query engines. If you need a distributed query engine or a better\nsingle-node query engine for your workload materializes, you can swap them out\nwithout changing your queries. However, note that with vertical scaling you're\nlikely to hit storage or network bottlenecks before compute bottlenecks on real\nworkloads.\n\n## Next steps\n\nWe'll follow up on this post once new versions that fix issues or improve\nperformance significantly are released. If you're interested in getting started\nwith Ibis, see [our tutorial](/tutorials/getting_started.qmd).\n\n", + "supporting": [ + "index_files/figure-html" + ], + "filters": [], + "includes": { + "include-in-header": [ + "\n\n\n\n\n" + ] + } + } +} \ No newline at end of file diff --git a/docs/_freeze/posts/1tbc/index/figure-html/cell-10-output-1.svg b/docs/_freeze/posts/1tbc/index/figure-html/cell-10-output-1.svg new file mode 100644 index 000000000000..01cfa175ea58 --- /dev/null +++ b/docs/_freeze/posts/1tbc/index/figure-html/cell-10-output-1.svg @@ -0,0 +1,104 @@ + + + + + + + + + +4939866050042439816 + +Limit +l_orderkey +: int64 +l_partkey +: int64 +l_suppkey +: int64 +l_linenumber +: int64 +l_quantity +: decimal(15, 2) +l_extendedprice +: decimal(15, 2) +l_discount +: decimal(15, 2) +l_tax +: decimal(15, 2) +l_returnflag +: string +l_linestatus +: string +l_shipdate +: date +l_commitdate +: date +l_receiptdate +: date +l_shipinstruct +: string +l_shipmode +: string +l_comment +: string +n +: int64 +sf +: int64 + + + +-5870359379612694500 + +ibis_read_parquet_h464bevmxnhvdekiuwqu4il224 +: +DatabaseTable +l_orderkey +: int64 +l_partkey +: int64 +l_suppkey +: int64 +l_linenumber +: int64 +l_quantity +: decimal(15, 2) +l_extendedprice +: decimal(15, 2) +l_discount +: decimal(15, 2) +l_tax +: decimal(15, 2) +l_returnflag +: string +l_linestatus +: string +l_shipdate +: date +l_commitdate +: date +l_receiptdate +: date +l_shipinstruct +: string +l_shipmode +: string +l_comment +: string +n +: int64 +sf +: int64 + + + +-5870359379612694500->4939866050042439816 + + + + + diff --git a/docs/_freeze/posts/1tbc/index/figure-html/cell-13-output-1.svg b/docs/_freeze/posts/1tbc/index/figure-html/cell-13-output-1.svg new file mode 100644 index 000000000000..4066829e3c60 --- /dev/null +++ b/docs/_freeze/posts/1tbc/index/figure-html/cell-13-output-1.svg @@ -0,0 +1,242 @@ + + + + + + + + + +-2499850609034829945 + +Limit +l_orderkey +: int64 +l_partkey +: int64 +l_suppkey +: int64 +l_linenumber +: int64 +l_quantity +: decimal(15, 2) +l_extendedprice +: decimal(15, 2) +l_discount +: decimal(15, 2) +l_tax +: decimal(15, 2) +l_returnflag +: string +l_linestatus +: string +l_shipdate +: date +l_commitdate +: date +l_receiptdate +: date +l_shipinstruct +: string +l_shipmode +: string +l_comment +: string + + + +-25412810237011383 + +Sort +l_orderkey +: int64 +l_partkey +: int64 +l_suppkey +: int64 +l_linenumber +: int64 +l_quantity +: decimal(15, 2) +l_extendedprice +: decimal(15, 
2) +l_discount +: decimal(15, 2) +l_tax +: decimal(15, 2) +l_returnflag +: string +l_linestatus +: string +l_shipdate +: date +l_commitdate +: date +l_receiptdate +: date +l_shipinstruct +: string +l_shipmode +: string +l_comment +: string + + + +-25412810237011383->-2499850609034829945 + + + + + +1983693159935793149 + +ibis_read_parquet_3jyrhic6tbc47liqzsmhwlhghm +: +DatabaseTable +l_orderkey +: int64 +l_partkey +: int64 +l_suppkey +: int64 +l_linenumber +: int64 +l_quantity +: decimal(15, 2) +l_extendedprice +: decimal(15, 2) +l_discount +: decimal(15, 2) +l_tax +: decimal(15, 2) +l_returnflag +: string +l_linestatus +: string +l_shipdate +: date +l_commitdate +: date +l_receiptdate +: date +l_shipinstruct +: string +l_shipmode +: string +l_comment +: string + + + +1983693159935793149->-25412810237011383 + + + + + +-1944786885231050517 + +l_orderkey +: +Field +:: int64 + + + +1983693159935793149->-1944786885231050517 + + + + + +-7426634216753830959 + +l_partkey +: +Field +:: int64 + + + +1983693159935793149->-7426634216753830959 + + + + + +8984118758009370914 + +l_suppkey +: +Field +:: int64 + + + +1983693159935793149->8984118758009370914 + + + + + +-6160350107390613248 + +SortKey +:: int64 + + + +-6160350107390613248->-25412810237011383 + + + + + +-3259648732239577112 + +SortKey +:: int64 + + + +-3259648732239577112->-25412810237011383 + + + + + +1906627580404737794 + +SortKey +:: int64 + + + +1906627580404737794->-25412810237011383 + + + + + +-1944786885231050517->-6160350107390613248 + + + + + +-7426634216753830959->-3259648732239577112 + + + + + +8984118758009370914->1906627580404737794 + + + + + diff --git a/docs/posts/1tbc/.gitignore b/docs/posts/1tbc/.gitignore new file mode 100644 index 000000000000..644885548704 --- /dev/null +++ b/docs/posts/1tbc/.gitignore @@ -0,0 +1,4 @@ +ibis-bench +tpch_data +results_data +bench_logs_v* diff --git a/docs/posts/1tbc/ibis-datafusion-sort.gif b/docs/posts/1tbc/ibis-datafusion-sort.gif new file mode 100644 index 000000000000..65d8edb8300e Binary files /dev/null and b/docs/posts/1tbc/ibis-datafusion-sort.gif differ diff --git a/docs/posts/1tbc/ibis-duckdb-sort.gif b/docs/posts/1tbc/ibis-duckdb-sort.gif new file mode 100644 index 000000000000..0537a6596b15 Binary files /dev/null and b/docs/posts/1tbc/ibis-duckdb-sort.gif differ diff --git a/docs/posts/1tbc/index.qmd b/docs/posts/1tbc/index.qmd new file mode 100644 index 000000000000..2d0c3a5b8b00 --- /dev/null +++ b/docs/posts/1tbc/index.qmd @@ -0,0 +1,762 @@ +--- +title: "Querying 1TB on a laptop with Python dataframes" +author: "Cody Peterson" +date: "2024-07-08" +image: ibis-duckdb-sort.gif +categories: + - benchmark + - duckdb + - datafusion + - polars +--- + +***TPC-H benchmark at `sf=1024` via DuckDB, DataFusion, and Polars on a MacBook +Pro with 96GiB of RAM.*** + +--- + +pandas requires your dataframe to fit in memory. Out-of-memory (OOM) errors are +common when working on larger datasets, though the corresponding size of data on +disk can be surprising. The creator of pandas and Ibis noted in ["Apache +Arrow and the '10 Things I Hate About +pandas'"](https://wesmckinney.com/blog/apache-arrow-pandas-internals): + +> To put it simply, **we weren’t thinking about analyzing 100 GB or 1 TB datasets +> in 2011**. [In 2017], my rule of thumb for pandas is that **you should have 5 to +> 10 times as much RAM as the size of your dataset**. So if you have a 10 GB +> dataset, you should really have about 64, preferably 128 GB of RAM if you want +> to avoid memory management problems. 
This comes as a shock to users who expect +> to be able to analyze datasets that are within a factor of 2 or 3 the size of +> their computer’s RAM. + +Today with Ibis you can reliably and efficiently process a 1TB dataset on a +laptop with <1/10th the RAM. + +:::{.callout-important} +This represents **a 50-100X improvement** in RAM requirements for Python +dataframes in just 7 years thanks to [composable data +systems](https://wesmckinney.com/blog/looking-back-15-years) and [hard work by +the DuckDB team](https://duckdb.org/2024/06/26/benchmarks-over-time). +::: + +## Exploring the data with Python dataframes + +I've generated ~1TB (`sf=1024`) of [TPC-H data](https://www.tpc.org/tpch) on my +MacBook Pro with 96 GiB of RAM. We'll start exploring it with pandas, Polars, +and Ibis and discuss where and why they start to struggle. + +:::{.callout-tip title="Generating the data" collapse="true"} +See [the previous post](../ibis-bench/index.qmd#reproducing-the-benchmark) for +instructions on generating the data. I used `bench gen-data -s 1024 -n 128`, +partitioning the data to avoid OOM errors while it generated. + +I'd recommend instead generating a smaller scale factor and copying it as many +times as needed, as generating the data at `sf=1024` can take a long time. +::: + +To follow along, install the required packages: + +```bash +pip install pandas 'ibis-framework[duckdb,datafusion]' polars-u64-idx plotly +``` + +:::{.callout-note title="Why polars-u64-idx?" collapse="true"} +We need to use `polars-u64-idx` instead of `polars` [to work with >4.2 billion +rows](https://docs.pola.rs/user-guide/installation/#big-index). +::: + +Imports and setup: + +```{python} +import os +import glob +import ibis +import pandas as pd +import polars as pl +import plotly.express as px + +px.defaults.template = "plotly_dark" +ibis.options.interactive = True +``` + +```{python} +#| code-fold: true +#| echo: false +ibis.set_backend("duckdb") +ibis.get_backend().raw_sql("PRAGMA disable_progress_bar;"); +``` + +Let's check the number of rows across all tables in the TPC-H data: + +```{python} +#| code-fold: true +#| code-summary: "Show code to get number of rows in TPC-H data" +sf = 1024 +n = 128 +data_dir = f"tpch_data/parquet/sf={sf}/n={n}" +tables = glob.glob(f"{data_dir}/*") + +total_rows = 0 + +for table in tables: + t = ibis.read_parquet(f"{table}/*.parquet") + total_rows += t.count().to_pyarrow().as_py() + +print(f"total rows: {total_rows:,}") +``` + +Over 8.8 billion rows! 
+ +We can compute and visualize the sizes of the tables in the TPC-H data (as +compressed Parquet files on disk): + +```{python} +#| code-fold: true +#| code-summary: "Show code to get sizes of tables in TPC-H data" +def get_dir_size(path): + from pathlib import Path + + return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file()) + + +sizes = [get_dir_size(table) for table in tables] +names = [os.path.basename(table) for table in tables] + +tmp = ibis.memtable({"name": names, "size": sizes}) +tmp = tmp.mutate(size_gb=tmp["size"] / (1024**3)) +tmp = tmp.mutate(size_gb_mem=tmp["size_gb"] * 11 / 5) +tmp = tmp.order_by(ibis.desc("size_gb")) + +c = px.bar( + tmp, + x="name", + y="size_gb", + title="table sizes in TPC-H data", + hover_data=["size_gb_mem"], + labels={ + "name": "table name", + "size_gb": "size (GB on-disk in compressed Parquet files)", + "size_gb_mem": "size (approximate GB in memory)", + }, +) + +print( + f"total size: {tmp['size_gb'].sum().to_pyarrow().as_py():,.2f}GBs (compressed Parquet files)" +) +c +``` + +In-memory this would be about 1TB. Uncompressed CSV files would be >1TB on disk. + +Let's explore the largest table, `lineitem`. This table in memory is ~6X larger +than RAM. + +```{python} +#| code-fold: true +#| code-summary: "Show code to explore the lineitem table" +table_name = "lineitem" +data = f"{data_dir}/{table_name}/*.parquet" + +t = ibis.read_parquet(data) +print(f"rows: {t.count().to_pyarrow().as_py():,} | columns: {len(t.columns)}") +``` + +Over 6 billion rows! + +Let's try to display the first few rows with Ibis, pandas, and Polars: + +::: {.panel-tabset} + +## Ibis + +```{python} +t = ibis.read_parquet(data) +t.head(3) +``` + +## pandas + +```{.python} +df = pd.concat([pd.read_parquet(f) for f in glob.glob(data)], ignore_index=True) # <1> +df.head(3) +``` + +1. Work around lack of reading multiple parquet files in pandas + +```html +The Kernel crashed while executing code in the current cell or a previous cell. +Please review the code in the cell(s) to identify a possible cause of the failure. +Click here for more info. +View Jupyter log for further details. +``` + +## Polars (eager) + +```{.python} +df = pl.read_parquet(data) +df.head(3) +``` + +```html +The Kernel crashed while executing code in the current cell or a previous cell. +Please review the code in the cell(s) to identify a possible cause of the failure. +Click here for more info. +View Jupyter log for further details. +``` + +## Polars (lazy) + +```{python} +df = pl.scan_parquet(data) +df.head(3).collect() +``` + +## Polars (lazy, streaming) + +```{python} +df = pl.scan_parquet(data) +df.head(3).collect(streaming=True) +``` + +::: + +Ibis, with the default backend of DuckDB, can display the first few rows. Polars +(lazy) can too in regular and streaming mode. For lazily computation, an +underlying query engine has the opportunity to determine a subset of data to be +read into memory that satisfies a given query. For example, to display any three +rows from the `lineitem` table it can just read the first three rows from the +first Parquet file in the dataset. + +Both pandas and Polars (eager) crash Python as they must load all the data into +memory to construct their dataframes. This is expected because the table in +memory ~6X larger than our 96GiB of RAM. 
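+
+To see why the lazy approaches get away with so little work, we can ask Ibis
+what it would hand to the engine. The sketch below (shown as a non-executed
+cell) uses `ibis.to_sql`; the exact SQL text depends on the backend, but
+`t.head(3)` should compile to roughly a `SELECT ... LIMIT 3` over the Parquet
+glob, which DuckDB can satisfy without scanning the full table:
+
+```{.python}
+# Compile the expression without executing it -- cheap even though `t`
+# points at ~6 billion rows.
+expr = t.head(3)
+print(ibis.to_sql(expr))  # expect a SELECT ... LIMIT 3 over the Parquet files
+```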
+ +:::{.callout-tip title="Visualize the Ibis expression tree" collapse="true"} + +```{python} +#| code-fold: true +#| code-summary: "Show code to visualize the Ibis expression tree" +from ibis.expr.visualize import to_graph + +to_graph(t.head(3)) +``` + +::: + +Let's try something more challenging: [partially +sorting](https://en.wikipedia.org/wiki/Partial_sorting) the `lineitem` table. +This forces at least some columns from all rows of data to pass through the +query engine to determine the top 3 rows per the specified ordering. Since the +data is larger than RAM, only "streaming" engines can handle this. We'll try +with the methods that worked on the previous query and add in the DataFusion +backend for Ibis. + +::: {.panel-tabset} + +## Ibis (DuckDB) + +```{.python} +ibis.set_backend("duckdb") +t = ibis.read_parquet(data) +t.order_by(t["l_orderkey"], t["l_partkey"], t["l_suppkey"]).head(3) +``` + +```{python} +#| code-fold: true +#| echo: false +ibis.set_backend("duckdb") +ibis.get_backend().raw_sql("PRAGMA disable_progress_bar;") +t = ibis.read_parquet(data) +t.order_by(t["l_orderkey"], t["l_partkey"], t["l_suppkey"]).head(3) +``` + +![CPU/RAM while Ibis with the DuckDB backend sorting](ibis-duckdb-sort.gif) + +## Ibis (DataFusion) + +```{python} +ibis.set_backend("datafusion") +t = ibis.read_parquet(data) +t.order_by(t["l_orderkey"], t["l_partkey"], t["l_suppkey"]).head(3) +``` + +![CPU/RAM while Ibis with the DataFusion backend sorting](ibis-datafusion-sort.gif) + +## Polars (lazy) + +```{.python} +df = pl.scan_parquet(data) +( + df.sort(pl.col("l_orderkey"), pl.col("l_partkey"), pl.col("l_suppkey")) + .head(3) + .collect() +) +``` + +```html +The Kernel crashed while executing code in the current cell or a previous cell. +Please review the code in the cell(s) to identify a possible cause of the failure. +Click here for more info. +View Jupyter log for further details. +``` + +![CPU/RAM while Polars with the lazy API sorting](polars-lazy-sort.gif) + +## Polars (lazy, streaming) + +```{.python} +df = pl.scan_parquet(data) +( + df.sort(pl.col("l_orderkey"), pl.col("l_partkey"), pl.col("l_suppkey")) + .head(3) + .collect(streaming=True) +) +``` + +```html +PanicException: called `Result::unwrap()` on an `Err` value: "SendError(..)" +``` + +See [GitHub +issue](https://github.com/pola-rs/polars/issues/17289#issuecomment-2200469528). + +![CPU/RAM while Polars with the lazy API, streaming engine sorting](polars-lazy-streaming-sort.gif) + +::: + +:::{.callout-tip title="Visualize the Ibis expression tree" collapse="true"} + +```{python} +#| code-fold: true +#| code-summary: "Show code to visualize the Ibis expression tree" +from ibis.expr.visualize import to_graph + +to_graph(t.order_by(t["l_orderkey"], t["l_partkey"], t["l_suppkey"]).head(3)) +``` + +::: + +Ibis with the DuckDB and DataFusion backends complete this in about 2 minutes +each. Polars (lazy) crashes the kernel after about 2 minutes with its default +mode and panics in streaming mode. + +**Streaming is an overloaded term here**. In the context of Ibis, a streaming +backend refers to a near real-time data processing engine like [Apache +Flink](https://ibis-project.org/backends/flink) or +[RisingWave](https://ibis-project.org/backends/risingwave). In the context of +Polars, streaming is a separate engine from the default that can handle +larger-than-memory data. This general paradigm is already used by DuckDB and +DataFusion, hence their ability to complete the above query. 
[The Polars team +does not recommend using their current streaming engine for +benchmarking](https://github.com/pola-rs/polars/issues/16694#issuecomment-2146668559) +and has [announced a new version of their streaming +engine](https://pola.rs/posts/announcing-polars-1/#new-engine-design). + +As we'll see in the benchmark result, some queries will fail to complete with +Polars and DataFusion. These queries are killed by the operating system due to a +lack of memory. + +:::{.callout-tip title="Sampling large datasets with Ibis" collapse="true"} +If we want to work with pandas or Polars dataframes at larger scales, we can use +Ibis to sample or filter the data (and perform any other operations) with +computation pushed to a more scalable backend. Then just output the Ibis +dataframe to pandas or Polars for downstream use: + +```{python} +#| code-fold: true +#| echo: false +ibis.set_backend("duckdb") +ibis.get_backend().raw_sql("PRAGMA disable_progress_bar;"); +``` + +:::{.panel-tabset} + +## pandas + +```{python} +t = ibis.read_parquet(data) + +df = ( + t.sample(fraction=0.0001) + .order_by(t["l_orderkey"], t["l_partkey"], t["l_suppkey"]) + .to_pandas() +) +df.head(3) +``` + +## Polars + +```{python} +t = ibis.read_parquet(data) + +df = ( + t.sample(fraction=0.0001) + .order_by(t["l_orderkey"], t["l_partkey"], t["l_suppkey"]) + .to_polars() +) +df.head(3) +``` + +::: + +We can also use this to iterate more quickly on a subset of data with Ibis to +construct our queries. Once we're happy with them, we can change one line of +code to run them on the full data. + +::: + +## 1TB TPC-H benchmark results + +Let's delve into the results of benchmarking ~1TB (`sf=1024`) TPC-H queries on a +laptop. + +:::{.callout-important title="Not an official TPC-H benchmark"} +This is not an [official TPC-H benchmark](https://www.tpc.org/tpch). We ran a +derivate of the TPC-H benchmark. +::: + +:::{.callout-warning title="Key differences from previous benchmarking"} +See [the prior benchmark post](../ibis-bench/index.qmd) for more details and key +considerations. Key differences in this iteration include: + +1. `polars-u64-idx` was used instead of `polars` +2. [Some Polars queries were + updated](https://github.com/lostmygithubaccount/ibis-bench/pull/5) +3. Parquet files were generated with `n=128` partitions + - this was done to avoid OOM errors when generating the data + - this should have little impact on the query execution time +4. Queries 18 and 21 for Polars, 9 and 18 for DataFusion were skipped + - they ran for a very long time without completing or failing + - the prior benchmark indicates these queries would likely eventually fail + +The Python package versions used were: + +- `ibis-framework==9.1.0` +- `datafusion==38.0.1` +- `duckdb==1.0.0` +- `polars-u64-idx==1.0.0` + +The three systems tested were: + +- `ibis-duckdb`: Ibis dataframe code on the DuckDB backend +- `ibis-datafusion`: Ibis dataframe code on the DataFusion backend +- `polars-lazy`: Polars (lazy API) dataframe code +::: + +To follow along, install the required packages: + +```bash +pip install 'ibis-framework[duckdb]' gcsfs plotly great-tables +``` + +The code for reading and analyzing the data is collapsed below. 
+ +```{python} +#| code-fold: true +#| code-summary: "Show code to read and analyze the benchmark data" +import ibis +import gcsfs +import plotly.express as px + +from great_tables import GT, md + +px.defaults.template = "plotly_dark" + +ibis.set_backend("duckdb") +ibis.options.interactive = True +ibis.options.repr.interactive.max_rows = 3 + +fs = gcsfs.GCSFileSystem() +ibis.get_backend().register_filesystem(fs) + +t = ( + ibis.read_parquet( + "gs://ibis-bench/1tbc/cache/file_id=*.parquet", + ) + .select( + "system", + "sf", + "n_partitions", + "query_number", + "execution_seconds", + "timestamp", + ) + .mutate(timestamp=ibis._["timestamp"].cast("timestamp")) + .order_by("system", "query_number") + .cache() +) + +systems = sorted(t.distinct(on="system")["system"].collect().to_pyarrow().as_py()) + +agg = ( + t.mutate( + run_num=ibis.row_number().over( + group_by=["system", "sf", "n_partitions", "query_number"], + order_by=["timestamp"], + ) + ) + .relocate(t.columns[:4], "run_num") + .group_by("system", "query_number", "run_num") + .agg(execution_seconds=ibis._["execution_seconds"].mean()) + .order_by("system", "query_number", "run_num") +) +agg2 = ( + agg.group_by("system", "query_number") + .agg(avg_execution_seconds=agg.execution_seconds.mean().round(2)) + .order_by("system", "query_number") +) +piv = agg2.pivot_wider( + names_from="system", values_from=["avg_execution_seconds"] +).order_by("query_number") + + +def x_vs_y(piv, x, y): + return ibis.ifelse( + piv[x] < piv[y], + -1, + 1, + ) * ( + ( + (piv[x] - piv[y]) + / ibis.ifelse( + piv[y] > piv[x], + piv[x], + piv[y], + ) + ).abs() + ).round(4) + + +comparisons = [ + ("ibis-datafusion", "ibis-duckdb"), + ("polars-lazy", "ibis-datafusion"), + ("polars-lazy", "ibis-duckdb"), +] + +comparisons = {f"{x}_v_{y}": x_vs_y(piv, x, y) for x, y in comparisons} + +piv2 = piv.mutate(**comparisons) +piv2 = piv2.order_by("query_number").relocate("query_number", systems) + +agg3 = ( + agg2.group_by("system") + .agg( + queries_completed=agg2["avg_execution_seconds"].count(), + execution_seconds=agg2["avg_execution_seconds"].sum().round(2), + seconds_per_query=agg2["avg_execution_seconds"].mean().round(2), + ) + .order_by(ibis.desc("queries_completed")) +) +agg3 +``` + +`ibis-duckdb` completed all 22/22 queries **in under 30 minutes**. If you need +to run batch data jobs on a similar amount of data, a laptop might be all you +need! + +`ibis-datafusion` only completed 17/22 queries, though recall [3 are failing due +to a bug that's already been +fixed](../ibis-bench/index.qmd#failing-datafusion-queries). A new Python release +for DataFusion hasn't been made yet, so we ran with the old version. Assuming +those queries would complete, only 2 queries would be failing due to lack of +memory. More investigation would be needed to determine the work needed for all +22 queries to pass under these conditions. + +`polars-lazy` only completed 13/22 queries, with 8 failing due lack of memory. +The [new streaming +engine](https://pola.rs/posts/announcing-polars-1/#new-engine-design) will +likely help with this. + +Let's plot execution time for each query and system: + +:::{.callout-tip title="You can de-select systems in the legend"} +It might be easier to look at 2 systems at a time. You can click on a system in +the legend of the plot to de-select it. 
+
+:::
+
+```{python}
+#| code-fold: true
+#| code-summary: "Show code to plot execution time by query and system"
+c = px.bar(
+    agg2,
+    x="query_number",
+    y="avg_execution_seconds",
+    title="Average execution time by query",
+    color="system",
+    barmode="group",
+    log_y=True,
+)
+c
+```
+
+Let's show a [Great Tables](https://github.com/posit-dev/great-tables) table of
+the pivoted data, including relative speed differences between the systems:
+
+```{python}
+#| code-fold: true
+#| code-summary: "Show code to create a Great Tables table from the pivoted, aggregated benchmark data"
+color_palette = "plasma"
+na_color = "black"
+style_color = "cyan"
+
+tbl = (
+    GT(
+        piv2.mutate(**{" ": ibis.literal("")})
+        .select(
+            "query_number",
+            *systems,
+            " ",
+            *list(comparisons.keys()),
+        )
+        .to_polars()
+    )
+    .opt_stylize(
+        style=1,
+        color=style_color,
+    )
+    .tab_header(
+        title=md("1TB (`sf=1024`) TPC-H queries"),
+        subtitle=md("*on a laptop* (MacBook Pro | Apple M2 Max | 96GiB RAM)"),
+    )
+    .tab_spanner(label="execution time (seconds)", columns=systems)
+    .tab_spanner(label=" ", columns=" ")
+    .tab_spanner(label="relative speed difference†", columns=list(comparisons))
+    .tab_source_note(
+        source_note=md(
+            "†[Relative speed difference formula](https://docs.coiled.io/blog/tpch#measurements), with negative values indicating A was faster than B for A_v_B"
+        )
+    )
+    .tab_source_note(
+        source_note=md(
+            "Benchmark results source data (public bucket): `gs://ibis-bench/1tbc/cache/file_id=*.parquet`"
+        )
+    )
+    .fmt_percent(list(comparisons), decimals=2, scale_values=True)
+    .data_color(
+        columns=systems,
+        domain=[0, agg2["avg_execution_seconds"].max().to_pyarrow().as_py()],
+        palette=color_palette,
+        na_color=na_color,
+    )
+    .data_color(
+        columns=" ",
+        palette=["#333333", "#333333"],
+    )
+    .data_color(
+        columns=list(comparisons),
+        domain=[
+            min(
+                [piv2[c].min().to_pyarrow().as_py() for c in list(comparisons)],
+            ),
+            max(
+                [piv2[c].max().to_pyarrow().as_py() for c in list(comparisons)],
+            ),
+        ],
+        palette=color_palette,
+        na_color=na_color,
+    )
+)
+tbl
+```
+
+You can use the code above to further explore and visualize the results.
+
+## Why does this matter?
+
+The ability to run all 22 queries of a 1TB TPC-H benchmark on a relatively
+standard laptop with minimal setup represents a significant shift in the Python
+data ecosystem that benefits individual data practitioners and organizations.
+
+### Scale up, then scale out
+
+Distributed systems are hard and introduce complexity for data workloads. While
+distributed OLAP query engines have their place, the threshold for choosing them
+over a single-node OLAP query engine has risen drastically over the last few
+years. You can [see how much DuckDB has improved over the
+years](https://duckdb.org/2024/06/26/benchmarks-over-time), and it shows in this
+benchmark.
+
+It's a good idea to start with a single node and see how far you can get. You'll
+need to weigh the tradeoffs for your own situation to make a decision. With
+Ibis, you can write your queries once and try them on different engines to see
+which is best for your workload.
+
+### Composable data systems are here
+
+Ibis separates the query from the engine. It translates dataframe code into an
+intermediate representation (IR) in the backend's native language -- often SQL,
+sometimes other Python dataframe code. This separation allows you **to use a
+single dataframe API for the best engine(s) across your workload(s)**.
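+
+As a simplified sketch of that idea, the same Ibis expression can be handed to
+different backends by changing only the connection. The file path and column
+names below are illustrative (TPC-H `lineitem` columns), not a prescribed
+workflow:
+
+```python
+import ibis
+
+
+# dataframe code written once...
+def top_parts(t):
+    return (
+        t.group_by("l_partkey")
+        .agg(total_qty=t["l_quantity"].sum())
+        .order_by(ibis.desc("total_qty"))
+        .limit(5)
+    )
+
+
+# ...run on whichever engine suits the workload
+for con in [ibis.duckdb.connect(), ibis.datafusion.connect()]:
+    t = con.read_parquet("lineitem.parquet")  # placeholder path
+    print(top_parts(t).to_pandas())
+```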
+
+If you need to analyze data in
+[Postgres](https://ibis-project.org/backends/postgres), you can use Ibis. If you
+need to [speed that up with
+DuckDB](https://duckdb.org/2022/09/30/postgres-scanner.html), you can [use
+Ibis](https://ibis-project.org/backends/duckdb#ibis.backends.duckdb.Backend.read_postgres).
+If you need to scale out with [Dask](https://ibis-project.org/backends/dask) or
+[PySpark](https://ibis-project.org/backends/pyspark) or
+[Trino](https://ibis-project.org/backends/trino), you can use Ibis. If you need
+to [scale out on distributed GPUs, you can use
+Ibis](../why-voda-supports-ibis/index.qmd). If another query engine comes along
+and is best for your workload, you can probably use Ibis. New backends are
+fairly easy to add!
+
+### It's efficient
+
+How much money does your organization spend on data transformation per terabyte?
+Using [the GCP pricing calculator](https://cloud.google.com/products/calculator),
+we can estimate the monthly cost of a few cloud instances, including a few TBs
+of solid-state drive space. Hover over each bar to see the vCPUs and RAM for
+that instance.
+
+```{python}
+#| code-fold: true
+#| code-summary: "Show code to plot monthly cost of various GCP instances"
+data = {
+    "m1-megamem-40": {"vCPUs": 40, "RAM": 961, "cost": 6200},
+    "m1-ultramem-80": {"vCPUs": 80, "RAM": 1922, "cost": 10900},
+    "m1-ultramem-160": {"vCPUs": 160, "RAM": 3844, "cost": 20100},
+    "h3-standard-88": {"vCPUs": 88, "RAM": 352, "cost": 4600},
+    "c2-standard-30": {"vCPUs": 30, "RAM": 120, "cost": 1600},
+    "c2-standard-60": {"vCPUs": 60, "RAM": 240, "cost": 2700},
+}
+
+t = ibis.memtable(
+    {
+        "name": list(data.keys()),
+        "vCPUs": [v["vCPUs"] for v in data.values()],
+        "RAM (GBs)": [v["RAM"] for v in data.values()],
+        "cost": [v["cost"] for v in data.values()],
+    }
+).order_by("cost")
+
+c = px.bar(
+    t,
+    x="name",
+    y="cost",
+    title="Monthly cost (USD) of various GCP instances",
+    hover_data=["vCPUs", "RAM (GBs)"],
+)
+c
+```
+
+For ~$1,600/month we can get a machine with more CPU cores and RAM than the
+laptop benchmarked in this post. That cost assumes you're running the machine
+24/7 -- for a workload like the benchmark here, you'd only need to run the
+machine for less than an hour per day using Ibis with the default DuckDB
+backend. This can serve as a good anchor when evaluating your cost of compute
+for data.
+
+A composable data system with Python dataframe and SQL user experiences can
+scale vertically to handle workloads into the 10TB+ range with modern
+single-node OLAP query engines. If you need a distributed query engine, or a
+better single-node query engine for your workload emerges, you can swap it in
+without changing your queries. Note, however, that with vertical scaling you're
+likely to hit storage or network bottlenecks before compute bottlenecks on real
+workloads.
+
+## Next steps
+
+We'll follow up on this post once new releases fix issues or significantly
+improve performance. If you're interested in getting started with Ibis, see
+[our tutorial](/tutorials/getting_started.qmd).
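+
+As a minimal taste of what that looks like, a first session with the default
+DuckDB backend might be as small as the sketch below (the file name and column
+name are placeholders):
+
+```python
+import ibis
+
+ibis.options.interactive = True
+
+# ibis.read_parquet uses the default DuckDB backend
+t = ibis.read_parquet("your_data.parquet")  # placeholder path
+t.group_by("some_column").agg(n=t.count()).order_by(ibis.desc("n"))
+```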
diff --git a/docs/posts/1tbc/polars-lazy-sort.gif b/docs/posts/1tbc/polars-lazy-sort.gif new file mode 100644 index 000000000000..b0be683b2a04 Binary files /dev/null and b/docs/posts/1tbc/polars-lazy-sort.gif differ diff --git a/docs/posts/1tbc/polars-lazy-streaming-sort.gif b/docs/posts/1tbc/polars-lazy-streaming-sort.gif new file mode 100644 index 000000000000..b92cef22af8f Binary files /dev/null and b/docs/posts/1tbc/polars-lazy-streaming-sort.gif differ