docs: blog for the 1 billion row challenge #8004

Merged
merged 13 commits into from
Jan 22, 2024
1 change: 1 addition & 0 deletions docs/posts/1brc/.gitignore
@@ -0,0 +1 @@
1brc
111 changes: 111 additions & 0 deletions docs/posts/1brc/index.qmd
@@ -0,0 +1,111 @@
---
title: "1 billion row challenge with Ibis and DuckDB"
author: ""
date: "2024-01-22"
categories:
- blog
- duckdb
---

## Overview

This is a redux of [The One Billion Row Challenge](https://www.morling.dev/blog/one-billion-row-challenge/) using Ibis.

Start by cloning the challenge repository, https://github.com/gunnarmorling/1brc:

```{.bash}
gh repo clone gunnarmorling/1brc
```

```{.bash}
cd 1brc/src/main/python
python create_measurements.py 1_000_000_000
```

```{.python}
import ibis
import polars as pl
import pyarrow as pa

#ibis.set_backend("polars")
#ibis.set_backend("duckdb")
ibis.set_backend("datafusion")
ibis.options.interactive = True
```

```{.python}
duckdb_kwargs = {
    "delim": ";",
    "header": False,
    "columns": {"station": "VARCHAR", "temperature": "DOUBLE"},
}

polars_kwargs = {
    "separator": ";",
    "has_header": False,
    "new_columns": ["station", "temperature"],
    "schema": {"station": pl.Utf8, "temperature": pl.Float64},
}

datafusion_kwargs = {
    "delimiter": ";",
    "has_header": False,
    "schema": pa.schema(
        [
            (
                "station",
                pa.string(),
            ),
            (
                "temperature",
                pa.float32(),
            ),
        ]
    ),
    "file_extension": ".txt",
}

clickhouse_kwargs = {
    "format": "CSV",
    "types": {"station": "String", "temperature": "Float64"},
}

# kwargs = duckdb_kwargs if ibis.get_backend().name == "duckdb" else polars_kwargs
match ibis.get_backend().name:
    # Review comment (Member Author): I think this is my first use of a match statement
    case "duckdb":
        kwargs = duckdb_kwargs
    case "polars":
        kwargs = polars_kwargs
    case "datafusion":
        kwargs = datafusion_kwargs


kwargs
```

```{.python}
t = ibis.read_csv("1brc/data/measurements.txt", **kwargs)
t
# Review comment (Member): You're already showing t in the previous block of code.
```

```{.python}
f"{t.count().to_pandas():,}"
```
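The `:,` format specifier used above inserts thousands separators. For reference, here is the same formatting applied to a plain integer standing in for the row count (the literal value is an assumption based on the challenge size, not a measured result):

```python
row_count = 1_000_000_000  # stand-in for the value returned by t.count().to_pandas()
formatted = f"{row_count:,}"
print(formatted)
```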

```{python}
import time

t1 = time.time()
res = (
    t.group_by(ibis._.station)
    .agg(
        min_temp=ibis._.temperature.min(),
        mean_temp=ibis._.temperature.mean(),
        max_temp=ibis._.temperature.max(),
        # Review comment (Member): good place to show off selectors:
        #
        #     .agg(s.across(_.temp, {"min": _.min(), "mean": _.mean(), "max": _.max()}))
        #
        # Reply (Member Author): not sure I've written this correctly but getting an error:
        #
        # TypeError                                 Traceback (most recent call last)
        # /Users/cody/repos/ibis/docs/posts/1brc/index.qmd in line 6
        #       288 t = ibis.read_csv("1brc/data/measurements.txt", **kwargs)
        #       289 res = (
        #       290     t
        #       291     .group_by(ibis._.station)
        #       292     .agg(
        # ----> 293         s.across(
        #       294             ibis._.temperature,
        #       295             {
        #       296                 "min": ibis._.min(),
        #       297                 "mean": ibis._.mean(),
        #       298                 "max": ibis._.max(),
        #       299             },
        #       300         )
        #       301     )
        #       302     .order_by(ibis._.station.desc())
        #       303 )
        #       305 res
        #
        # File ~/repos/ibis/ibis/selectors.py:498, in across(selector, func, names)
        #     496 funcs = dict(func if isinstance(func, Mapping) else {None: func})
        #     497 if not isinstance(selector, Selector):
        # --> 498     selector = c(*util.promote_list(selector))
        # ...
        # --> 396     names = frozenset(col if isinstance(col, str) else col.get_name() for col in names)
        #     398     def func(col: ir.Value) -> bool:
        #     399         schema = col.op().table.schema
        #
        # TypeError: unhashable type: 'Deferred'
    )
    .order_by(ibis._.station.desc())
)
print(res)
t2 = time.time()
t2 - t1
```
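To sanity-check the query on a handful of rows, the same per-station min/mean/max aggregation can be sketched in plain Python. This is only an illustration of what the query computes, not how any backend implements it, and it is far too slow for the full file. The `aggregate` helper and the sample rows are made up for this example; the `station;temperature` layout follows the delimiter and column names in the reader options above.

```python
import csv


def aggregate(lines):
    """Return {station: (min, mean, max)} for 'station;temperature' rows."""
    stats = {}  # station -> [min, max, running sum, row count]
    for row in csv.reader(lines, delimiter=";"):
        if not row:  # skip blank lines
            continue
        station, temp = row[0], float(row[1])
        if station not in stats:
            stats[station] = [temp, temp, temp, 1]
        else:
            entry = stats[station]
            entry[0] = min(entry[0], temp)
            entry[1] = max(entry[1], temp)
            entry[2] += temp
            entry[3] += 1
    # Sort stations in descending order, mirroring order_by(station.desc())
    return {
        name: (lo, total / count, hi)
        for name, (lo, hi, total, count) in sorted(stats.items(), reverse=True)
    }


# Made-up sample rows, not taken from the real measurements file
sample = [
    "Hamburg;12.0",
    "Bulawayo;8.9",
    "Hamburg;34.2",
    "Palembang;38.8",
    "Bulawayo;10.1",
]
result = aggregate(sample)
for station, (lo, mean, hi) in result.items():
    print(f"{station}: min={lo:.1f} mean={mean:.1f} max={hi:.1f}")
```

Comparing this output against the Ibis result for the same few rows is a quick way to confirm that the column names and delimiter settings are being read as intended.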