docs: blog for the 1 billion row challenge #8004
docs/posts/1brc/index.qmd
Outdated
.agg(
    min_temp=ibis._.temperature.min(),
    mean_temp=ibis._.temperature.mean(),
    max_temp=ibis._.temperature.max(),
good place to show off selectors:
.agg(s.across(_.temp, {"min": _.min(), "mean": _.mean(), "max": _.max()}))
not sure I've written this correctly but getting an error:
TypeError                                 Traceback (most recent call last)
/Users/cody/repos/ibis/docs/posts/1brc/index.qmd in line 6
    288 t = ibis.read_csv("1brc/data/measurements.txt", **kwargs)
    289 res = (
    290     t
    291     .group_by(ibis._.station)
    292     .agg(
--> 293         s.across(
    294             ibis._.temperature,
    295             {
    296                 "min": ibis._.min(),
    297                 "mean": ibis._.mean(),
    298                 "max": ibis._.max(),
    299             },
    300         )
    301     )
    302     .order_by(ibis._.station.desc())
    303 )
    305 res

File ~/repos/ibis/ibis/selectors.py:498, in across(selector, func, names)
    496 funcs = dict(func if isinstance(func, Mapping) else {None: func})
    497 if not isinstance(selector, Selector):
--> 498     selector = c(*util.promote_list(selector))
...
--> 396 names = frozenset(col if isinstance(col, str) else col.get_name() for col in names)
    398 def func(col: ir.Value) -> bool:
    399     schema = col.op().table.schema

TypeError: unhashable type: 'Deferred'
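Reading the traceback, `across` wraps anything that is not a `Selector` in `c(...)`, and `c` keeps strings as-is but calls `.get_name()` on everything else. With `ibis._.temperature` that call yields another `Deferred`, which cannot go into the `frozenset`. A minimal sketch of a possible workaround, assuming the same `s.across` signature as in the traceback, is to pass the column name as a string. The helper name `aggregate_temperatures` is hypothetical, and this has not been run against the Ibis version used in the PR.

```python
def aggregate_temperatures(t):
    """Min/mean/max temperature per station via a selector.

    Sketch only: `t` is assumed to be an Ibis table with `station` and
    `temperature` columns, as in the post.
    """
    # Imported lazily so that defining this helper does not require Ibis.
    import ibis.selectors as s
    from ibis import _

    return (
        t.group_by(_.station)
        .agg(
            s.across(
                "temperature",  # a column name (str), not the Deferred `_.temperature`
                {"min": _.min(), "mean": _.mean(), "max": _.max()},
            )
        )
        .order_by(_.station.desc())
    )
```

Usage would be `res = aggregate_temperatures(t)` with `t` as returned by `ibis.read_csv(...)` in the post.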
docs/posts/1brc/index.qmd
Outdated
    header=False,
    columns={"station": "VARCHAR", "temperature": "DOUBLE"},
)
elif ibis.get_backend().name == "polars":
Can you instead collect the kwargs into a dict and then have a single call to `ibis.read_csv`? It's pretty noisy with the repetition.
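One way this could look, as a rough sketch: keep the per-backend options in a single mapping and look them up by `ibis.get_backend().name`. The keys shown for DuckDB and Polars are the ones visible in this thread's diff excerpts, except `delim`, which is an assumption, and the Polars `schema` entry, which is left out here so the snippet does not need a `polars` import. `READ_CSV_KWARGS`, `kwargs_for`, and `read_measurements` are hypothetical names.

```python
# Backend-specific CSV options, collected in one place.
READ_CSV_KWARGS = {
    "duckdb": {
        "delim": ";",  # assumption: not visible in the diff excerpt
        "header": False,
        "columns": {"station": "VARCHAR", "temperature": "DOUBLE"},
    },
    "polars": {
        "separator": ";",
        "has_header": False,
        "new_columns": ["station", "temperature"],
        # The PR also passes schema={"station": pl.Utf8, "temperature": pl.Float64}.
    },
}


def kwargs_for(backend_name: str) -> dict:
    """Return a copy of the CSV options for a backend, or fail with a clear message."""
    try:
        return dict(READ_CSV_KWARGS[backend_name])
    except KeyError:
        raise ValueError(f"no read_csv kwargs defined for backend {backend_name!r}") from None


def read_measurements(path: str = "1brc/data/measurements.txt"):
    """Single call to ibis.read_csv, whatever the current backend is."""
    import ibis  # imported lazily; only needed when actually reading

    return ibis.read_csv(path, **kwargs_for(ibis.get_backend().name))
```

Adding another backend then means adding one entry to the mapping rather than another `elif` branch with its own `read_csv` call.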
docs/posts/1brc/index.qmd
Outdated
```

```{python}
t
You're already showing `t` in the previous block of code.
Dropping a couple of links I found in the Python submissions that might help as a reference for what people did:
Here are the Dask and Spark solutions. I don't know if we want to get into trying to run those with Ibis, but just in case: gunnarmorling/1brc#450 (comment)
docs/posts/1brc/index.qmd
Outdated
separator=";",
has_header=False,
new_columns=["station", "temperature"],
schema={"station": pl.datatypes.Utf8, "temperature": pl.datatypes.Float64},
- schema={"station": pl.datatypes.Utf8, "temperature": pl.datatypes.Float64},
+ schema={"station": pl.Utf8, "temperature": pl.Float64},
Interesting. I haven't used Polars much; I was just reading the thread on what people tried. It looks like Ritchie chimed in on that thread explaining why
this code is so simple I think we should try to show it across as many backends as possible: DuckDB, Polars, DataFusion, SQLite (?), Postgres, ClickHouse should all work pretty easily? then we can have a fancy title like "Using one Python dataframe API to take the billion row challenge with DuckDB, Polars, DataFusion, ClickHouse, SQLite, Postgres"
I would show it across the columnar local backends, only to avoid having to write a bunch of ingestion code for backends that don't support
so far:
@lostmygithubaccount what are the specs of the machine you are running this on? The Polars results reported here for an M1 with 32GB RAM are way different: https://github.com/ifnesi/1brc/tree/main#performance-on-a-macbook-pro-m1-32gb
also a MacBook Pro M1 32GB. Just pushed duckdb/polars/datafusion; need to be away for a bit, will try to finish this up later. Unclear how to pass in the right kwargs for ClickHouse right now.
docs/posts/1brc/index.qmd
Outdated
}

# kwargs = duckdb_kwargs if ibis.get_backend().name == "duckdb" else polars_kwargs
match ibis.get_backend().name:
I think this is my first use of a match statement
docs/posts/1brc/index.qmd
Outdated

::: {.panel-tabset}

## DuckDb
- ## DuckDb
+ ## DuckDB
arggghhhh thanks...rerendering for another 20 minutes 😂
I think the post looks great, @lostmygithubaccount. The last small comment I have is whether the conclusion section should come before the Bonus content. I think the Bonus is neat, but it feels disconnected from the rest, and people might skip it and not read the conclusion. But this is just an opinion, feel free to ignore.
I think this is good to merge today
LGTM
Description of changes
work in progress, just throwing up the code
Issues closed