docs(presentations): positconf 2024 talk (#9822)
cpcloud authored Aug 13, 2024
1 parent 1920c8d commit e8b89e7
Showing 4 changed files with 316 additions and 0 deletions.
12 changes: 12 additions & 0 deletions docs/presentations/positconf2024/custom.scss
@@ -0,0 +1,12 @@
/*-- scss:rules --*/
.reveal div.sourceCode {
  font-size: 2.4rem !important;
}

.cell-output-display {
  font-size: 2.2rem !important;
  display: block;
  margin-left: 30%;
  margin-right: 25%;
  margin-top: 2.5%;
}
Binary file added docs/presentations/positconf2024/fine.jpg
304 changes: 304 additions & 0 deletions docs/presentations/positconf2024/talk.qmd
@@ -0,0 +1,304 @@
---
title: "Test 20 databases on every commit"
execute:
  echo: true
format:
  revealjs:
    theme: [default, custom.scss]
    footer: <https://ibis-project.org/presentations/positconf2024/talk>
---

# Let's all stand!

## Sit if you work with…

::: {.incremental}
- 0 DBs ✅
- 1 DB 😇
- 2 DBs 😬
- 5+ DBs 😱
:::

::: {.fragment}
::: {.r-fit-text}
_I feel your pain._
:::
:::

## Who?

:::: {.columns}

::: {.column width="50%"}
### Me

- Phillip Cloud
- Ibis project
- Voltron Data
- Data tools for 10+ years
:::

::: {.column width="50%"}
### Where

- {{< fa brands github >}} [`@cpcloud`](https://github.com/cpcloud)
- {{< fa brands youtube >}} [Phillip in the Cloud](https://www.youtube.com/@cpcloud)
- {{< fa brands twitter >}} [`@cpcloudy`](https://x.com/cpcloudy)
:::

::::

# Ever needed to test a complex system?

## Maybe this is you

![](../pycon2024/docker-eye-roll.gif){fig-align="center"}

## Or this

![](../pycon2024/wonka.png){fig-align="center"}

## Or maybe even this

![](https://storage.googleapis.com/posit-conf-2024/fine.jpg){fig-align="center"}

# A complex system: Ibis

![](../../logo.svg){fig-align="center" width="50%" height="50%"}

## What's Ibis?

- Python library
- Exploratory data analysis
- Data engineering
- ML preprocessing

::: {.fragment}
::: {.r-fit-text}
_dbplyr, but Python_
:::
:::

## One API, 20+ backends {.smaller .scrollable}

```{python}
#| code-fold: true
#| echo: false
import ibis
ibis.options.interactive = True
t = ibis.examples.penguins.fetch()
t.to_parquet("penguins.parquet")
```

::: {.panel-tabset}

## DuckDB

```{python}
con = ibis.connect("duckdb://")
t = con.read_parquet("penguins.parquet")
t.group_by("species", "island").agg(count=t.count()).order_by("count")
```

## Polars

```{python}
#| code-line-numbers: "1,1"
con = ibis.connect("polars://")
t = con.read_parquet("penguins.parquet")
t.group_by("species", "island").agg(count=t.count()).order_by("count")
```

## DataFusion

```{python}
#| code-line-numbers: "1,1"
con = ibis.connect("datafusion://")
t = con.read_parquet("penguins.parquet")
t.group_by("species", "island").agg(count=t.count()).order_by("count")
```

## PySpark

```{python}
#| code-line-numbers: "1,1"
con = ibis.connect("pyspark://")
t = con.read_parquet("penguins.parquet")
t.group_by("species", "island").agg(count=t.count()).order_by("count")
```

## 16+ other DBs

![](../pycon2024/machine.gif){fig-align="center" width="100%" height="100%"}

:::

# Why is this hard to test?

## By the numbers {.smaller}

:::: {.columns}
::: {.column width="50%"}
### Backends
- **17** SQL
- **3** non-SQL
- **2** cloud
:::

::: {.column width="50%"}
### Engines + APIs
- **9** distributed SQL
- **3** dataframe
- oldest: **~45** years 👀
- youngest: **~2** years
:::
::::

### Other facts

- Latency is variable
- Deployment models vary

::: {.fragment}
::: {.r-fit-text}
_**Feature development**_
:::
:::

## Bit of a pickle

![](../pycon2024/picklerick.png)

# How

## High level

### Goal: fast iteration

- fast env setup (dependency management)
- fast(ish) tests (test-running library)
- high **job** concurrency (ci/provider)
- **easy to run**: dev speed ([`just`](https://github.com/casey/just))

::: {.fragment}
::: {.r-fit-text}
_CI must complete "quickly"_
:::
:::

## Tools: overview

- 📦 **deps**: _poetry_
- 🖥️ **ci**: _GitHub Actions_
- 🦁 **"big" backends**: _docker_
- 🐱 **"small" backends**: _no special treatment (duckdb, polars)_
- 🏃 **tasks**: [`just`](https://github.com/casey/just) (e.g.: `just up postgres`)

## Tools: poetry

- **Env setup must be _fast_**: no constraint solving
- Poetry is one way; there are others
- Get yourself a lockfile
- Downsides?

::: {.fragment}
::: {.r-fit-text}
_Are you doing that **now**?_
:::
:::

## Tools: docker

- Do you use it locally?
- Use health checks; "dumb" ones are fine
- Make it easy for devs to use

![](https://storage.googleapis.com/posit-conf-2024/terminal.png){fig-align="center"}
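
A "dumb" health check can be as little as "does the port accept a connection yet?". The sketch below shows that idea from the test-runner side in Python; it is not taken from the Ibis repository, and the function name, defaults, and polling interval are all assumptions made for illustration.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect is the whole check: no query, no auth.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable): wait briefly and retry.
            time.sleep(interval)
    return False
```

A caller might run something like `wait_for_port("localhost", 5432)` before starting tests against a containerized database. A check like this only proves the port is open, not that the database is ready for queries, which is often good enough to avoid the worst startup races.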

## Tools: GitHub Actions {.smaller}

- Pay for [the Teams plan](https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits) to get more concurrency
- Automate dependency updates

::: {.columns}
::: {.column width="50%"}
### GHA limits

![](../pycon2024/gha.png)
:::

::: {.column width="50%"}
### Ibis CI cost

![](../pycon2024/bill.png)
:::
:::
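
To see why job concurrency matters, here is a rough back-of-the-envelope sketch in Python. It assumes a simple greedy model in which each job starts on whichever runner frees up first and queue time is ignored; the job counts and durations are hypothetical, not Ibis CI measurements or GitHub plan limits.

```python
import heapq


def wall_clock_minutes(job_minutes: list[float], concurrency: int) -> float:
    """Greedy estimate: each job starts on whichever runner frees up first."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    runners = [0.0] * concurrency  # time at which each runner becomes free
    heapq.heapify(runners)
    for minutes in job_minutes:
        start = heapq.heappop(runners)  # earliest available runner
        heapq.heappush(runners, start + minutes)
    return max(runners)  # the last runner to finish sets the wall-clock time


# Hypothetical workload: 40 jobs of 10 minutes each.
jobs = [10.0] * 40
print(wall_clock_minutes(jobs, concurrency=20))  # 20.0
print(wall_clock_minutes(jobs, concurrency=40))  # 10.0
```

Under this model, doubling the concurrency limit halves the wall-clock time until every job has its own runner; after that, only making individual jobs shorter helps.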

# How does this stack up?

## Terminology

::: {.fragment}
Job
: a set of commands

```yaml
my_job:
  - run: pip install ibis-framework
  - run: just ci-check -m ${{ matrix.backend.name }}
  - run: coverage upload
```
:::
::: {.fragment}
Workflow
: A collection of jobs, one `.yml` file

```yaml
name: Backends
my_job:
  - run: ...
my_other_job:
  - run: ...
```
:::
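
The `${{ matrix.backend.name }}` expression in the job above means the same job definition runs once per matrix entry. A GitHub Actions matrix expands into one job per combination of its axes, which the following Python sketch imitates. The backend names come from this talk; the Python versions and the flat shape of the matrix are assumptions for illustration only.

```python
from itertools import product

# Hypothetical matrix: backend names appear in the talk, the Python versions are made up.
matrix = {
    "backend": ["duckdb", "polars", "datafusion", "pyspark"],
    "python": ["3.10", "3.12"],
}


def expand(matrix: dict[str, list[str]]) -> list[dict[str, str]]:
    """Return one dict per job: the Cartesian product of all matrix axes."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]


jobs = expand(matrix)
print(len(jobs))  # 8
print(jobs[0])    # {'backend': 'duckdb', 'python': '3.10'}
```

Every extra axis multiplies the job count, which is why the number of concurrent jobs a plan allows quickly becomes the limiting factor.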

## Job metrics

![](https://storage.googleapis.com/posit-conf-2024/jobs.svg){fig-align="center"}

::: {.fragment}
::: {.r-fit-text}
_We've added 3 or 4 new backends since the switch_
:::
:::

## Workflow metrics

![Queue time and workflow duration](https://storage.googleapis.com/posit-conf-2024/workflows.svg){fig-align="center"}

## Workflow metrics {auto-animate=true}

![](https://storage.googleapis.com/posit-conf-2024/workflowscorr.svg){fig-align="center"}

## Workflow metrics {auto-animate=true}

![](https://storage.googleapis.com/posit-conf-2024/workflowscorr.svg){fig-align="center"}

- 🟢 Queues + workflows correlated
- 🟡 Queues slow + workflows fast: not enough concurrency
- 🟡 Queues fast + workflows slow: jobs doing too much
- 🔴 Queues slow + workflows slow: hard to say
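
The three "slow" cases above can be turned into a small triage helper. This is a minimal Python sketch, not something from the Ibis tooling: the thresholds for "slow" are hypothetical and would need to be tuned against a project's own queue and workflow history.

```python
def diagnose(
    queue_minutes: float,
    workflow_minutes: float,
    queue_slow_at: float = 5.0,      # hypothetical threshold
    workflow_slow_at: float = 30.0,  # hypothetical threshold
) -> str:
    """Map one workflow run onto the slow-queue / slow-workflow cases."""
    queue_slow = queue_minutes >= queue_slow_at
    workflow_slow = workflow_minutes >= workflow_slow_at
    if queue_slow and workflow_slow:
        return "both slow: hard to say"
    if queue_slow:
        return "not enough concurrency"
    if workflow_slow:
        return "jobs doing too much"
    return "no bottleneck flagged"


print(diagnose(queue_minutes=12, workflow_minutes=8))   # not enough concurrency
print(diagnose(queue_minutes=1, workflow_minutes=45))   # jobs doing too much
```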

# Summary

- Testing complex projects is possible
- Use docker for dev **and** prod
- Don't SAT solve in CI
- Track CI run durations, workflow metrics
- Spend time on dev ex

# Questions?
Binary file added docs/presentations/positconf2024/terminal.png
