Mindsdb integration (catherinedevlin#80)
* Moved integrations into a section + added mindsdb

* Fixing toc issue

* Editing guide

* Review fixes

* Skip execution + ipynb format

idomic authored Jan 23, 2023
1 parent 8a42ec4 commit 16bd4e6
Showing 5 changed files with 810 additions and 28 deletions.
1 change: 1 addition & 0 deletions doc/_config.yml
@@ -11,6 +11,7 @@ execute:

  exclude_patterns:
    - 'howto/*-connect.ipynb'
    - 'integrations/mindsdb.ipynb'

# Define the name of the latex output file for PDF builds
latex:
19 changes: 9 additions & 10 deletions doc/_toc.yml
@@ -9,25 +9,24 @@ parts:
      - file: intro
      - file: connecting
      - file: plot
      - file: plot-large
      - file: csv
      - file: compose
      - file: plot-legacy

  - caption: Integrations
    chapters:
      - file: integrations/duckdb
      - file: integrations/pandas
      - file: integrations/mindsdb

  - caption: API Reference
    chapters:
      - file: api/magic-sql
      - file: api/magic-plot
      - file: api/magic-render
      - file: api/configuration
      - file: api/python

  - caption: How-To
    chapters:
      - file: howto
      - file: howto/postgres-install
      - file: howto/postgres-connect

  - caption: Community
    chapters:
105 changes: 87 additions & 18 deletions doc/duckdb.md → doc/integrations/duckdb.md
@@ -159,10 +159,14 @@ SELECT * FROM track LIMIT 5

## Plotting large datasets

```{versionadded} 0.5.2
```

```{note}
This is a beta feature; please [join our community](https://ploomber.io/community) and let us know what plots we should add next!
```

This section demonstrates how we can efficiently plot large datasets with DuckDB and JupySQL without blowing up our machine's memory. `%sqlplot` performs all aggregations in DuckDB.

Let's install the required package:

@@ -191,17 +195,39 @@
In total, this contains more than 4.6M observations:

```{code-cell} ipython3
%%sql
SELECT count(*) FROM 'yellow_tripdata_2021-*.parquet'
```

Now, let's keep track of how much memory this Python session is using:

```{code-cell} ipython3
import psutil
import os


def memory_usage():
    """Print how much memory we're using"""
    process = psutil.Process(os.getpid())
    # resident set size (RSS) in bytes, converted to GB
    total = process.memory_info().rss / 10 ** 9
    print(f'Using: {total:.1f} GB')
```

```{code-cell} ipython3
memory_usage()
```

Let's use JupySQL to get a histogram of `trip_distance` across all the files:

```{code-cell} ipython3
%sqlplot histogram --table yellow_tripdata_2021-*.parquet --column trip_distance --bins 50
```
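
`%sqlplot` only pulls the aggregated bin counts into Python; the heavy lifting stays in the database. Here is a minimal sketch of that idea using the `duckdb` package directly (illustrative only, not the exact query JupySQL generates):

```python
import duckdb
import matplotlib.pyplot as plt

con = duckdb.connect()

# aggregate in DuckDB: one row per 1-mile bin instead of millions of raw rows
bins = con.execute("""
    SELECT floor(trip_distance) AS bin, count(*) AS n
    FROM 'yellow_tripdata_2021-*.parquet'
    WHERE trip_distance IS NOT NULL
    GROUP BY bin
    ORDER BY bin
""").fetchall()

xs, counts = zip(*bins)
plt.bar(xs, counts)
```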

We have some outliers; let's find the 99th percentile:

```{code-cell} ipython3
%%sql
SELECT percentile_disc(0.99) WITHIN GROUP (ORDER BY trip_distance)
FROM 'yellow_tripdata_2021-*.parquet'
```
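
If you want to double-check that number outside the magic, the same ordered-set aggregate runs directly through the `duckdb` package (a sketch against the same files):

```python
import duckdb

con = duckdb.connect()
cutoff = con.execute("""
    SELECT percentile_disc(0.99) WITHIN GROUP (ORDER BY trip_distance)
    FROM 'yellow_tripdata_2021-*.parquet'
""").fetchone()[0]
print(cutoff)  # roughly 18.93 for these files
```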

@@ -214,35 +240,78 @@
FROM 'yellow_tripdata_2021-*.parquet'
WHERE trip_distance < 18.93
```

### Histogram

Now we create a new histogram, this time without the outliers:

```{code-cell} ipython3
%sqlplot histogram --table no_outliers --column trip_distance --bins 50 --with no_outliers
```
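
The `--with no_outliers` argument tells JupySQL to inline the stored `no_outliers` definition, so both the filtering and the binning run inside DuckDB. Conceptually, the generated query looks something like this (a hypothetical sketch; the SQL JupySQL actually emits may differ):

```python
import duckdb

# the snippet is prepended as a CTE, then aggregated as in the earlier sketch
con = duckdb.connect()
con.execute("""
    WITH no_outliers AS (
        SELECT *
        FROM 'yellow_tripdata_2021-*.parquet'
        WHERE trip_distance < 18.93
    )
    SELECT floor(trip_distance) AS bin, count(*) AS n
    FROM no_outliers
    GROUP BY bin
    ORDER BY bin
""").fetchall()
```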

### Boxplot

```{code-cell} ipython3
%sqlplot boxplot --table no_outliers --column trip_distance --with no_outliers
```

Let's measure memory usage again:

```{code-cell} ipython3
memory_usage()
```

We see that memory usage increased just a bit.

+++

### Benchmark: Using pandas

We now repeat the same process using pandas.

```{code-cell} ipython3
import pandas as pd
import matplotlib.pyplot as plt
import pyarrow.parquet
```

Data loading:

```{code-cell} ipython3
# https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
tables = []

for i in range(1, N_MONTHS + 1):
    filename = f'yellow_tripdata_2021-{str(i).zfill(2)}.parquet'
    t = pyarrow.parquet.read_table(filename)
    tables.append(t)

df = pyarrow.concat_tables(tables).to_pandas()
```
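
For comparison, a shorter way to load the same files is pandas' own parquet reader (a sketch; assumes `pyarrow` is available as the parquet engine):

```python
import glob

import pandas as pd

files = sorted(glob.glob('yellow_tripdata_2021-*.parquet'))
df = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
```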

First histogram:

```{code-cell} ipython3
_ = plt.hist(df.trip_distance, bins=50)
```

Again, let's find the 99th percentile and use it as a cutoff:

```{code-cell} ipython3
cutoff = df.trip_distance.quantile(.99)
cutoff
```

```{code-cell} ipython3
subset = df.trip_distance[df.trip_distance < cutoff]
```

```{code-cell} ipython3
_ = plt.hist(subset, bins=50)
```

```{code-cell} ipython3
memory_usage()
```

**We're using 1.6GB of memory just by loading the data with pandas!**

Try re-running the notebook with the full 12 months (change `N_MONTHS` to `12` in the earlier cell), and you'll see that memory usage blows up to 8GB.

Even deleting the dataframes does not completely free up the memory ([explanation here](https://stackoverflow.com/a/39377643/709975)):

```{code-cell} ipython3
del df, subset
```

```{code-cell} ipython3
memory_usage()
```
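
Forcing a garbage-collection pass doesn't help much either (a sketch reusing the `memory_usage` helper defined above):

```python
import gc

gc.collect()    # force a full collection pass
memory_usage()  # RSS typically stays high: the allocator rarely returns pages to the OS
```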
## Querying existing dataframes

You can also register a pandas dataframe with the DuckDB engine and query it with SQL:

```{code-cell} ipython3
from sqlalchemy import create_engine

engine = create_engine("duckdb:///:memory:")
engine.execute("register", ("df", pd.DataFrame({"x": range(100)})))
```

```{code-cell} ipython3
%sql engine
```

```{code-cell} ipython3
%%sql
SELECT *
FROM df
WHERE x > 95
```
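
The same round trip also works without SQLAlchemy: DuckDB can scan a pandas dataframe that is in scope through its replacement-scan mechanism (a sketch using the `duckdb` package directly):

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# DuckDB resolves the table name "df" to the in-scope dataframe
duckdb.query("SELECT * FROM df WHERE x > 95").df()
```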