Mindsdb integration (catherinedevlin#80)
* Moved integrations into a section + added mindsdb

* Fixing toc issue

* Editing guide

* Review fixes

* Skip execution + ipynb format

idomic authored Jan 23, 2023
1 parent 8a42ec4 commit 16bd4e6
Showing 5 changed files with 810 additions and 28 deletions.
1 change: 1 addition & 0 deletions doc/_config.yml
@@ -11,6 +11,7 @@ execute:

  exclude_patterns:
    - 'howto/*-connect.ipynb'
    - 'integrations/mindsdb.ipynb'

# Define the name of the latex output file for PDF builds
latex:
19 changes: 9 additions & 10 deletions doc/_toc.yml
@@ -9,25 +9,24 @@ parts:
      - file: intro
      - file: connecting
      - file: plot
      - file: plot-large
      - file: csv
      - file: compose
      - file: plot-legacy

  - caption: Integrations
    chapters:
      - file: integrations/duckdb
      - file: integrations/pandas
      - file: integrations/mindsdb

  - caption: API Reference
    chapters:
      - file: api/magic-sql
      - file: api/magic-plot
      - file: api/magic-render
      - file: api/configuration
      - file: api/python

  - caption: How-To
    chapters:
      - file: howto
      - file: howto/postgres-install
      - file: howto/postgres-connect

  - caption: Community
    chapters:
105 changes: 87 additions & 18 deletions doc/duckdb.md → doc/integrations/duckdb.md
@@ -159,10 +159,14 @@ SELECT * FROM track LIMIT 5

## Plotting large datasets

```{versionadded} 0.5.2
```

```{note}
This is a beta feature; please [join our community](https://ploomber.io/community) and let us know what plots we should add next!
```

This section demonstrates how we can efficiently plot large datasets with DuckDB and JupySQL without blowing up our machine's memory. `%sqlplot` performs all aggregations in DuckDB.

Let's install the required package:

@@ -191,17 +195,39 @@
In total, this contains more than 4.6M observations:

```{code-cell} ipython3
%%sql
SELECT count(*) FROM 'yellow_tripdata_2021-*.parquet'
```

Now, let's keep track of how much memory this Python session is using:

```{code-cell} ipython3
import psutil
import os


def memory_usage():
    """Print how much memory we're using"""
    process = psutil.Process(os.getpid())
    # resident set size (RSS) in bytes, converted to GB
    total = process.memory_info().rss / 10 ** 9
    print(f'Using: {total:.1f} GB')
```

```{code-cell} ipython3
memory_usage()
```

Let's use JupySQL to get a histogram of `trip_distance` across all the files:

```{code-cell} ipython3
%sqlplot histogram --table yellow_tripdata_2021-*.parquet --column trip_distance --bins 50
```
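
`%sqlplot` only pulls the aggregated bin counts into Python; the heavy lifting stays in the database. Here is a minimal sketch of that idea using the `duckdb` package directly (illustrative only, not the exact query JupySQL generates):

```python
import duckdb
import matplotlib.pyplot as plt

con = duckdb.connect()

# aggregate in DuckDB: one row per 1-mile bin instead of millions of raw rows
bins = con.execute("""
    SELECT floor(trip_distance) AS bin, count(*) AS n
    FROM 'yellow_tripdata_2021-*.parquet'
    WHERE trip_distance IS NOT NULL
    GROUP BY bin
    ORDER BY bin
""").fetchall()

xs, counts = zip(*bins)
plt.bar(xs, counts)
```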

We have some outliers; let's find the 99th percentile:

```{code-cell} ipython3
%%sql
SELECT percentile_disc(0.99) WITHIN GROUP (ORDER BY trip_distance)
FROM 'yellow_tripdata_2021-*.parquet'
```
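
If you want to double-check that number outside the magic, the same ordered-set aggregate runs directly through the `duckdb` package (a sketch against the same files):

```python
import duckdb

con = duckdb.connect()
cutoff = con.execute("""
    SELECT percentile_disc(0.99) WITHIN GROUP (ORDER BY trip_distance)
    FROM 'yellow_tripdata_2021-*.parquet'
""").fetchone()[0]
print(cutoff)  # roughly 18.93 for these files
```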

@@ -214,35 +240,78 @@
FROM 'yellow_tripdata_2021-*.parquet'
WHERE trip_distance < 18.93
```

### Histogram

Now we create a new histogram, this time without the outliers:

```{code-cell} ipython3
%sqlplot histogram --table no_outliers --column trip_distance --bins 50 --with no_outliers
```
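
The `--with no_outliers` argument tells JupySQL to inline the stored `no_outliers` definition, so both the filtering and the binning run inside DuckDB. Conceptually, the generated query looks something like this (a hypothetical sketch; the SQL JupySQL actually emits may differ):

```python
import duckdb

# the snippet is prepended as a CTE, then aggregated as in the earlier sketch
con = duckdb.connect()
con.execute("""
    WITH no_outliers AS (
        SELECT *
        FROM 'yellow_tripdata_2021-*.parquet'
        WHERE trip_distance < 18.93
    )
    SELECT floor(trip_distance) AS bin, count(*) AS n
    FROM no_outliers
    GROUP BY bin
    ORDER BY bin
""").fetchall()
```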

### Boxplot

```{code-cell} ipython3
%sqlplot boxplot --table no_outliers --column trip_distance --with no_outliers
```

Let's measure memory usage again:

```{code-cell} ipython3
memory_usage()
```

We see that memory usage increased just a bit.

+++

### Benchmark: Using pandas

We now repeat the same process using pandas.

```{code-cell} ipython3
import pandas as pd
import matplotlib.pyplot as plt
import pyarrow.parquet
```

Data loading:

```{code-cell} ipython3
# https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
tables = []

for i in range(1, N_MONTHS + 1):
    filename = f'yellow_tripdata_2021-{str(i).zfill(2)}.parquet'
    t = pyarrow.parquet.read_table(filename)
    tables.append(t)

df = pyarrow.concat_tables(tables).to_pandas()
```
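
For comparison, a shorter way to load the same files is pandas' own parquet reader (a sketch; assumes `pyarrow` is available as the parquet engine):

```python
import glob

import pandas as pd

files = sorted(glob.glob('yellow_tripdata_2021-*.parquet'))
df = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
```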

First histogram:

```{code-cell} ipython3
_ = plt.hist(df.trip_distance, bins=50)
```

Again, let's find the 99th percentile and use it as a cutoff:

```{code-cell} ipython3
cutoff = df.trip_distance.quantile(.99)
cutoff
```

```{code-cell} ipython3
subset = df.trip_distance[df.trip_distance < cutoff]
```

```{code-cell} ipython3
_ = plt.hist(subset, bins=50)
```

```{code-cell} ipython3
memory_usage()
```

**We're using 1.6GB of memory just by loading the data with pandas!**

Try re-running the notebook with the full 12 months (change `N_MONTHS` to `12` in the earlier cell), and you'll see that memory usage blows up to 8GB.

Even deleting the dataframes does not completely free up the memory ([explanation here](https://stackoverflow.com/a/39377643/709975)):

```{code-cell} ipython3
del df, subset
```

```{code-cell} ipython3
memory_usage()
```
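
Forcing a garbage-collection pass doesn't help much either (a sketch reusing the `memory_usage` helper defined above):

```python
import gc

gc.collect()    # force a full collection pass
memory_usage()  # RSS typically stays high: the allocator rarely returns pages to the OS
```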
## Querying existing dataframes

You can also register a pandas dataframe with the DuckDB engine and query it with SQL:

```{code-cell} ipython3
from sqlalchemy import create_engine

engine = create_engine("duckdb:///:memory:")
engine.execute("register", ("df", pd.DataFrame({"x": range(100)})))
```

```{code-cell} ipython3
%sql engine
```

```{code-cell} ipython3
%%sql
SELECT *
FROM df
WHERE x > 95
```
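
The same round trip also works without SQLAlchemy: DuckDB can scan a pandas dataframe that is in scope through its replacement-scan mechanism (a sketch using the `duckdb` package directly):

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# DuckDB resolves the table name "df" to the in-scope dataframe
duckdb.query("SELECT * FROM df WHERE x > 95").df()
```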