duckdb tutorial
pgzmnk committed Apr 9, 2024
1 parent 6f7cb75 commit a468324
Showing 3 changed files with 484 additions and 0 deletions.
314 changes: 314 additions & 0 deletions docs/basics/tutorials/duckdb.ipynb
@@ -0,0 +1,314 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# DuckDB\n",
"\n",
"This notebook tutorial is support material for the \"DuckDB + Fused: Fly beyond the serverless horizon\" blog post."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Run DuckDB in a Fused UDF\n",
"\n",
"As an example of running DuckDB within a Fused UDF, take the case of loading a geospatial Parquet dataset. The \"DuckDB H3\" sample UDF runs an SQL query with DuckDB on the NYC Taxi Trip Record Dataset. It uses the bbox argument to spatially filter the dataset and automatically parallelize the operation.\n",
"\n",
"To try this example, you can run the cell below. You can find the code of the UDF in the Fused public UDF [repo](https://github.com/fusedio/udfs/tree/main/public/DuckDB_H3_Example_Tile).\n",
"\n",
"Alternatively, you can import the \"DuckDB H3 Example Tile\" UDF into your Fused Workbench environment. \n",
"\n",
"This pattern gives DuckDB easy parallel operations. Fused spatially filters via the bbox parameter to enable automatic parallelization. Fused breaks down operations to only a fraction of the dataset, so it's easy to transition between SQL and Python.\n",
"\n",
"<img src=\"https://fused-magic.s3.us-west-2.amazonaws.com/docs_assets/nyc.png\" alt=\"overture\" width=\"600\"/>"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"# !pip install fused "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import fused\n",
"\n",
"udf = fused.load(\"https://github.com/fusedio/udfs/tree/main/public/DuckDB_H3_Example_Tile\")\n",
"gdf = fused.run(udf=udf, x=2412, y=3078, z=13)\n",
"gdf"
]
},
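{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `x`, `y`, and `z` arguments in the call above identify one map tile. As a minimal sketch, assuming they follow the standard XYZ (slippy map) tiling scheme in Web Mercator, the cell below computes the longitude/latitude bounds such a tile would cover. The helper `tile_to_bounds` is hypothetical and not part of the Fused API; it only illustrates how a tile index corresponds to a bounding box like the one used for spatial filtering."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"\n",
"def tile_to_bounds(x, y, z):\n",
"    \"\"\"Return (west, south, east, north) in degrees for an XYZ tile (illustrative helper).\"\"\"\n",
"    n = 2 ** z  # number of tiles along each axis at zoom level z\n",
"\n",
"    def lon(col):\n",
"        return col / n * 360.0 - 180.0\n",
"\n",
"    def lat(row):\n",
"        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * row / n))))\n",
"\n",
"    # Row y is the tile's northern edge; row y + 1 is its southern edge.\n",
"    return lon(x), lat(y + 1), lon(x + 1), lat(y)\n",
"\n",
"\n",
"# Same tile index as the fused.run call above; the bounds fall over New York City.\n",
"west, south, east, north = tile_to_bounds(2412, 3078, 13)\n",
"print(f\"west={west:.4f} south={south:.4f} east={east:.4f} north={north:.4f}\")"
]
},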
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Call Fused UDFs from DuckDB\n",
"\n",
"Any database that supports querying data via HTTP can call and load data from Fused UDF endpoints using common formats like Parquet or CSV. This means that DuckDB can dispatch operations to Fused that otherwise would be too complex or impossible to express with SQL, or would be unsupported in the local runtime.\n",
"\n",
"As an example of calling a Fused endpoint from within DuckDB, take an operation to vectorize a raster dataset. This might be necessary to determine the bounds of areas with pixel value within a certain threshold range in an Earth observation image - such as a Digital Elevation Model. SQL is not geared to support raster operations, but these are easy to do in Python.\n",
"\n",
"\n",
"<img src=\"https://fused-magic.s3.us-west-2.amazonaws.com/docs_assets/gifs/sql.gif\" alt=\"overture\" width=\"600\"/>\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, a Fused UDF returns a table where each record is a polygon generated from the contour of a raster provided by the Copernicus Digital Elevation Model as a Cloud Optimized GeoTIFF. DuckDB can easily trigger a UDF and load its output with this simple query, which specifies that the UDF endpoint returns a Parquet file.\n",
"\n",
"This SQL query uses DuckDB's read_parquet function to call an endpoint of a UDF instance of the \"DEM Raster to Vector\" UDF.\n",
"\n",
"You can find the code of the UDF in the Fused public UDF [repo](https://github.com/fusedio/udfs/tree/main/public/DEM_Raster_to_Vector_Example).\n",
"\n",
"To try this example, simply run the following SQL query on the cell below or in a [DuckDB shell](https://shell.duckdb.org/#queries=v0,CREATE-TABLE-dem_polygons-AS%0ASELECT-wkt,-area%0AFROM-read_csv('https://www.fused.io/server/v1/realtime%20shared/'-%7C%7C%0A----'1e35c9b9cadf900265443073b0bd99072f859b8beddb72a45e701fb5bcde807d'-%7C%7C%0A----'/run/file?dtype_out_vector=csv'-%7C%7C%0A----'&min_elevation=500')~%0A). Change the `min_elevation` parameter to run the UDF for parts of California at different elevations. (Note: for DuckDB WASM, the file will be requested as CSV.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import duckdb\n",
"\n",
"con = duckdb.connect()\n",
"\n",
"con.sql(\"\"\"\n",
" SELECT \n",
" wkt, \n",
" ROUND(area,1) AS area\n",
" FROM read_parquet('https://www.fused.io/server/v1/realtime-shared/1e35c9b9cadf900265443073b0bd99072f859b8beddb72a45e701fb5bcde807d/run/file?min_elevation=500&dtype_out_vector=parquet')\n",
" LIMIT 5\n",
"\"\"\")\n"
]
},
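{
"cell_type": "markdown",
"metadata": {},
"source": [
"To change `min_elevation` without editing the URL by hand, the endpoint can be assembled from a dictionary of parameters. The cell below is a minimal sketch that reuses the shared endpoint from the query above; the helper `udf_url` is hypothetical and not part of Fused or DuckDB, and the value `1000` is only an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib.parse\n",
"\n",
"import duckdb\n",
"\n",
"# Shared UDF endpoint copied from the query above.\n",
"BASE_URL = (\n",
"    \"https://www.fused.io/server/v1/realtime-shared/\"\n",
"    \"1e35c9b9cadf900265443073b0bd99072f859b8beddb72a45e701fb5bcde807d/run/file\"\n",
")\n",
"\n",
"\n",
"def udf_url(min_elevation, dtype_out_vector=\"parquet\"):\n",
"    \"\"\"Build the endpoint URL with its query parameters (illustrative helper).\"\"\"\n",
"    params = {\"min_elevation\": min_elevation, \"dtype_out_vector\": dtype_out_vector}\n",
"    return f\"{BASE_URL}?{urllib.parse.urlencode(params)}\"\n",
"\n",
"\n",
"con = duckdb.connect()\n",
"\n",
"# Same query as above, with a different elevation threshold (example value).\n",
"con.sql(f\"\"\"\n",
"    SELECT wkt, ROUND(area, 1) AS area\n",
"    FROM read_parquet('{udf_url(1000)}')\n",
"    LIMIT 5\n",
"\"\"\")"
]
},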
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This pattern enables DuckDB to address use cases and data formats that it doesn't natively support or would otherwise see high data transfer cost, such as raster operations, API calls, and control flow logic."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Integrate DuckDB in applications using Fused\n",
"\n",
"Fused is the glue layer between DuckDB and apps. This enables seamless integrations that trigger Fused UDFs and load their results with simple parameterized HTTP calls.\n",
"\n",
"DuckDB is an embedded database engine and doesn't have built-in capability to share results other than writing out files. As a corollary of the preceding example, it's possible to query and transform data with DuckDB and seamlessly integrate the results of queries into any workflow or app.\n",
"\n",
"As an example, take the case of loading the output of a DuckDB query into Google Sheets. Sheets can easily structure the Fused UDF endpoint to pass parameters defined in specific cells as URL query parameters. In this example, the importData command calls the same UDF from above and loads its output data in CSV format.\n",
"\n",
"\n",
"\n",
"<img src=\"https://fused-magic.s3.us-west-2.amazonaws.com/docs_assets/gifs/sheets.gif\" alt=\"overture\" width=\"600\"/>\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To try this example simply make a copy of this Google Sheets spreadsheet (File > Make a copy) and click, and modify the parameters in B2:4 to trigger the Fused UDF endpoint and load data.\n",
"\n",
"You can learn more about the Google Sheets integration in the [documentation](/basics/out/googlesheets/).\n",
"\n",
"This pattern brings the power of the DuckDB analytical engine into non-analytical and no-code software like Google Sheets, Retool, and beyond - without the need to build bespoke integrations with closed-source systems. With this, a Python developer can abstract away the UDF and deliver data to end users. This removes the need to even install DuckDB."
]
},
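{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the same idea from the application side, the cell below loads the UDF output as CSV over plain HTTP, much like the spreadsheet does. It assumes `pandas` is installed and reuses the shared endpoint and the `min_elevation` parameter from section 2; the helper `load_udf_csv` is hypothetical and not part of Fused."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib.parse\n",
"\n",
"import pandas as pd\n",
"\n",
"# Shared UDF endpoint from section 2.\n",
"BASE_URL = (\n",
"    \"https://www.fused.io/server/v1/realtime-shared/\"\n",
"    \"1e35c9b9cadf900265443073b0bd99072f859b8beddb72a45e701fb5bcde807d/run/file\"\n",
")\n",
"\n",
"\n",
"def load_udf_csv(min_elevation):\n",
"    \"\"\"Request the UDF output as CSV and parse it into a DataFrame (illustrative helper).\"\"\"\n",
"    params = {\"dtype_out_vector\": \"csv\", \"min_elevation\": min_elevation}\n",
"    return pd.read_csv(f\"{BASE_URL}?{urllib.parse.urlencode(params)}\")\n",
"\n",
"\n",
"# The client only issues an HTTP request; DuckDB is not needed on this side.\n",
"load_udf_csv(500).head()"
]
},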
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>cell_id</th>\n",
" <th>cnt</th>\n",
" <th>geometry</th>\n",
" <th>fused_index</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>892a100d657ffff</td>\n",
" <td>124150</td>\n",
" <td>POLYGON ((-73.97990 40.76506, -73.98200 40.764...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>892a100d22bffff</td>\n",
" <td>97189</td>\n",
" <td>POLYGON ((-73.98945 40.73572, -73.99155 40.734...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>89754e64993ffff</td>\n",
" <td>268965</td>\n",
" <td>POLYGON ((0.00010 -0.00031, 0.00036 0.00134, -...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>892a1072c7bffff</td>\n",
" <td>8915</td>\n",
" <td>POLYGON ((-74.01244 40.72188, -74.01454 40.720...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>892a100d2afffff</td>\n",
" <td>34257</td>\n",
" <td>POLYGON ((-73.97887 40.73865, -73.98097 40.737...</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3945</th>\n",
" <td>892a10012d7ffff</td>\n",
" <td>12</td>\n",
" <td>POLYGON ((-73.88583 40.88399, -73.88794 40.883...</td>\n",
" <td>3945</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3946</th>\n",
" <td>892a100ee77ffff</td>\n",
" <td>13</td>\n",
" <td>POLYGON ((-73.82644 40.72416, -73.82855 40.723...</td>\n",
" <td>3946</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3947</th>\n",
" <td>892a107222bffff</td>\n",
" <td>11</td>\n",
" <td>POLYGON ((-74.06501 40.75216, -74.06711 40.751...</td>\n",
" <td>3947</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3948</th>\n",
" <td>892a107204bffff</td>\n",
" <td>11</td>\n",
" <td>POLYGON ((-74.04833 40.76320, -74.05044 40.762...</td>\n",
" <td>3948</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3949</th>\n",
" <td>892a100e297ffff</td>\n",
" <td>11</td>\n",
" <td>POLYGON ((-73.84098 40.77137, -73.84309 40.770...</td>\n",
" <td>3949</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3950 rows × 4 columns</p>\n",
"</div>"
],
"text/plain": [
" cell_id cnt \\\n",
"0 892a100d657ffff 124150 \n",
"1 892a100d22bffff 97189 \n",
"2 89754e64993ffff 268965 \n",
"3 892a1072c7bffff 8915 \n",
"4 892a100d2afffff 34257 \n",
"... ... ... \n",
"3945 892a10012d7ffff 12 \n",
"3946 892a100ee77ffff 13 \n",
"3947 892a107222bffff 11 \n",
"3948 892a107204bffff 11 \n",
"3949 892a100e297ffff 11 \n",
"\n",
" geometry fused_index \n",
"0 POLYGON ((-73.97990 40.76506, -73.98200 40.764... 0 \n",
"1 POLYGON ((-73.98945 40.73572, -73.99155 40.734... 1 \n",
"2 POLYGON ((0.00010 -0.00031, 0.00036 0.00134, -... 2 \n",
"3 POLYGON ((-74.01244 40.72188, -74.01454 40.720... 3 \n",
"4 POLYGON ((-73.97887 40.73865, -73.98097 40.737... 4 \n",
"... ... ... \n",
"3945 POLYGON ((-73.88583 40.88399, -73.88794 40.883... 3945 \n",
"3946 POLYGON ((-73.82644 40.72416, -73.82855 40.723... 3946 \n",
"3947 POLYGON ((-74.06501 40.75216, -74.06711 40.751... 3947 \n",
"3948 POLYGON ((-74.04833 40.76320, -74.05044 40.762... 3948 \n",
"3949 POLYGON ((-73.84098 40.77137, -73.84309 40.770... 3949 \n",
"\n",
"[3950 rows x 4 columns]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import fused\n",
"\n",
"udf = fused.load(\"https://github.com/fusedio/udfs/tree/main/public/DuckDB_H3_Example\")\n",
"gdf = fused.run(udf=udf, engine='realtime')\n",
"gdf"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.19"
}
},
"nbformat": 4,
"nbformat_minor": 2
}