Technique example: data loader, Python to parquet #1422

Merged Jul 15, 2024 (14 commits)
1 change: 1 addition & 0 deletions examples/README.md
@@ -47,6 +47,7 @@
- [`loader-google-analytics`](https://observablehq.observablehq.cloud/framework-example-loader-google-analytics/) - Loading data from Google Analytics
- [`loader-parquet`](https://observablehq.observablehq.cloud/framework-example-loader-parquet/) - Generating Apache Parquet files
- [`loader-postgres`](https://observablehq.observablehq.cloud/framework-example-loader-postgres/) - Loading data from PostgreSQL
- [`loader-python-to-parquet`](https://observablehq.observablehq.cloud/framework-example-loader-python-to-parquet) - Generating Apache Parquet from Python
- [`loader-r-to-json`](https://observablehq.observablehq.cloud/framework-example-loader-r-to-json/) - Generating JSON from R
- [`loader-snowflake`](https://observablehq.observablehq.cloud/framework-example-loader-snowflake/) - Loading data from Snowflake
- [`netcdf-contours`](https://observablehq.observablehq.cloud/framework-example-netcdf-contours/) - Converting NetCDF to GeoJSON with `netcdfjs` and `d3-geo-voronoi`
5 changes: 5 additions & 0 deletions examples/loader-python-to-parquet/.gitignore
@@ -0,0 +1,5 @@
.DS_Store
/dist/
node_modules/
yarn-error.log
.venv
7 changes: 7 additions & 0 deletions examples/loader-python-to-parquet/README.md
@@ -0,0 +1,7 @@
[Framework examples →](../)

# Python data loader to generate Apache Parquet

View live: <https://observablehq.observablehq.cloud/framework-example-loader-python-to-parquet/>

This Observable Framework example demonstrates how to write a Python data loader that outputs an Apache Parquet file using the [pyarrow](https://pypi.org/project/pyarrow/) library. The loader reads in a CSV with records for over 91,000 dams in the United States from the [National Inventory of Dams](https://nid.sec.usace.army.mil/#/), selects several columns, then writes the data frame as a parquet file to standard output. The data loader lives in [`src/data/us-dams.parquet.py`](./src/data/us-dams.parquet.py).
3 changes: 3 additions & 0 deletions examples/loader-python-to-parquet/observablehq.config.js
@@ -0,0 +1,3 @@
export default {
root: "src"
};
20 changes: 20 additions & 0 deletions examples/loader-python-to-parquet/package.json
@@ -0,0 +1,20 @@
{
"type": "module",
"private": true,
"scripts": {
"clean": "rimraf src/.observablehq/cache",
"build": "rimraf dist && observable build",
"dev": "observable preview",
"deploy": "observable deploy",
"observable": "observable"
},
"dependencies": {
"@observablehq/framework": "^1.7.0"
},
"devDependencies": {
"rimraf": "^5.0.5"
},
"engines": {
"node": ">=18"
}
}
2 changes: 2 additions & 0 deletions examples/loader-python-to-parquet/requirements.txt
@@ -0,0 +1,2 @@
pandas==2.2.0
pyarrow==16.1
1 change: 1 addition & 0 deletions examples/loader-python-to-parquet/src/.gitignore
@@ -0,0 +1 @@
/.observablehq/cache/
18 changes: 18 additions & 0 deletions examples/loader-python-to-parquet/src/data/us-dams.parquet.py
@@ -0,0 +1,18 @@
# Load libraries (must be installed in the environment)
import sys

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read the National Inventory of Dams CSV and keep four columns
df = pd.read_csv(
    "https://nid.sec.usace.army.mil/api/nation/csv", low_memory=False, skiprows=1
).loc[:, ["Dam Name", "Primary Purpose", "Primary Dam Type", "Hazard Potential Classification"]]

# Write DataFrame to a temporary file-like object
buf = pa.BufferOutputStream()
table = pa.Table.from_pandas(df)
pq.write_table(table, buf)

# Get the buffer as a bytes object
buf_bytes = buf.getvalue().to_pybytes()

# Write the bytes to standard output
sys.stdout.buffer.write(buf_bytes)
64 changes: 64 additions & 0 deletions examples/loader-python-to-parquet/src/index.md
@@ -0,0 +1,64 @@
# Python data loader to generate Apache Parquet

Here’s a Python data loader that accesses records for over 91,000 dams from the [National Inventory of Dams](https://nid.sec.usace.army.mil/#/), limits the data to four columns, then writes an Apache Parquet file to standard output.

```python
# Load libraries (must be installed in the environment)
import sys

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read the National Inventory of Dams CSV and keep four columns
df = pd.read_csv(
    "https://nid.sec.usace.army.mil/api/nation/csv", low_memory=False, skiprows=1
).loc[:, ["Dam Name", "Primary Purpose", "Primary Dam Type", "Hazard Potential Classification"]]

# Write pandas DataFrame to a temporary object
buf = pa.BufferOutputStream()
table = pa.Table.from_pandas(df)
pq.write_table(table, buf)

# Get the buffer as a bytes object
buf_bytes = buf.getvalue().to_pybytes()

# Write the bytes to standard output
sys.stdout.buffer.write(buf_bytes)
```

<div class="note">

To run this data loader, you’ll need Python 3 and the `pandas` and `pyarrow` libraries installed and available in your environment. We recommend setting up a virtual environment, for example:

- `$ python3 -m venv .venv`
- `$ source .venv/bin/activate`

Then install the required modules:

- `$ pip install -r requirements.txt`

</div>

The above data loader lives in `data/us-dams.parquet.py`, so we can load the data using `data/us-dams.parquet`. The `FileAttachment.parquet` method parses the file and returns a promise to an Apache Arrow table.

```js echo
const dams = FileAttachment("./data/us-dams.parquet").parquet();
```

We can display the table using `Inputs.table`.

```js echo
Inputs.table(dams)
```

Lastly, we can pass the table to `Plot.plot` to make a simple bar chart of dam counts by purpose, with color mapped to hazard classification.

```js echo
Plot.plot({
  marginLeft: 220,
  color: {legend: true, domain: ["Undetermined", "Low", "Significant", "High"]},
  marks: [
    Plot.barX(
      dams,
      Plot.groupY(
        {x: "count"},
        {y: "Primary Purpose", fill: "Hazard Potential Classification", sort: {y: "-x"}}
      )
    )
]
})
```