Technique example: data loader, Python to parquet #1422

Merged Jul 15, 2024 (14 commits)
1 change: 1 addition & 0 deletions examples/README.md
@@ -47,6 +47,7 @@
- [`loader-google-analytics`](https://observablehq.observablehq.cloud/framework-example-loader-google-analytics/) - Loading data from Google Analytics
- [`loader-parquet`](https://observablehq.observablehq.cloud/framework-example-loader-parquet/) - Generating Apache Parquet files
- [`loader-postgres`](https://observablehq.observablehq.cloud/framework-example-loader-postgres/) - Loading data from PostgreSQL
- [`loader-python-to-parquet`](https://observablehq.observablehq.cloud/framework-example-loader-python-to-parquet) - Generating Apache Parquet from Python
- [`loader-r-to-json`](https://observablehq.observablehq.cloud/framework-example-loader-r-to-json/) - Generating JSON from R
- [`loader-snowflake`](https://observablehq.observablehq.cloud/framework-example-loader-snowflake/) - Loading data from Snowflake
- [`netcdf-contours`](https://observablehq.observablehq.cloud/framework-example-netcdf-contours/) - Converting NetCDF to GeoJSON with `netcdfjs` and `d3-geo-voronoi`
5 changes: 5 additions & 0 deletions examples/loader-python-to-parquet/.gitignore
@@ -0,0 +1,5 @@
.DS_Store
/dist/
node_modules/
yarn-error.log
.venv
7 changes: 7 additions & 0 deletions examples/loader-python-to-parquet/README.md
@@ -0,0 +1,7 @@
[Framework examples →](../)

# Python data loader to generate Apache Parquet

View live: <https://observablehq.observablehq.cloud/framework-example-loader-python-to-parquet/>

This Observable Framework example demonstrates how to write a Python data loader that outputs an Apache Parquet file using the [pyarrow](https://pypi.org/project/pyarrow/) library. The loader reads in a CSV with records for over 91,000 dams in the United States from the [National Inventory of Dams](https://nid.sec.usace.army.mil/#/), selects several columns, then writes the data frame as a parquet file to standard output. The data loader lives in [`src/data/us-dams.parquet.py`](./src/data/us-dams.parquet.py).
3 changes: 3 additions & 0 deletions examples/loader-python-to-parquet/observablehq.config.js
@@ -0,0 +1,3 @@
export default {
root: "src"
};
20 changes: 20 additions & 0 deletions examples/loader-python-to-parquet/package.json
@@ -0,0 +1,20 @@
{
"type": "module",
"private": true,
"scripts": {
"clean": "rimraf src/.observablehq/cache",
"build": "rimraf dist && observable build",
"dev": "observable preview",
"deploy": "observable deploy",
"observable": "observable"
},
"dependencies": {
"@observablehq/framework": "^1.7.0"
},
"devDependencies": {
"rimraf": "^5.0.5"
},
"engines": {
"node": ">=18"
}
}
2 changes: 2 additions & 0 deletions examples/loader-python-to-parquet/requirements.txt
@@ -0,0 +1,2 @@
pandas==2.2.0
pyarrow==16.1
1 change: 1 addition & 0 deletions examples/loader-python-to-parquet/src/.gitignore
@@ -0,0 +1 @@
/.observablehq/cache/
18 changes: 18 additions & 0 deletions examples/loader-python-to-parquet/src/data/us-dams.parquet.py
@@ -0,0 +1,18 @@
# Load libraries (must be installed in the environment)
import sys

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read the National Inventory of Dams CSV and keep four columns
df = pd.read_csv(
    "https://nid.sec.usace.army.mil/api/nation/csv", low_memory=False, skiprows=1
).loc[:, ["Dam Name", "Primary Purpose", "Primary Dam Type", "Hazard Potential Classification"]]

# Write DataFrame to a temporary file-like object
buf = pa.BufferOutputStream()
table = pa.Table.from_pandas(df)
pq.write_table(table, buf)

# Get the buffer as a bytes object
buf_bytes = buf.getvalue().to_pybytes()

# Write the bytes to standard output
sys.stdout.buffer.write(buf_bytes)
64 changes: 64 additions & 0 deletions examples/loader-python-to-parquet/src/index.md
@@ -0,0 +1,64 @@
# Python data loader to generate Apache Parquet

Here’s a Python data loader that accesses records for over 91,000 dams from the [National Inventory of Dams](https://nid.sec.usace.army.mil/#/), limits the data to four columns, then writes an Apache Parquet file to standard output.

```python
# Load libraries (must be installed in the environment)
import sys

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read the National Inventory of Dams CSV and keep four columns
df = pd.read_csv(
    "https://nid.sec.usace.army.mil/api/nation/csv", low_memory=False, skiprows=1
).loc[:, ["Dam Name", "Primary Purpose", "Primary Dam Type", "Hazard Potential Classification"]]

# Write pandas DataFrame to a temporary object
buf = pa.BufferOutputStream()
table = pa.Table.from_pandas(df)
pq.write_table(table, buf)

# Get the buffer as a bytes object
buf_bytes = buf.getvalue().to_pybytes()

# Write the bytes to standard output
sys.stdout.buffer.write(buf_bytes)
```

<div class="note">

To run this data loader, you’ll need Python 3 and the `pandas` and `pyarrow` libraries installed and available in your environment. We recommend setting up a virtual environment, for example:

- `$ python3 -m venv .venv`
- `$ source .venv/bin/activate`

Then install the required modules:

- `$ pip install -r requirements.txt`

</div>

The above data loader lives in `data/us-dams.parquet.py`, so we can load the data using `data/us-dams.parquet`. The `FileAttachment.parquet` method parses the file and returns a promise to an Apache Arrow table.

```js echo
const dams = FileAttachment("./data/us-dams.parquet").parquet();
```

We can display the table using `Inputs.table`.

```js echo
Inputs.table(dams)
```

Lastly, we can pass the table to `Plot.plot` to make a simple bar chart of dam counts by purpose, with color mapped to hazard classification.

```js echo
Plot.plot({
  marginLeft: 220,
  color: {legend: true, domain: ["Undetermined", "Low", "Significant", "High"]},
  marks: [
    Plot.barX(
      dams,
      Plot.groupY(
        {x: "count"},
        {y: "Primary Purpose", fill: "Hazard Potential Classification", sort: {y: "-x"}}
      )
    )
]
})
```