Update penguin classification to technique example (Py data loader) #1461

Merged 1 commit on Jun 14, 2024
2 changes: 1 addition & 1 deletion examples/README.md
@@ -48,6 +48,7 @@
- [`loader-julia-to-txt`](https://observablehq.observablehq.cloud/framework-example-loader-julia-to-txt/) - Generating TXT from Julia
- [`loader-parquet`](https://observablehq.observablehq.cloud/framework-example-loader-parquet/) - Generating Apache Parquet files
- [`loader-postgres`](https://observablehq.observablehq.cloud/framework-example-loader-postgres/) - Loading data from PostgreSQL
- [`loader-python-to-csv`](https://observablehq.observablehq.cloud/framework-example-penguin-classification/) - Generating CSV from Python
- [`loader-python-to-png`](https://observablehq.observablehq.cloud/framework-example-loader-python-to-png/) - Generating PNG from Python
- [`loader-python-to-zip`](https://observablehq.observablehq.cloud/framework-example-loader-python-to-zip/) - Generating ZIP from Python
- [`loader-r-to-csv`](https://observablehq.observablehq.cloud/framework-example-loader-r-to-csv/) - Generating CSV from R
@@ -75,7 +76,6 @@
- [`google-analytics`](https://observablehq.observablehq.cloud/framework-example-google-analytics/) - A Google Analytics dashboard with numbers and charts
- [`hello-world`](https://observablehq.observablehq.cloud/framework-example-hello-world/) - A minimal Framework project
- [`intersection-observer`](https://observablehq.observablehq.cloud/framework-example-intersection-observer/) - Scrollytelling with IntersectionObserver
- [`penguin-classification`](https://observablehq.observablehq.cloud/framework-example-penguin-classification/) - Logistic regression in Python; validating models with Observable Plot
- [`responsive-iframe`](https://observablehq.observablehq.cloud/framework-example-responsive-iframe/) - Adjust the height of an embedded iframe to fit its content

## About these examples
9 changes: 9 additions & 0 deletions examples/loader-python-to-csv/README.md
@@ -0,0 +1,9 @@
[Framework examples →](../)

# Python data loader to generate CSV

View live: <https://observablehq.observablehq.cloud/framework-example-penguin-classification/>

This Observable Framework example demonstrates how to write a data loader in Python to generate a CSV file. The data loader uses scikit-learn’s [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) class to classify [penguins](https://journal.r-project.org/articles/RJ-2022-020/) by species, based on body mass and on culmen and flipper measurements. Charts (made with Observable Plot) show which penguins are misclassified.

The data loader lives in [`src/data/predictions.csv.py`](./src/data/predictions.csv.py).
85 changes: 85 additions & 0 deletions examples/loader-python-to-csv/src/index.md
@@ -0,0 +1,85 @@
# Python data loader to generate CSV

Here’s a Python data loader that uses logistic regression to classify penguin species based on bill and body size measurements, then writes the results as CSV to standard output.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
import sys

# Read the CSV
df = pd.read_csv("src/data/penguins.csv")

# Select columns to train the model
X = df.iloc[:, [2, 3, 4, 5]]
Y = df.iloc[:, 0]

# Create an instance of Logistic Regression Classifier and fit the data.
logreg = LogisticRegression()
logreg.fit(X, Y)

results = df.copy()
# Add predicted values
results['species_predicted'] = logreg.predict(X)

# Write to CSV
results.to_csv(sys.stdout)
```

<div class="note">

To run this data loader, you’ll need python3 available on your `$PATH`, with the pandas and scikit-learn modules installed. We recommend setting up a virtual environment.

</div>
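
The loader's last step, writing CSV to standard output, is the part Framework captures as the file's contents. As a rough sketch of that step using only the standard library (the `to_csv_text` helper and the sample rows are hypothetical, made up for illustration, and not part of this example's loader):

```python
import csv
import io
import sys


def to_csv_text(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows for illustration; not real penguin measurements.
sample_rows = [
    {"species": "Adelie", "species_predicted": "Adelie"},
    {"species": "Chinstrap", "species_predicted": "Adelie"},
]

# A data loader emits its result on standard output.
sys.stdout.write(to_csv_text(sample_rows, ["species", "species_predicted"]))
```

Anything else the script prints to standard output would end up in the generated file, so diagnostics are better sent to standard error.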

To create and activate a Python virtual environment, run the following commands:

```
$ python3 -m venv .venv
$ source .venv/bin/activate
```

Then install the required modules from `requirements.txt` using:

```
$ pip install -r requirements.txt
```
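
After installing, it can help to confirm that the modules the loader imports are visible to the active interpreter. The sketch below is one way to check. It assumes the import names `pandas` and `sklearn`, taken from the loader's import statements; the contents of `requirements.txt` are not shown here, and `missing_modules` is a hypothetical helper.

```python
import importlib.util


def missing_modules(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names assumed from the loader's import statements.
required = ["pandas", "sklearn"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```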

The above data loader lives in `data/predictions.csv.py`, so we can load the data using `data/predictions.csv` with `FileAttachment`:

```js echo
const predictions = FileAttachment("data/predictions.csv").csv({typed: true});
```
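
The naming relationship described above (the loader `data/predictions.csv.py` produces `data/predictions.csv`) amounts to dropping the final, interpreter-specific extension. A minimal sketch of that convention follows. `attachment_name` is a hypothetical helper for illustration, not Framework's actual resolution logic.

```python
from pathlib import PurePosixPath


def attachment_name(loader_path):
    """Drop the last extension (e.g. ".py") from a data loader path."""
    return str(PurePosixPath(loader_path).with_suffix(""))


print(attachment_name("data/predictions.csv.py"))
```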

We can create a quick chart of predicted species, highlighting cases where penguins are misclassified, using Observable Plot:

```js echo
Plot.plot({
  grid: true,
  height: 400,
  caption: "Incorrect predictions highlighted with diamonds. Actual species encoded with color and predicted species encoded with symbols.",
  color: {
    legend: true,
  },
  x: {label: "Culmen length (mm)"},
  y: {label: "Culmen depth (mm)"},
  marks: [
    Plot.dot(predictions, {
      x: "culmen_length_mm",
      y: "culmen_depth_mm",
      stroke: "species",
      symbol: "species_predicted",
      r: 3,
      tip: {channels: {"mass": "body_mass_g"}}
    }),
    Plot.dot(predictions, {
      filter: (d) => d.species !== d.species_predicted,
      x: "culmen_length_mm",
      y: "culmen_depth_mm",
      r: 7,
      symbol: "diamond",
      stroke: "currentColor"
    })
  ],
})
```
```
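
The `filter` in the second `Plot.dot` mark keeps only rows where the actual and predicted species differ. The same check can be run in Python against the loader's CSV output, for instance to count misclassifications. This is a sketch only: the sample CSV text is made up for illustration, and `misclassified` is a hypothetical helper.

```python
import csv
import io

# Made-up sample in the shape of the loader's output; not real results.
SAMPLE_CSV = """species,species_predicted,culmen_length_mm
Adelie,Adelie,39.1
Chinstrap,Adelie,46.5
Gentoo,Gentoo,47.3
"""


def misclassified(csv_text):
    """Return the rows whose predicted species differs from the actual one."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["species"] != row["species_predicted"]]


wrong = misclassified(SAMPLE_CSV)
print(f"{len(wrong)} misclassified row(s)")
```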
32 changes: 0 additions & 32 deletions examples/penguin-classification/README.md

This file was deleted.

94 changes: 0 additions & 94 deletions examples/penguin-classification/src/index.md

This file was deleted.