Merge pull request #18 from NICD-UK/script-templates
Script templates
m-misiura authored Mar 2, 2023
2 parents ea7f7f4 + 4ba1aba commit 31b4998
Showing 32 changed files with 264 additions and 179 deletions.
114 changes: 66 additions & 48 deletions README.md
@@ -1,6 +1,6 @@
# Project Template

## Usage
## Setup

To use the project template:

@@ -9,7 +9,7 @@ pip install cookiecutter
cookiecutter https://github.com/NICD-UK/project-template
```

You will be prompted for eleven inputs:
You will be prompted for the following answers:

1. Project Name
2. Project Directory Name
@@ -18,75 +18,93 @@ You will be prompted for eleven inputs:
5. Project Sponsor Name
6. Project Sponsor Email
7. Project Summary
8. Raw Data Directory
9. Language (Python / R)
10. `venv` Project (No / Yes)
11. `git` Project (No / Yes)
8. <a name="language">Language</a>: **Python** or **R**

## Organization
Then run:

```
make
```

This command will:

1. Initialise a virtual environment
- `venv` for Python
- `renv` for R
2. Install the packages required for the template scripts
3. Save the packages to a dependencies file
- `requirements.txt` for Python
- `renv.lock` for R
4. Initialise a git repository

## Package Management

To install a package in Python run:

```
venv/bin/pip install <package>
```

To install a package in R use the Packages tab in RStudio.

To save the installed packages to the dependencies file run:

```
make save
```

To load the packages from the dependencies file run:

```
make load
```

## Project Structure

The project has the following structure:

```
README.md
config.yml
data/
├─ clean/
├─ model/
├─ raw/
├─ wrangle/
models/
presentations/
reports/
├─ clean/
├─ final/
├─ wrangle/
src/
├─ clean/
├─ model/
├─ wrangle/
```
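As a rough illustration, the directory layout above could be recreated with a few lines of Python. This is a sketch, not part of the template: the helper name `create_structure` is hypothetical, and the `.gitkeep` placeholders are an assumption based on the `!.gitkeep` rule in the project's `.gitignore`.

```python
from __future__ import annotations

from pathlib import Path

# Directories copied from the structure listed above.
PROJECT_DIRS = [
    "data/clean", "data/model", "data/raw", "data/wrangle",
    "models", "presentations",
    "reports/clean", "reports/final", "reports/wrangle",
    "src/clean", "src/model", "src/wrangle",
]


def create_structure(root: Path) -> list[Path]:
    """Create every project directory under root and return the created paths."""
    created = []
    for relative in PROJECT_DIRS:
        directory = root / relative
        directory.mkdir(parents=True, exist_ok=True)
        # Assumption: an empty .gitkeep keeps the otherwise empty directory in git.
        (directory / ".gitkeep").touch()
        created.append(directory)
    return created
```

Calling `create_structure(Path("my-project"))` would build the tree under `my-project/`; running it twice is harmless because existing directories are left in place.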

## Data Science Workflow

### 1. Business Understanding

- **Determine Objectives:**
- **Determine Deliverables:**
- **Determine Resources:**
- **Plan Project:**

### 2. Data Preparation and Understanding

- **Import Data:**
- **Clean Data:**
- **Wrangle Data:**

### 3. Prototyping

- **Develop Data Product:**
- **Evaluate Data Product:**
- **Approve Data Product:**
## Project Charter

### 4. Production
The `README.md` file is the [Project Charter](https://en.wikipedia.org/wiki/Project_charter). The head of the project charter includes: the project name; the name and email of the project manager; and the name and email of the project sponsor. This is filled out with the answers to the corresponding prompts during setup. The body of the project charter includes:

- **Deploy Data Product:**
- **Monitor Data Product:**
- **Maintain Data Product:**
- **Close Project:**
- Summary
- Objectives
- Deliverables
- Resources
- Scope
- Costs and Benefits
- Risks and Contingencies

## Guide
The body of the project charter is filled out during the project scoping phase.

### Clean Data
## Script Templates

![](figures/clean.drawio.svg)
There are template scripts for:

1. Create a cleaning script in the `src/clean` directory that imports and cleans the raw data from the `data/raw` directory and writes to the `data/clean/` directory.
2. The cleaned data is stored in the `data/clean/` directory.
3. Create a cleaning report in the `report/clean/` directory that reads the cleaned data from the `data/clean/` directory.
4. The cleaning report in the `report/clean/` directory is used to update the cleaning script in the `src/clean/` directory.
1. cleaning data in `src/clean/`,
2. describing data in `reports/clean/`,
3. wrangling data in `src/wrangle/`,
4. exploring data in `reports/wrangle/`

### Wrangle Data
These template scripts are available in [Python](https://www.python.org) or [R](https://www.r-project.org). Answer **Python** or **R** to the [Language](#language) prompt during setup to get the relevant template scripts. All template scripts include code to read from and write to the appropriate data directories. The template scripts for describing and exploring data generate reports for the cleaned and wrangled data, respectively. There is also a template script for presenting data in `presentations/`, available in [Quarto](https://quarto.org).
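The flow of data between directories can be summarised as a small lookup table. The sketch below is illustrative only: the report entries follow the report templates in this commit, while the `src/` entries assume that cleaning reads from `data/raw/` and wrangling reads from `data/clean/`. The names `DATA_FLOW` and `data_flow` are hypothetical.

```python
from typing import Optional, Tuple

# (reads from, writes to) for each template script directory.
# Report templates only read data, so their second entry is None.
DATA_FLOW = {
    "src/clean": ("data/raw", "data/clean"),
    "reports/clean": ("data/clean", None),
    "src/wrangle": ("data/clean", "data/wrangle"),
    "reports/wrangle": ("data/wrangle", None),
}


def data_flow(script_dir: str) -> Tuple[str, Optional[str]]:
    """Look up the input and output data directories for a template script."""
    try:
        return DATA_FLOW[script_dir.rstrip("/")]
    except KeyError:
        raise ValueError(f"no template scripts live in {script_dir!r}") from None
```

For example, `data_flow("src/wrangle/")` returns the pair of directories a wrangling script would read from and write to.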

![](figures/wrangle.drawio.svg)
## Recommendations

1. Create a wrangling script in the `src/wrangle` directory that reads and wrangles the clean data from the `data/clean/` directory and writes to the `data/wrangle/` directory.
2. The wrangled data is stored in the `data/wrangle/` directory.
3. Create a wrangling report in the `report/wrangle/` directory that reads the wrangled data from the `data/wrangle/` directory.
4. The wrangling report in the `report/wrangle/` directory is used to update the wrangling script in the `src/wrangle/` directory.
For the best experience, use the project template with [Visual Studio Code](https://code.visualstudio.com) for Python projects and [RStudio](https://posit.co/products/open-source/rstudio/) for R projects.
5 changes: 1 addition & 4 deletions cookiecutter.json
@@ -6,8 +6,5 @@
"project_sponsor_name": "Project Sponsor Name",
"project_sponsor_email": "Project Sponsor Email",
"project_summary": "Project Summary",
"raw_data_directory": "data/raw",
"language": ["Python", "R"],
"venv_project": ["No", "Yes"],
"git_project": ["No", "Yes"]
"language": ["Python", "R"]
}
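In `cookiecutter.json`, a list value such as `"language": ["Python", "R"]` is a choice variable, and its first item is the default. The sketch below shows which answers would result from accepting every prompt unchanged. It is a simplification: the helper `default_context` is hypothetical, and it does not render Jinja expressions that may appear in default values.

```python
import json


def default_context(cookiecutter_json: str) -> dict:
    """Return the answers obtained by accepting the default at every prompt."""
    raw = json.loads(cookiecutter_json)
    return {
        # A list is a choice variable: the first item is the default choice.
        key: value[0] if isinstance(value, list) else value
        for key, value in raw.items()
    }


example = '{"project_summary": "Project Summary", "language": ["Python", "R"]}'
print(default_context(example))
```

With the `language` entry above, the default answer is `Python`, so a user who wants an R project must pick the second choice at the prompt.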
22 changes: 4 additions & 18 deletions hooks/post_gen_project.py
@@ -2,33 +2,19 @@
import glob
import os

venv_project = "{{cookiecutter.venv_project}}"
git_project = "{{cookiecutter.git_project}}"
language = "{{cookiecutter.language}}"

# create Python project
if language == "Python":
    os.remove("MakefileR")
    os.rename("MakefilePython", "Makefile")
    os.remove("{{cookiecutter.project_directory_name}}.Rproj")
    for file in glob.glob("**/*.Rmd", recursive=True):
        os.remove(file)

# create R project
if language == "R":
    os.remove("MakefilePython")
    os.rename("MakefileR", "Makefile")
    for file in glob.glob("**/*.py", recursive=True):
        os.remove(file)

# create venv project
if venv_project == "Yes":
    subprocess.run(["python3", "-m", "venv", ".venv"], stdout=subprocess.DEVNULL)
    subprocess.run([".venv/bin/python", "-m", "pip", "install", "--upgrade", "pip"], stdout=subprocess.DEVNULL)

# gitignore config.yml
with open(".gitignore", "a") as f:
    lines = ["\n", "# configuration file\n", "config.yml\n"]
    f.writelines(lines)

# create git project
if git_project == "Yes":
    subprocess.run(["git", "init"], stdout=subprocess.DEVNULL)
    subprocess.run(["git", "add", "--all"], stdout=subprocess.DEVNULL)
    subprocess.run(["git", "commit", "-m", "'initial commit'"], stdout=subprocess.DEVNULL)
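The language-selection part of the hook could also be written as a function that takes the project directory as an argument, which makes it possible to try the logic on a scratch directory. This is only a sketch of the same steps; the function name `select_language` and its parameters are hypothetical and not part of the commit.

```python
import glob
import os


def select_language(project_dir: str, language: str, project_name: str) -> None:
    """Keep only the Makefile and template scripts for the chosen language."""
    other = {"Python": "R", "R": "Python"}[language]

    # Drop the other language's Makefile and promote the chosen one.
    os.remove(os.path.join(project_dir, f"Makefile{other}"))
    os.rename(
        os.path.join(project_dir, f"Makefile{language}"),
        os.path.join(project_dir, "Makefile"),
    )

    if language == "Python":
        # Python projects do not need the RStudio project file or R Markdown.
        os.remove(os.path.join(project_dir, f"{project_name}.Rproj"))
        pattern = "**/*.Rmd"
    else:
        pattern = "**/*.py"

    for path in glob.glob(os.path.join(project_dir, pattern), recursive=True):
        os.remove(path)
```

Passing the directory explicitly avoids relying on the current working directory, which the hook itself depends on.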
15 changes: 10 additions & 5 deletions {{cookiecutter.project_directory_name}}/.gitignore
@@ -1,14 +1,19 @@
# .venv directory
/.venv/
# venv directory
/venv/*
/renv/*
!renv/activate.R

# data directory
/data/clean/*
/data/model/*
/data/raw/*
/data/wrangle/*

# notebooks directory
/notebooks/*
# models directory
/models/*

# directory structure
!.gitkeep

# presentation files
/presentations/*.html
/presentations/*_files/
26 changes: 26 additions & 0 deletions {{cookiecutter.project_directory_name}}/MakefilePython
@@ -0,0 +1,26 @@
.PHONY: all venv save load git

#################################################################################
# COMMANDS #
#################################################################################

all: venv save git

venv:
	python3 -m venv venv
	venv/bin/pip install --upgrade pip
	venv/bin/pip install ipykernel
	venv/bin/pip install pandas
	venv/bin/pip install pathlib
	venv/bin/pip install ydata-profiling

save:
	venv/bin/pip freeze > requirements.txt

load:
	venv/bin/pip install -r requirements.txt

git:
	git init
	git add --all
	git commit -m "initial commit"
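The recipes above hard-code `venv/bin/pip`, which matches the layout `venv` creates on POSIX systems. On Windows, `venv` places executables in a `Scripts` directory instead. A minimal sketch of resolving the path for either layout follows; the helper `venv_executable` is hypothetical and not part of the template.

```python
import os
from pathlib import Path


def venv_executable(venv_dir: str, name: str, os_name: str = os.name) -> Path:
    """Return the path of an executable inside a virtual environment."""
    if os_name == "nt":
        # Windows layout: venv\Scripts\pip.exe
        return Path(venv_dir) / "Scripts" / f"{name}.exe"
    # POSIX layout: venv/bin/pip
    return Path(venv_dir) / "bin" / name


print(venv_executable("venv", "pip"))
```

The `os_name` parameter defaults to the current platform and exists only so that both layouts can be checked from one machine.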
26 changes: 26 additions & 0 deletions {{cookiecutter.project_directory_name}}/MakefileR
@@ -0,0 +1,26 @@
.PHONY: all venv save load git

#################################################################################
# COMMANDS #
#################################################################################

all: venv save git

venv:
	Rscript -e 'install.packages("renv", repos = "https://cloud.r-project.org/")'
	Rscript -e 'renv::init(bare = TRUE)'
	Rscript -e 'renv::install("dlookr")'
	Rscript -e 'renv::install("glue")'
	Rscript -e 'renv::install("here")'
	Rscript -e 'renv::install("readr")'

save:
	Rscript -e 'renv::snapshot()'

load:
	Rscript -e 'renv::restore()'

git:
	git init
	git add --all
	git commit -m "initial commit"
2 changes: 0 additions & 2 deletions {{cookiecutter.project_directory_name}}/config.yml

This file was deleted.

@@ -0,0 +1,7 @@
---
title: "Presentation"
author: "{{cookiecutter.project_manager_name}}"
format: revealjs
---

## Introduction
Empty file.
@@ -0,0 +1,22 @@
# Load Libraries
```{r message=FALSE}
library(dlookr)
library(glue)
library(here)
library(readr)
```

# Setup
```{r}
data_name <- "<data-name>"
```

# Read Data
```{r}
clean_data <- read_rds(here(glue("data/clean/{data_name}.rds")))
```

# Describe Data
```{r}
diagnose_web_report(clean_data)
```
@@ -0,0 +1,15 @@
#%% Load Libraries
import pandas
from pathlib import Path
from ydata_profiling import ProfileReport

#%% Setup
root_path = Path(__file__).parent.parent.parent
data_name = "<data-name>"

#%% Read Data
clean_data = pandas.read_pickle(root_path / f"data/clean/{data_name}.pkl")

#%% Describe Data
profile = ProfileReport(clean_data, title="Description Report")
profile.to_notebook_iframe()
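The template locates the project root by stepping up three parents from the script, which ties it to the script's depth in the tree. One alternative, sketched here under the assumption that `README.md` exists only at the project root, is to walk upwards until a marker file is found. The helper `find_project_root` is hypothetical and not part of the template.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "README.md") -> Path:
    """Walk upwards from start until a directory containing marker is found."""
    current = start.resolve()
    if not current.is_dir():
        # start points at a file (e.g. a script), so begin in its directory.
        current = current.parent
    for candidate in [current, *current.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker} found at or above {start}")
```

In a script this could replace the `root_path` line with `root_path = find_project_root(Path(__file__))`, so the script keeps working if it is moved to a different depth.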
10 changes: 0 additions & 10 deletions {{cookiecutter.project_directory_name}}/reports/clean/clean.py

This file was deleted.

Empty file.
Empty file.
@@ -0,0 +1,22 @@
# Load Libraries
```{r message=FALSE}
library(dlookr)
library(glue)
library(here)
library(readr)
```

# Setup
```{r}
data_name <- "<data-name>"
```

# Read Data
```{r}
wrangle_data <- read_rds(here(glue("data/wrangle/{data_name}.rds")))
```

# Explore Data
```{r}
eda_web_report(wrangle_data)
```
@@ -0,0 +1,15 @@
#%% Load Libraries
import pandas
from pathlib import Path
from ydata_profiling import ProfileReport

#%% Setup
root_path = Path(__file__).parent.parent.parent
data_name = "<data-name>"

#%% Read Data
wrangle_data = pandas.read_pickle(root_path / f"data/wrangle/{data_name}.pkl")

#%% Explore Data
profile = ProfileReport(wrangle_data, title="Exploration Report")
profile.to_notebook_iframe()