Template project structure for R&D work, primarily for client projects, based on the widely used concept of cookiecutter-data-science.
Substantially updated over the years, it is now up to date with Oreum Industries' current preferred best practices, packages, installations, structures etc.
To re-use: replace the string `oreum_template` with your `project_name`.
- Project Description, Scope, Directory Structure
- How to Install and Run on a Local Developer Machine
- Code Standards
- Notebook Standards
- Data Standards
- General Notes
This is an initial implementation of an internal project by Oreum Industries: copula-based Expected Loss Cost forecasting.
This project is:
- A work in progress (v0.y.z) and liable to breaking changes and inconveniences to the user
- Solely designed for ease of use and rapid development by employees of Oreum Industries, and selected clients with guidance
This project is not:
- Intended for public usage and will not be supported for public usage
- Intended for contributions by anyone not an employee of Oreum Industries, and unsolicited contributions will not be accepted
- Project began on 2022-01-01
- The `README.md` is MacOS and POSIX oriented
- See `LICENSE.md` for licensing and copyright details
- See `pyproject.toml` for authors, package dependencies etc
- For code repository access see GitHub
- Implementation:
  - This project is enabled by a modern, open-source, advanced software stack for data curation, statistical analysis and predictive modelling
  - This project delivers end-to-end, fully reproducible data science solutions: via notebooks, scripts, CLI & API, automated environment & package management, continuous integration, version control and rich documentation
  - Specifically we use an open-source Python-based suite of software packages, the core of which is often known as the Scientific Python stack, supported by NumFOCUS
  - Once installed (see section 2), see `LICENSES_THIRD_PARTY.md` for full details of all package licences
- Environments: this project was originally developed on a MacBook Air M2 (Apple Silicon ARM64) running MacOS 14.7 (Sonoma), using the `osx-arm64` architecture and the Accelerate framework
The repo is structured for R&D usage. The major items to be aware of are:
```
/
↳ dotfiles     - various dotfiles to configure the repo
↳ Makefile     - recipes to build the dev env
↳ README.md    - this readme file
↳ LICENSE.md   - licensing and copyright details
↳ assets/      - non-code-based images and external PDFs
↳ data/        - placeholder for data files (to be managed via Git LFS)
↳ notebooks/   - Jupyter Notebooks
↳ plots/       - output plots saved as images
↳ sql/         - SQL files
↳ src/         - Python modules
  ↳ config/    - configs if used
  ↳ dataprep/  - data transforms / feature engineering pre-model
  ↳ engine/    - classes to operate models
  ↳ model/     - classes that define statistical models
```
For local development on MacOS.
- Install Homebrew, see instructions at https://brew.sh
- Install `direnv`, `git`, `git-lfs`, `graphviz`, `zsh`
$> brew update && brew upgrade
$> brew install direnv git git-lfs graphviz zsh
Assumes `direnv`, `git`, `git-lfs`, `graphviz` and `zsh` are installed as above
$> git clone https://github.com/oreum-industries/oreum_template
$> cd oreum_template
Then allow `direnv` on MacOS to automatically run the `.envrc` file upon directory open
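As a hedged sketch (the repo's actual `.envrc` may differ), the file typically exports project-local environment variables, and `direnv` must be permitted to load it once:

```shell
# hypothetical minimal .envrc -- the repo's actual file may differ
export PROJECT_ROOT="$PWD"

# then, once, from inside the repo dir, permit direnv to load it:
# $> direnv allow .
```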
Notes:
- We use `conda` virtual envs controlled by `mamba` (quicker than `conda`)
- We install packages using `miniforge` (sourced from the `conda-forge` repo) wherever possible, and only use `pip` for packages that are handled better by `pip` and/or more up to date on PyPI
- Packages might not be the very latest, because we want stability for `pymc`, which is usually in a state of development flux
- See the cheat sheet of conda commands
- The `Makefile` creates a dev env, and will also download and preinstall `miniforge` if not yet installed on your system
From the dir above the `oreum_template/` project dir:
$> make -C oreum_template dev
This will also create some files to help confirm / diagnose successful installation:
- `dev/install_log/blas_info.txt` for the BLAS MKL installation for `numpy`
- `dev/install_log/pipdeptree[_rev].txt` lists installed package deps (and reversed)
- `LICENSES_THIRD_PARTY.md` details the license for each package used
$> make test-dev-env
This will also add files `dev/install_log/[tests_numpy|test_scipy].txt` which detail successful installation (or not) for `numpy`, `scipy`
NOTE: the quotes are required by `zsh`
$> pip install ".[plots]"
From the dir above the `oreum_template/` project dir:
$> make -C oreum_template uninstall-env
We use pre-commit to run a suite of automated tests for code linting & quality control and repo control prior to commit on local development machines.
- This is installed as part of `make dev`, which you already ran
- See `.pre-commit-config.yaml` for details
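For orientation, a `.pre-commit-config.yaml` for this kind of stack typically looks something like the following. Repo revisions and hook selection here are illustrative assumptions, not the project's actual pins:

```yaml
# illustrative sketch only -- see the repo's actual .pre-commit-config.yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4
    hooks:
      - id: ruff         # lint
      - id: ruff-format  # format
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: check-added-large-files
      - id: end-of-file-fixer
```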
We use GitHub Actions (aka Workflows) to run a suite of automated tests for commits received at the origin (i.e. GitHub)
- See `.github/workflows/*` for details
We use Git LFS to store any large files alongside the repo. This can be useful to replicate exact environments during development and/or for automated tests
- This requires a local machine install (see Getting Started)
- See `.gitattributes` for details
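As a hedged illustration of how Git LFS tracking is declared (the file patterns below are hypothetical; the repo's actual patterns live in `.gitattributes`):

```
# illustrative .gitattributes entries -- see the repo's actual file
data/*.parquet filter=lfs diff=lfs merge=lfs -text
data/*.csv     filter=lfs diff=lfs merge=lfs -text
```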
Some notes to help configure the local development environment.

Ensure your global or local git config contains your user details, e.g.:

```ini
[user]
    name = <YOUR NAME>
    email = <YOUR EMAIL ADDRESS>
```
We strongly recommend using VSCode for all development on local machines, and this is a hard pre-requisite to use the `.devcontainer` environment (see section 3).
This repo includes relevant lightweight project control and config in:
- `oreum_template.code-workspace`
- `.vscode/extensions.json`
- `.vscode/settings.json`
Even when writing R&D code, we strive to meet and exceed (even define) best practices for code quality, documentation and reproducibility for modern data science projects.
We use a suite of automated tools to check and enforce code quality. We indicate the relevant shields at the top of this README. See section 1.4 above for how this is enforced at precommit on developer machines and upon PR at the origin as part of our CI process, prior to master branch merge.
These include:
- `ruff` - extremely fast standardised linting and formatting, which replaces `black`, `flake8` and `isort`
- `interrogate` - ensures complete Python docstrings
- `bandit` - tests for common Python security issues

We also run a suite of general tests pre-packaged in `pre-commit`.
Where suitable, we break out commonly used functions and classes into module files under the `src/` directory - this gives clearer, more convenient code control than embedding that code inside notebooks. Note for clarity that we don't compile this code or release it separately from the project.
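As a hypothetical illustration (the module path and function names below are invented, not the project's actual contents), a terse single-purpose function in a module under `src/` might look like:

```python
# hypothetical example module: src/dataprep/clean.py
# (names and contents are illustrative only)

def snake_case(name: str) -> str:
    """Convert a raw column name to lowercase snake_case."""
    return "_".join(name.strip().lower().split())


def snake_case_all(names: list[str]) -> list[str]:
    """Apply snake_case to a list of raw column names."""
    return [snake_case(n) for n in names]
```

A notebook would then import this (e.g. `from src.dataprep.clean import snake_case_all`) rather than redefining the logic inline.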
General best practices for naming / ordering / structure.
Every Notebook is:
- Fully executable end-to-end, with linear non-cyclic flow
- Living documentation with extensive text and plot-based explanation
- Named starting with a 3-digit reference, with group-based ordering to indicate logical flow and dependencies, e.g.:
  - `000` series: Overview, discussion, presentational documents
  - `100` series: Data Curation
  - `200` series: Exploratory Data Analysis
  - `300` series: Model Architecture and Data Transformations
  - `400` series: Model Design, Development, Evaluation and Inference
  - `500` series: Model Finalisation for Production Use
  - `600`, `700`, `800` series: used for specific extensions if needed
  - `900` series: Demos, Notes, Worked Explanations
Live Notebooks are:
- Present in the `/notebooks` directory
- Part of the final R&D project flow, and required in order to reproduce the eventual findings & observations
- Guaranteed to be up to date with the latest code in `src/`
Rendered Notebooks are:
- Present as rendered PDFs or `reveal.js` slides in `notebooks/renders/`
- Created somewhat as-needed for offline print-based discussion with stakeholders

We use `nbconvert` to render to PDF or `reveal.js` HTML slides using configs.
From inside the `notebooks/` dir, run:
$> jupyter nbconvert --config renders/config_pdf.py
$> jupyter nbconvert --config renders/config_slides.py
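For orientation, such an nbconvert config (e.g. `renders/config_pdf.py`) is a traitlets config file. A minimal sketch, in which the notebook name and output directory are hypothetical, might be:

```python
# illustrative sketch only -- see the repo's actual renders/config_pdf.py
c = get_config()  # noqa: F821 -- injected by nbconvert at config-load time
c.NbConvertApp.notebooks = ["100_curate.ipynb"]  # hypothetical notebook name
c.NbConvertApp.export_format = "pdf"
c.FilesWriter.build_directory = "renders/"
```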
Archived Notebooks are:
- Present in `notebooks/archive/`
- No longer required, but kept around for historical audit, discussion and code examples
- May have fallen behind the latest local code and/or methods
See `data/README_DATA.md`
IMPORTANT NOTE on terminology / naming convention and dataset partitioning based on the information present and dataset usage.
Dataset terminology / partitioning / purpose:
```
|<---------- Relevant domain of all data for our analyses & models ---------->|
|<----- "Observed" historical target ------>||<- "Unobserved" future target ->|
|<----------- "Working" dataset ----------->||<----- "Forecast" dataset ----->|
|<- Training/CrossVal ->||<- Test/Holdout ->|
```
- The "Observed" historical target dataset has:
  - a known exogenous (target) feature value
  - known endogenous feature values to allow model regression
  - a hypothetical structure that we use to design the model
- The "Working" dataset is the same as this "Observed" data, and may be split into:
  - A Training/CrossVal set used to fit the model. This may be partitioned into multiple Cross-Validation sets if required by the model architecture and fitting process
  - A Test/Holdout set used to evaluate the model fit against a known target
  - We can use this Working set in full when fitting the final model for Production, because this yields the most performant model
- The "Unobserved" future target dataset has:
  - an unknown exogenous (target) feature value
  - known endogenous feature values to allow model regression
  - a hypothetical structure that we use to design the model
- The "Forecast" dataset is the same as this "Unobserved" data, and is generally what we will try to predict upon in Production
  - We might create predictions for individual datapoints or in bulk
  - If the entities in the data evolve over time (e.g. a set of policies, each with evolving premium payments and claim developments), and if the endogenous features don't evolve with time (they are static, not dynamic), then we can artificially create a Forecast dataset by extending the Working dataset forward in time

Further notes:

- We may refer to "In-Sample" and "Out-of-Sample" datasets. The former is the data used to train the model, the latter the data used to evaluate the model against a known exogenous (target) value, or to forecast an unknown exogenous (target) value. So they can be used during Working or Forecasting.
- Strictly speaking, our Bayesian modelling workflow does not require us to evaluate the model on a Test/Holdout set, because we can use in-sample Pareto-smoothed Leave-One-Out (LOO-PIT) cross-validation testing. This is more powerful, and lets us fit & evaluate the model using the full Working set.
- However, purely to aid reader comprehension and demonstrate the out-of-sample prediction workflow, we may use the practice of a known Test/Holdout set.
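The partitioning above can be sketched in a few lines of pandas. This is a minimal illustration under assumed names: the toy columns `x` (endogenous) and `y` (exogenous target), and the 25% holdout fraction, are invented for the example:

```python
# illustrative sketch of the dataset partitioning described above
import pandas as pd

df = pd.DataFrame({
    "x": range(10),                 # endogenous features: known everywhere
    "y": [1.0] * 8 + [None, None],  # exogenous (target): unknown in the future
})

obs_mask = df["y"].notna()
working = df[obs_mask]    # "Observed" historical target -> Working dataset
forecast = df[~obs_mask]  # "Unobserved" future target   -> Forecast dataset

# optionally split Working into Training/CrossVal and Test/Holdout sets
n_test = int(len(working) * 0.25)
train, test = working.iloc[:-n_test], working.iloc[-n_test:]
```

In Production, per the notes above, the model would instead be fit on the full Working set and predict upon the Forecast set.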
We aim to make this project usable by all (developer, statistician, biz):
- Logical structuring of code files with modularization and reusability
- Small purposeful classes with abstracted object inheritance, and terse single-purpose functions
- Variable and data parameterization throughout and use of config files to inject globals
- Informative naming for classes / functions / variables / data, and human-readable, well-linted code
- Specific and general error handling
- Logging with rotation / archival
- Detailed docstrings and type-hinting
- Inline comments to explain complicated code / concepts to developers
- Adherence to a consistent style guide and syntax, and use of linters
- Well-organized Notebooks with logical ordering and “run-all” internal flow, and plenty of explanatory text and commentary to guide the reader
- Use of virtual environments and/or containers
- Build scripts for continuous integration and deployment
- Unit tests and automated test scripts
- Documentation to allow full reproducibility and maintenance
- Commits have meaningful messages and small, iterative, manageable diffs to allow code review
- Adherence to conventional branching structures, management of stale branches
- Merges into master managed via pull requests (PRs) comprised of specific commits, and the PR linked to specific issue tickets
- PRs setup to trigger manual code reviews and automated hooks to code formatting, unit testing, continuous integration (inc. automated integration and regression testing) and continuous deployment
- New releases managed with tagging, fixed binaries, changelogs
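Several of the points above (logging with rotation, type hints, docstrings) can be sketched together in a few lines. File path, size and archive count here are illustrative assumptions, not project settings:

```python
# illustrative sketch of rotating-file logging per the standards above
import logging
from logging.handlers import RotatingFileHandler


def get_logger(name: str, path: str = "project.log") -> logging.Logger:
    """Return a logger that rotates its file at ~1MB, keeping 3 archives."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = RotatingFileHandler(path, maxBytes=1_000_000, backupCount=3)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```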
Copyright 2022 Oreum OÜ t/a Oreum Industries. All rights reserved. See LICENSE.md.
Oreum OÜ t/a Oreum Industries, Sepapaja 6, Tallinn, 15551, Estonia, reg.16122291, oreum.io
Oreum OÜ © 2022