The IMRaD-like Python project templates for data science projects.
In the data science domain, projects are sometimes shared as an informal assemblage of scripts. This repository proposes two layouts for organizing a data science project, both modeled on IMRaD (Introduction, Methods, Results, and Discussion). The "Informal IMRaD-like Layout" is a Python project organized into materials, methods, and results directories. The "Formal IMRaD-like Flat Layout" is a conventional installable Python flat-layout project that can be built, distributed as a package, and published to PyPI.
The Informal IMRaD-like Layout is a useful and intuitive project layout. It also serves as an introduction to the IMRaD-like layouts for data science projects. It isn't installable; however, its IMRaD-like naming convention makes its organization immediately recognizable to persons working in the science domains.
- A simple IMRaD-like layout.
You can create an informal layout project using the Cookiecutter.
├── project ⬅ This is the project directory.
│
├── materials ⬅ You can put your datasets and models in the materials directory.
│ │
│ └── README.md
│
├── methods ⬅ You can put your utility functions and notebooks in the methods directory.
│ │
│ ├── main.ipynb
│ │
│ └── README.md
│
├── results ⬅ You can put the outputs of your scripts (e.g., tables and visualizations) in the results directory.
│ │
│ └── README.md
│
├── requirements.txt ⬅ You can specify your project requirements in this file.
│
├── README.md ⬅ You can put your Introduction and Discussion in the README.md file.
│
└── .gitignore
pip install cookiecutter
cookiecutter https://github.com/faranalytics/data_science_project.git --checkout informal_layout_cookiecutter
The Formal IMRaD-like Flat Layout project template describes an approach for organizing your data science project using a conventional Python "flat-layout" project layout. It follows formal conventions for packaging a Python project. You install it into your environment just like an ordinary Python package. It consists of a single package with an IMRaD-like layout; it contains materials, methods, and results sub-packages. Project dependencies are specified in the pyproject.toml file.
One important advantage of this approach is that utility functions can be conveniently imported into notebooks from anywhere in the package. Imports work without modifying sys.path or setting the PYTHONPATH environment variable.
- An easily recognizable formal Python package layout
- Define your dependencies using formal packaging conventions
- Seamless imports from anywhere in your package
- Relative package imports from within notebooks
- Pipeline definitions
You can clone this repository and follow this short tutorial in order to explore the project layout. If you want to start a new project, you can create a project using the Cookiecutter.
In this example you will clone the repository, explore its layout, install it, and run an example notebook.
git clone --branch formal_flat_layout https://github.com/faranalytics/data_science_project.git
This is the top-level directory of a conventional Python package.
cd data_science_project/project
This is a conventional flat-layout Python project. The project follows all the conventions of a formal Python project.
├── project ⬅ This is the project directory. Optionally choose a name for your project.
|
├── __about__.py
|
├── LICENSE
|
├── package ⬅ This is the package directory. Optionally give the package a unique name.
| |
│ ├── __init__.py
| |
| ├── __main__.py ⬅ You can define your package's pipeline in the __main__.py module.
| |
│ ├── materials ⬅ You can put your datasets and models in the materials directory.
| | |
│ │ ├── __init__.py
| | |
│ │ └── README.md
| |
│ ├── methods ⬅ You can put your utility functions and notebooks in the methods directory.
| | |
│ │ ├── __init__.py
| | |
│ │ ├── notebooks
| | | |
│ │ │ └── main.ipynb
| | |
│ │ ├── README.md
| | |
│ │ └── utils.py
| |
│ └── results ⬅ You can put the outputs of your scripts (e.g., tables and visualizations) in the results directory.
| |
│ ├── __init__.py
| |
│ └── README.md
|
├── pyproject.toml ⬅ The project is configured to use the Hatch project manager.
|
└── README.md
conda activate <your-environment>
An editable install, also known as a development install, will make changes to your package modules immediately available when you restart your kernel.
pip install -e .
You have installed a Python package named package. Once you complete the tutorial, you can uninstall it using pip.
pip uninstall package
from package.materials import MATERIALS_PATH
from package.methods import METHODS_PATH
from package.results import RESULTS_PATH
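This README does not show how MATERIALS_PATH, METHODS_PATH, and RESULTS_PATH are defined. One plausible definition, assuming each sub-package's __init__.py simply exposes its own directory, is sketched below. The helper name directory_of is hypothetical and is not taken from the repository.

```python
from pathlib import Path


def directory_of(module_file: str) -> Path:
    """Return the absolute directory that contains the given module file."""
    return Path(module_file).resolve().parent


# Assumed usage inside package/materials/__init__.py:
#     MATERIALS_PATH = directory_of(__file__)
# The methods and results sub-packages could define METHODS_PATH and
# RESULTS_PATH in the same way.
```

Because such a constant is derived from the installed module's location rather than the current working directory, a notebook could use it regardless of where the kernel was started.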
Set the __package__ attribute and use a relative import to import a utility function from the methods sub-package.
__package__ = "package.methods.notebooks"
from ..utils import say_hello
print(say_hello())
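Python resolves a relative import against the __package__ value of the importing namespace, and a notebook cell has none by default, which is why the cell sets it first. The sketch below reproduces the mechanism outside Jupyter. It builds a throwaway package (hypothetically named demo_pkg, with a say_hello that returns "Hello") and executes the same two statements in a fresh namespace. All names here are illustrative assumptions, not the repository's code.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_relative_import() -> str:
    """Build a temporary package and run a notebook-style cell against it."""
    with tempfile.TemporaryDirectory() as tmp:
        methods = Path(tmp, "demo_pkg", "methods")
        methods.mkdir(parents=True)
        Path(tmp, "demo_pkg", "__init__.py").write_text("")
        (methods / "__init__.py").write_text("")
        (methods / "utils.py").write_text("def say_hello():\n    return 'Hello'\n")

        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            # The cell's namespace has no parent package until the cell
            # declares one; ".." then resolves to demo_pkg.methods.
            namespace = {"__name__": "__main__"}
            cell = (
                '__package__ = "demo_pkg.methods.notebooks"\n'
                "from ..utils import say_hello\n"
                "greeting = say_hello()\n"
            )
            exec(cell, namespace)
            return namespace["greeting"]
        finally:
            # Undo the changes made to the interpreter's import state.
            sys.path.remove(tmp)
            for name in [n for n in sys.modules if n.split(".")[0] == "demo_pkg"]:
                del sys.modules[name]
```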
from package.methods.utils import say_hello
print(say_hello())
Read data from the MATERIALS_PATH, transform it into a list of lists, write the data to the RESULTS_PATH, and print it to the notebook output cell.
from pprint import pprint
import pickle
# Read `iris.data` from the MATERIALS_PATH, dropping the file's last line.
with open(MATERIALS_PATH.joinpath("iris/iris.data")) as f:
    data = [line.strip().split(",") for line in f.readlines()[:-1]]
# Write the `iris.data.pkl` table to RESULTS_PATH.
with open(RESULTS_PATH.joinpath("iris.data.pkl"), "wb") as f:
    pickle.dump(data, f)
pprint(data)
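The read, transform, and write steps above can also be factored into small functions that take explicit paths, which makes them easy to call from a utility module. This is a minimal sketch, assuming the input is plain comma-separated text; the function names are hypothetical. Note that it differs from the cell above: it skips blank lines wherever they occur instead of dropping the last line.

```python
import csv
import pickle
from pathlib import Path
from typing import List


def read_rows(path: Path) -> List[List[str]]:
    """Parse a comma-separated file into a list of rows, skipping blank lines."""
    with open(path, newline="") as f:
        return [row for row in csv.reader(f) if row]


def write_pickle(rows: List[List[str]], path: Path) -> None:
    """Serialize the rows to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(rows, f)


def read_pickle(path: Path) -> List[List[str]]:
    """Load rows previously written by write_pickle."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

In a notebook, these might be called as read_rows(MATERIALS_PATH / "iris/iris.data") and write_pickle(rows, RESULTS_PATH / "iris.data.pkl").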
The example project contains a pipeline defined in __main__.py. You can run the pipeline by running the installed package module. It uses the papermill package to run the contents of /project/package/methods/notebooks/main.ipynb. It prints the first 10 lines of the iris dataset to the console.
python -m package
You can use the Cookiecutter package to create a customized instance of The Data Science Project.
pip install cookiecutter
cookiecutter https://github.com/faranalytics/data_science_project.git --checkout cookiecutter
[1/6] project_name (project): project
[2/6] package_name (package): package
[3/6] repository_url (https://github.com/pypa/sampleproject): https://github.com/pypa/sampleproject
[4/6] package_description (A small example package): A small example package
[5/6] author_name (Example Author): Example Author
[6/6] author_email (author@example.com): author@example.com
You can give your project and package the same name.
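If you prefer to script project creation instead of answering the prompts, Cookiecutter also exposes a Python entry point. The sketch below assumes the cookiecutter package is installed and that the template accepts the project_name and package_name variables shown in the prompts above. The two function names are hypothetical.

```python
def template_context(project_name: str, package_name: str) -> dict:
    """Answers for the first two prompts; the other prompts keep their defaults."""
    return {"project_name": project_name, "package_name": package_name}


def create_project(project_name: str, package_name: str) -> None:
    """Generate a project in the current directory without interactive prompts."""
    # Imported lazily so this module loads even when cookiecutter is absent.
    from cookiecutter.main import cookiecutter

    cookiecutter(
        "https://github.com/faranalytics/data_science_project.git",
        checkout="cookiecutter",  # same branch as the --checkout option above
        no_input=True,
        extra_context=template_context(project_name, package_name),
    )
```

Calling create_project("iris_study", "iris_study") would be the scripted equivalent of giving the project and package the same name.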
This is the top-level directory of a conventional Python package.
cd <my_project_name>
conda activate <your-environment>
An editable install, also known as a development install, will make changes to your package modules immediately available when you restart your kernel.
pip install -e .
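To confirm that the editable install points at your working tree, you can ask the import system where it would load the package from. This is a small sketch using only the standard library; installed_location is a hypothetical helper name, and you would pass your own package name to it.

```python
import importlib.util
from typing import Optional


def installed_location(package_name: str) -> Optional[str]:
    """Return the file the import system would load for a top-level package.

    Returns None when no package with that name can be found.
    """
    spec = importlib.util.find_spec(package_name)
    return None if spec is None else spec.origin
```

After an editable install, the returned path should normally lie inside your project directory rather than inside site-packages.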
You can add dependencies to your project by modifying the dependencies section of the pyproject.toml. You can include the pandas package, for example, by adding it to the list of dependencies.
pyproject.toml
...
dependencies = [
"hatch",
"ipykernel",
"pandas>=2, <3"
]
...
pip install -e .
You can use __main__.py to define your project's pipeline. Once your package is installed and your pipeline is defined in your __main__.py module, you can run your package's pipeline using the -m option.
python -m <your-package-name>
The __main__.py module in this repository shows how you can use papermill to easily construct a notebook pipeline.
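As a rough illustration of what such a module might contain, the sketch below runs one notebook with papermill's execute_notebook function. It is not the repository's actual __main__.py. The output file name and both function names are assumptions, and papermill must be installed for run_pipeline to work.

```python
from pathlib import Path
from typing import Tuple


def notebook_paths(package_dir: Path) -> Tuple[Path, Path]:
    """Return the input notebook and the path for its executed copy."""
    source = package_dir / "methods" / "notebooks" / "main.ipynb"
    # Hypothetical location for the executed notebook.
    output = package_dir / "results" / "main.output.ipynb"
    return source, output


def run_pipeline(package_dir: Path) -> None:
    """Execute the notebook from top to bottom and save the executed copy."""
    # Imported lazily; requires `pip install papermill`.
    import papermill as pm

    source, output = notebook_paths(package_dir)
    pm.execute_notebook(str(source), str(output))


# Inside __main__.py the pipeline might be started with:
#     run_pipeline(Path(__file__).resolve().parent)
```

A longer pipeline could call execute_notebook once per notebook, in the order the analysis requires.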
You can publish your package by following the instructions in the tutorial. Alternatively, you can use the Hatch CLI to build and publish your project.