datacker

Python 3.7 | Python 3.8 | Code style: black

Convert your notebooks to runnable Docker images. The quickest way to bring data scientists' work to production.

Introduction

Datacker creates Docker images from one or more Jupyter Notebooks. You can also add a requirements.txt listing the code dependencies. The resulting Docker image executes the notebook by itself and stores the executed notebook in a directory that can be bind mounted to a directory on the host machine for persistence (in future versions, the result could be stored in the cloud, in S3 for example). Parameters can be passed to the notebook dynamically.

Install

$ pip install datacker

Usage

$ datacker --help
Usage: datacker [OPTIONS] IMAGE_NAME NOTEBOOKS...

Arguments:
  IMAGE_NAME    [required]
  NOTEBOOKS...  [required]

Options:
  -r, --requirements TEXT         Path to requirements file
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.

  --help                          Show this message and exit.

Example

Using the example from examples/parameters:

Build the Docker image:

datacker datacker_parameters pie.ipynb -r requirements.txt

Run:

docker run --env NOTEBOOK_NAME=pie \
  --env PARAMETERS='{ "sizes": [40, 10, 20, 30], "explode": [0.1, 0, 0, 0] }' \
  --env PARAM_labels='["Cat", "Cactus", "Cattle", "Camel"]' \
  --mount type=bind,src=${PWD}/output,dst=/output \
  datacker_parameters

The name of the notebook is passed in the environment variable NOTEBOOK_NAME.

This example shows two ways to pass parameters to a notebook. The first is the environment variable PARAMETERS, which accepts a JSON object containing the parameters. The second is environment variables named PARAM_[var_name], each of which accepts a single value in its Python representation: a float as '3.1415', a string as '"Hello World!"', and so on. Variables defined with PARAM_[var_name] take priority over the same names in PARAMETERS.
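The priority rule can be illustrated with a minimal sketch. This is not Datacker's actual implementation: the helper name collect_parameters is hypothetical, and parsing PARAM_ values with ast.literal_eval is an assumption about what "Python representation" means here.

```python
import ast
import json


def collect_parameters(environ):
    """Merge notebook parameters from a mapping of environment variables.

    Hypothetical helper for illustration; not part of Datacker's API.
    """
    # Start with the JSON object stored in PARAMETERS, if any.
    params = json.loads(environ.get("PARAMETERS", "{}"))
    # Apply PARAM_<name> variables afterwards so they override
    # values that came from PARAMETERS.
    prefix = "PARAM_"
    for key, value in environ.items():
        if key.startswith(prefix) and len(key) > len(prefix):
            # Assumption: values are plain Python literals.
            params[key[len(prefix):]] = ast.literal_eval(value)
    return params


env = {
    "PARAMETERS": '{"sizes": [40, 10, 20, 30], "explode": [0.1, 0, 0, 0]}',
    "PARAM_labels": '["Cat", "Cactus", "Cattle", "Camel"]',
    "PARAM_sizes": "[25, 25, 25, 25]",  # overrides "sizes" from PARAMETERS
}
merged = collect_parameters(env)
```

In this sketch, merged keeps explode from PARAMETERS, takes labels from PARAM_labels, and uses the PARAM_sizes value instead of the one in the JSON object.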

The results will be stored in the output directory on the host.
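When the container is launched from a script rather than a shell, the same docker run invocation can be assembled in Python. The sketch below only builds the argument list; build_run_command and its arguments are hypothetical names, and only the environment variables and the /output mount target come from the example above.

```python
import json
import os


def build_run_command(image, notebook, parameters=None, overrides=None,
                      output_dir="output"):
    """Return a `docker run` argument list for a Datacker image.

    Hypothetical helper for illustration; not part of Datacker.
    """
    cmd = ["docker", "run", "--env", f"NOTEBOOK_NAME={notebook}"]
    if parameters:
        # PARAMETERS carries all parameters as one JSON object.
        cmd += ["--env", "PARAMETERS=" + json.dumps(parameters)]
    for name, value in (overrides or {}).items():
        # PARAM_<name> carries one value in its Python representation.
        cmd += ["--env", f"PARAM_{name}={value!r}"]
    # Bind mount the host output directory to /output in the container.
    src = os.path.abspath(output_dir)
    cmd += ["--mount", f"type=bind,src={src},dst=/output", image]
    return cmd


cmd = build_run_command(
    "datacker_parameters",
    "pie",
    parameters={"sizes": [40, 10, 20, 30], "explode": [0.1, 0, 0, 0]},
    overrides={"labels": ["Cat", "Cactus", "Cattle", "Camel"]},
)
```

The list can then be passed to subprocess.run(cmd, check=True). Passing a list avoids shell quoting problems with the JSON value. With --mount type=bind the host directory must already exist, so create it first (for example with os.makedirs(output_dir, exist_ok=True)).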

Parameterizing a Notebook

Before building a Datacker image, you need to set up your notebook if you want to pass parameters at run time. Mark one cell with the tag parameters. This cell holds the variables with their default values. See the notebooks in the examples directory.

To learn how to add the tag to a cell, see "How should I add cell tags and metadata to my notebooks?".
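The tag can also be added without opening the notebook editor, because an .ipynb file is JSON and cell tags live under each cell's metadata.tags list. The following is a minimal sketch using only the standard library; tag_parameters_cell is a hypothetical helper, and the in-memory notebook with its default values is made up for illustration.

```python
import json


def tag_parameters_cell(notebook, cell_index=0):
    """Add the "parameters" tag to one cell of a notebook dict.

    Hypothetical helper; `notebook` is the parsed JSON of an .ipynb file.
    """
    cell = notebook["cells"][cell_index]
    tags = cell.setdefault("metadata", {}).setdefault("tags", [])
    if "parameters" not in tags:
        tags.append("parameters")
    return notebook


# Minimal in-memory notebook whose first cell holds the default values.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": [
                "sizes = [25, 25, 25, 25]\n",
                "labels = ['A', 'B', 'C', 'D']",
            ],
        }
    ],
}
tag_parameters_cell(notebook)
serialized = json.dumps(notebook, indent=1)
```

For a notebook on disk, load it with json.load, call the helper, and write it back with json.dump before building the image.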

Roadmap

  • Store results in the cloud (S3, Azure, ...).
  • Option to send the result of the execution to stdout as markdown.
  • Deploying to Kubernetes.
  • Documentation.