A quick guide to demystify the use of Makefiles in Data Science

mapsa/makefile-examples

Makefile Essentials for Data Science Projects

A set of notes and Makefile examples.

Table of Contents

  1. Uses
  2. Basic Concepts
  3. Special Targets
  4. Automatic Variables
  5. Text Functions
  6. Execution
  7. Debugging
  8. More Elegant Options
  9. Standard Targets
  10. Non-standard Targets
  11. Examples
  12. References

Uses

  1. Reproducible Research: useful for sharing a complete analysis (code, data, workflows, report) with collaborators and readers of a final article.
  2. Task Dependency Management: Make determines which targets need to be rebuilt based on changes to their dependencies, so you save time by not rerunning the entire pipeline after every change.
  3. Pipeline Documentation: By explicitly recording the inputs to and outputs from steps in the analysis and the dependencies between files, Makefiles act as a type of documentation, reducing the number of things we have to remember.

Basic Concepts

Make is a build automation tool that builds targets by following rules. Each rule is made of:

  1. Target: what to build (a file or a phony target)
  2. Recipe: the commands that build the target
  3. Prerequisites (optional): the files or targets that the target depends on

index.html: dashboard.py stats.csv
<tab>   python dashboard.py stats.csv
<tab>   echo index created

stats.csv: stats.py data.csv
<tab>   python stats.py data.csv

To perform a build, make constructs a directed acyclic graph (DAG) from the rules.

graph BT;
    dashboard.py --> index.html;
    stats.csv --> index.html;
    stats.py --> stats.csv
    data.csv --> stats.csv

By default, when you type make it will try to find a Makefile with the following names, in order: GNUmakefile, makefile and Makefile (the most common one).

You can give the makefile a different name, but then you need to run it as make -f mymakefile.

Special Targets

.PHONY

The prerequisites of the special target .PHONY are considered to be phony targets. When it is time to consider such a target, make will run its recipe unconditionally, regardless of whether a file with that name exists or what its last-modification time is.

.PHONY: all target1 target2 target3 clean

OUTDIR = output

all: target1 target2 

target1: prerequisite1
<tab>   command_A

target2: prerequisite1
<tab>   command_B

clean:
<tab>   rm -rf $(OUTDIR)

.EXPORT_ALL_VARIABLES

Simply by being mentioned as a target, this tells make to export all variables to child processes by default.
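
A minimal sketch of the effect, assuming a hypothetical variable DATA_DIR and that python3 is on the PATH. The recipe starts a child process and reads the variable from its environment:

```makefile
.EXPORT_ALL_VARIABLES:

# DATA_DIR is a hypothetical variable used only for illustration
DATA_DIR = data

.PHONY: show
show:
	@python3 -c "import os; print(os.environ.get('DATA_DIR'))"
```

Without the .EXPORT_ALL_VARIABLES line, the child process would only see DATA_DIR if it was already set in the environment that make was started from.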

.DELETE_ON_ERROR

Delete the target of a rule if it has changed and its recipe exits with a nonzero exit status.
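
This matters most when a recipe writes its target through a shell redirect, because the shell creates the file before the command has finished. A sketch, assuming (unlike the earlier example) that stats.py writes its result to standard output:

```makefile
.DELETE_ON_ERROR:

# Assumption: stats.py prints the statistics to standard output
stats.csv: stats.py data.csv
	python stats.py data.csv > stats.csv
```

If stats.py fails halfway, a partial stats.csv is left on disk. Without .DELETE_ON_ERROR, the next run of make would treat that file as up to date.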

.ONESHELL

When a target is built, all lines of its recipe are passed to a single invocation of the shell instead of one shell per line.
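
A small sketch of the difference, using a hypothetical target named where. Because the whole recipe runs in one shell, the directory change on the second line is still in effect on the third:

```makefile
.ONESHELL:

.PHONY: where
where:
	mkdir -p output
	cd output
	pwd
```

Without .ONESHELL, each line runs in its own shell, so pwd would report the directory make was started in rather than output.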

.DEFAULT_GOAL

By default, the goal is the first target in the makefile. Set the special variable .DEFAULT_GOAL to change this behaviour.
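
For instance, in the following sketch (target names are illustrative), running make with no arguments builds report even though clean comes first in the file:

```makefile
.DEFAULT_GOAL := report

.PHONY: clean report

clean:
	rm -rf output

report:
	@echo "building the report"
```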

Automatic Variables

$@

The file name of the target of the rule.

target1: prerequisite1
<tab>   echo $@

Will print target1.

$<

The name of the first prerequisite.

target1: prerequisite1 prerequisite2
<tab>   echo $<

Will print prerequisite1.

$*

The stem with which an implicit rule matches.

$(OUTDIR)/my_%_file.csv: prerequisite1 
<tab>   echo $*

When make builds the target $(OUTDIR)/my_first_file.csv, this prints first.
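
These automatic variables are most often used together in a pattern rule. A minimal sketch, assuming a hypothetical clean.py script that takes a CSV path and writes the cleaned data to standard output, with raw inputs stored in a data directory:

```makefile
OUTDIR = output

# Matches any target of the form output/<stem>.csv
$(OUTDIR)/%.csv: data/%.csv clean.py
	mkdir -p $(OUTDIR)
	@echo "building $@ from $< (stem: $*)"
	python clean.py $< > $@
```

With this rule, make output/sales.csv expands $@ to output/sales.csv, $< to data/sales.csv and $* to sales.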

Text Functions

Wildcards

CSVS = $(wildcard *.csv)

String Substitution

Do not add spaces after the commas, because they become part of the argument:

$(subst apples,oranges,I love apples)
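
The effect of a stray space can be checked with a short sketch. The variable names GOOD and BAD are only for illustration, and the brackets make the whitespace visible:

```makefile
# No space after the commas: the replacement text is exactly "oranges"
GOOD := $(subst apples,oranges,I love apples)

# A space after the first comma makes the replacement text " oranges"
BAD := $(subst apples, oranges,I love apples)

$(info [$(GOOD)])
$(info [$(BAD)])

.PHONY: all
all:
	@:
```

The second line printed should contain two spaces between love and oranges, because the extra space was kept as part of the replacement.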

Pattern Substitution

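INPUTDIR = data
OUTPUTDIR = output
# File names, without the directory part, of the CSV files found in INPUTDIR
CSVS = $(notdir $(wildcard $(INPUTDIR)/*.csv))
INPUTFILES = $(CSVS:%.csv=$(INPUTDIR)/%.csv)
OUTPUTFILES = $(CSVS:%.csv=$(OUTPUTDIR)/%.csv)

The substitution reference $(VAR:pattern=replacement) is shorthand for patsubst, so the last two lines are equivalent to:

INPUTFILES = $(patsubst %.csv,$(INPUTDIR)/%.csv,$(CSVS))
OUTPUTFILES = $(patsubst %.csv,$(OUTPUTDIR)/%.csv,$(CSVS))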

Execution

Parallel Execution

Use -j N to run up to N recipes in parallel. With -j and no number, make places no limit on the number of simultaneous jobs, so the only practical limit is the number of CPUs and the RAM available.

make -j
make -j N
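
Parallel execution only helps when targets do not depend on each other. In the sketch below, the file names are hypothetical and stats.py is assumed to write to standard output. The two CSV targets share no prerequisites that make has to build, so make -j 2 can run both recipes at once:

```makefile
.PHONY: all
all: stats_a.csv stats_b.csv

# Each stats_<name>.csv depends only on its own data_<name>.csv
stats_%.csv: data_%.csv stats.py
	python stats.py $< > $@
```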

Always make

Forces make to rebuild all targets, even those that are already up to date.

make target1 -B

Keep Going

Continue as much as possible after an error.

make target1 -k

Debugging

Print a variable

$(info $(MYVAR))
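
$(info ...) is evaluated while make reads the makefile, so it can sit outside any rule. A small sketch, with MYVAR set to an illustrative value:

```makefile
MYVAR = $(wildcard *.csv)

# Printed when the makefile is parsed, before any recipe runs
$(info MYVAR is [$(MYVAR)])

.PHONY: all
all:
	@echo "recipes start here"
```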

Dry run: Use the "just print" option

make -n

or combine it with the always make option

make -Bn

More Elegant Options

  • Use @ before a command to stop make from echoing the command itself (the command's own output is still shown)
  • Define your programs as variables

PYTHON = @python3
R = @Rscript

target1:
<tab>   $(R) myscript.R

target2:
<tab>   $(PYTHON) myscript.py

Standard Targets

  • all: Make all the top-level targets the makefile knows about.
  • clean: Delete all files that are normally created by running make.
  • install: Generally copies the executable file into a directory that users typically search for commands.
  • test: Perform self tests on the program this makefile builds.
Non-standard Targets

  • venv: creates a virtual environment

  • help: useful for making a Makefile self-documenting.

    .PHONY: help
    help:
    <tab>   @echo "Run a simulation and generate a report"
    <tab>   @echo "sim         : run only the simulation"
    <tab>   @echo "report      : generate a report"
    <tab>   @echo "clean       : delete simulation and report"
  • variables: you could also create a target to print variables.

    .PHONY : variables
    variables:
    <tab>   @echo INPUT_DIR: $(INPUT_DIR)
    <tab>   @echo CSV_FILES: $(CSV_FILES)
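
For the venv target listed above, one possible sketch follows. It assumes a Unix-like layout, a requirements.txt file in the project root and a hypothetical .venv directory. None of these names come from the repository's examples:

```makefile
VENV = .venv

# Rebuild the environment only when requirements.txt changes
$(VENV)/bin/activate: requirements.txt
	python3 -m venv $(VENV)
	$(VENV)/bin/pip install -r requirements.txt
	touch $@

.PHONY: venv
venv: $(VENV)/bin/activate
```

Tying the phony venv target to a real file lets make skip the installation step when the environment is already up to date.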

Examples

  • 01-includes: this example shows the use of includes to manage a set of scenarios as configuration files.

  • 02-quarto-params: running a quarto document accepting params defined in the Makefile.

  • 03-quarto-slides: creating slides as pdf and powerpoint from a quarto document.

  • 04-latex: compile a LaTeX document.

  • 05-functions: how to create targets dynamically using define.

  • 06-conda: create and activate conda environments

  • 07-help: how to document makefiles

  • 08-aws: useful targets to deal with AWS credentials and S3

References

Note

A website view of this repo can be seen here. The repo is available here.
