Run R scripts with pytask.
pytask-r is available on PyPI and Anaconda.org. Install it with
$ pip install pytask-r
# or
$ conda install -c conda-forge pytask-r
You also need to have R installed and Rscript available on your command line. To test it, run the following command:
Rscript --help
If an error is shown instead of a help page, you can install R with conda.
conda install -c conda-forge r-base
Or install R from the official R Project.
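The availability check can also be scripted from Python. The following is a minimal sketch that uses only the standard library; the helper name rscript_available is hypothetical and not part of pytask-r.

```python
import shutil


def rscript_available(executable: str = "Rscript") -> bool:
    """Return True if the executable can be found on the PATH."""
    return shutil.which(executable) is not None


if rscript_available():
    print("Rscript found at", shutil.which("Rscript"))
else:
    print("Rscript not found; install R first.")
```

shutil.which only looks the executable up on the PATH. It does not verify that the R installation itself works, so running Rscript --help remains the more thorough test.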
To create a task that runs an R script, define a task function with the @mark.r decorator. The script keyword provides an absolute path or a path relative to the task module.
from pathlib import Path
from pytask import mark

@mark.r(script=Path("script.r"))
def task_run_r_script(produces: Path = Path("out.rds")):
    pass
If you are wondering why the function body is empty, know that pytask-r replaces the body with a predefined internal function. See the section on implementation details for more information.
Dependencies and products can be added as usual. See this tutorial for some help.
To access the paths of dependencies and products in the script, pytask-r stores the information by default in a .json file. The path to this file is passed as a positional argument to the script. Inside the script, you can read the information.
library(jsonlite)
args <- commandArgs(trailingOnly=TRUE)
path_to_json <- args[length(args)]
config <- read_json(path_to_json)
config$produces # Is the path to the output file, here out.rds.
The .json file is stored in the same folder as the task in a .pytask directory.
To parse the JSON file, you need to install jsonlite.
You can also pass any other information to your script by using the @task decorator.
from pathlib import Path
from pytask import mark, task

@task(kwargs={"number": 1})
@mark.r(script=Path("script.r"))
def task_run_r_script(produces: Path = Path("out.rds")):
    pass
and inside the script use
config$number # Is 1.
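To get a feel for what the script receives, the following sketch mimics the round trip in plain Python. It is an illustration under assumptions, not pytask-r's actual implementation: the file name example.json, the temporary directory, and the exact layout of the file are hypothetical. Only the keys produces and number come from the example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical keyword arguments of the task, including the product.
kwargs = {"produces": Path("out.rds"), "number": 1}

with tempfile.TemporaryDirectory() as tmp:
    path_to_json = Path(tmp) / "example.json"
    # Path objects are not JSON-serializable, so convert them with str.
    path_to_json.write_text(json.dumps(kwargs, default=str))

    # This step mirrors read_json(path_to_json) in the R script.
    config = json.loads(path_to_json.read_text())

print(config["produces"])
print(config["number"])
```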
In case a task throws an error, you might want to execute the script independently from pytask. After a failed execution, you see the command that executed the R script in the report of the task. It looks roughly like this
Rscript <options> script.r <path-to>/.pytask/pytask-r/<uuid4>.json
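If you prefer to rerun the script from Python rather than from the shell, the command can be rebuilt as in the sketch below. The helpers build_rscript_command and rerun are hypothetical and not part of pytask-r, and the file name example.json is a placeholder for the actual file shown in the report.

```python
import subprocess
from pathlib import Path


def build_rscript_command(script, path_to_json, options=()):
    """Assemble the command in the order shown in the task report."""
    return ["Rscript", *options, str(script), str(path_to_json)]


def rerun(script, path_to_json, options=()):
    """Run the script outside of pytask and return the completed process."""
    cmd = build_rscript_command(script, path_to_json, options)
    # check=False returns the result even if the script fails, so that
    # returncode and stderr can be inspected afterwards.
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


command = build_rscript_command(
    Path("script.r"),
    Path(".pytask") / "pytask-r" / "example.json",  # placeholder file name
    options=["--vanilla"],
)
print(command)
```

Calling rerun with the same arguments executes the command and requires Rscript on the command line. The stderr attribute of the result then contains the R error message.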
The decorator can also pass command line arguments to Rscript through the options keyword. See the following example.
@mark.r(script=Path("script.r"), options="--vanilla")
def task_run_r_script(produces: Path = Path("out.rds")):
    pass
You can also repeat the execution of tasks, meaning executing multiple R scripts or passing different command line arguments to the same R script.
The following task executes two R scripts, script_0.r and script_1.r, which produce different outputs.
for i in range(2):

    @task
    @mark.r(script=Path(f"script_{i}.r"))
    def task_execute_r_script(produces: Path = Path(f"out_{i}.csv")):
        pass
If you want to pass different inputs to the same R script, pass these arguments with the kwargs keyword of the @task decorator.
for i in range(2):

    @task(kwargs={"i": i})
    @mark.r(script=Path("script.r"))
    def task_execute_r_script(produces: Path = Path(f"output_{i}.csv")):
        pass
and inside the script access the argument i with
library(jsonlite)
args <- commandArgs(trailingOnly=TRUE)
path_to_json <- args[length(args)]
config <- read_json(path_to_json)
config$produces # Is the path to the output file "../output_{i}.csv".
config$i # Is the number.
You can also serialize your data with any other tool you like. By default, pytask-r also supports YAML (if PyYAML is installed).
Use the serializer keyword argument of the @pytask.mark.r decorator with
@mark.r(script=Path("script.r"), serializer="yaml")
def task_example(): ...
And, in your R script use
library(yaml)
args <- commandArgs(trailingOnly=TRUE)
config <- read_yaml(args[length(args)])
Note that the R package yaml needs to be installed.
If you need a custom serializer, you can also pass any callable to serializer that transforms data into a string. Use suffix to set the correct file ending.
Here is a replication of the JSON example.
import json
@mark.r(script=Path("script.r"), serializer=json.dumps, suffix=".json")
def task_example(): ...
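A custom serializer is just a function from data to a string. As a sketch, the hypothetical function pretty_json below produces indented JSON with sorted keys and falls back to str for values that JSON cannot represent, such as Path objects. Whether pytask-r already converts such values before calling the serializer is not assumed here.

```python
import json
from pathlib import Path


def pretty_json(data) -> str:
    """Serialize data to indented, key-sorted JSON; other types fall back to str."""
    return json.dumps(data, indent=2, sort_keys=True, default=str)


# Illustrative input only; the real data is assembled by pytask-r.
example = pretty_json({"produces": Path("out.rds"), "number": 1})
print(example)
```

The function could then be passed as serializer=pretty_json together with suffix=".json". The R script can keep using read_json because the output is still valid JSON.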
You can influence the default behavior of pytask-r with configuration values.
r_serializer
Use this option to change the default serializer.
[tool.pytask.ini_options]
r_serializer = "json"
r_suffix
Use this option to set the default suffix of the file which contains serialized paths to dependencies, products and more.
[tool.pytask.ini_options]
r_suffix = ".json"
r_options
Use this option to set default options for each task which are separated by whitespace.
[tool.pytask.ini_options]
r_options = ["--vanilla"]
Consult the release notes to find out about what is new.