Skip to content

Commit

Permalink
Merge pull request #124 from scipp/save-graph
Browse files Browse the repository at this point in the history
JSON serializer for task graphs
  • Loading branch information
jl-wynen authored Mar 13, 2024
2 parents 8467e25 + 3ef4237 commit 9b970ad
Show file tree
Hide file tree
Showing 26 changed files with 1,422 additions and 112 deletions.
3 changes: 2 additions & 1 deletion docs/api-reference/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@
scheduler.Scheduler
scheduler.DaskScheduler
scheduler.NaiveScheduler
task_graph.TaskGraph
TaskGraph
```

## Exceptions
Expand All @@ -41,5 +41,6 @@
:template: module-template.rst
:recursive:
serialize
sphinxext
```
1 change: 1 addition & 0 deletions docs/user-guide/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,4 +8,5 @@ maxdepth: 2
getting-started
parameter-tables
generic-providers
serialization
```
237 changes: 237 additions & 0 deletions docs/user-guide/provenance.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,237 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1e07a87b-a043-458f-ad37-4712c3c42f2c",
"metadata": {},
"source": [
"# Provenance\n",
"\n",
"It is generally useful to be able to track the provenance of a dataset, that is, where it came from and how it was processed.\n",
"Sciline can help because its task graphs encode the 'how'.\n",
"To this end, a graph needs to be stored together with the processed data.\n",
"\n",
"Considering that such graphs might be stored for a long time, they need to be serialized to a format that\n",
"\n",
"- represents the full structure of the graph,\n",
"- is readable by software that does not depend on Sciline or Python,\n",
"- is human readable (with some effort, the priority is machine readability).\n",
"\n",
"Points 2 and 3 exclude serializing the full Python objects, e.g., with `pickle`.\n",
"But this means that any solution will be partial as it cannot capture the full environment that the pipeline is defined in.\n",
"In particular, it cannot track functions called by providers that are external to the pipeline.\n",
"See the section on [Reproducibility](#Reproducibility).\n",
"\n",
"Note that the [Graphviz](https://graphviz.org/) objects produced by [Pipeline.visualize](../generated/classes/sciline.Pipeline.rst#sciline.Pipeline.visualize) are not sufficient because they do not encode the full graph structure but are instead optimized to give an overview of a task graph.\n",
"\n",
"<div class=\"alert alert-warning\">\n",
" \n",
"**Attention:**\n",
"\n",
"Sciline does not currently support serializing the values of parameters.\n",
"This is the responsibility of the user, at least for now.\n",
"\n",
"</div>\n",
"\n",
"## Serialization of task graphs to JSON\n",
"\n",
"Task graphs can be serialized to a simple JSON object that contains a node list and an edge list.\n",
"This format is similar to other JSON graph formats used by, e.g., [Networkx](https://networkx.org/) and [JSON Graph Format](http://jsongraphformat.info/).\n",
"\n",
"First, define a helper to display JSON:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7836a2b-c7bf-4dd8-ad3c-bbe4a585ed84",
"metadata": {},
"outputs": [],
"source": [
"from IPython import display\n",
"import json\n",
"\n",
"def show_json(j: dict):\n",
" return display.Markdown(f\"\"\"```json\n",
"{json.dumps(j, indent=2)}\n",
"```\"\"\")"
]
},
{
"cell_type": "markdown",
"id": "5b16b03c-e12c-4642-9878-05c53ed05a6f",
"metadata": {},
"source": [
"### Simple example\n",
"\n",
"First, construct a short pipeline, including some generic types and providers:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "71afe777-1c9c-4d97-8c48-d2fac427b073",
"metadata": {},
"outputs": [],
"source": [
"from typing import NewType, TypeVar\n",
"import sciline\n",
"\n",
"A = NewType('A', int)\n",
"B = NewType('B', int)\n",
"T = TypeVar('T', A, B)\n",
"\n",
"class Int(sciline.Scope[T, int], int): ...\n",
"\n",
"def make_int_b() -> Int[B]:\n",
" return Int[B](2)\n",
"\n",
"def to_string(a: Int[A], b: Int[B]) -> str:\n",
" return f'a: {a}, b: {b}'\n",
"\n",
"pipeline = sciline.Pipeline([make_int_b, to_string], params={Int[A]: 3})\n",
"task_graph = pipeline.get(str)\n",
"task_graph.visualize(graph_attr={'rankdir': 'LR'})"
]
},
{
"cell_type": "markdown",
"id": "44a790a9-53b1-409a-b6e5-fbb70c7ec7b0",
"metadata": {},
"source": [
"This graph can be serialized to JSON using its [serialize](../generated/classes/sciline.TaskGraph.rst#sciline.TaskGraph.serialize) method.\n",
"We need to use the task graph obtained from `pipeline.get` for this purpose, not the pipeline itself:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1be6cda-7dfd-4047-ba7e-8a8fd88b3ca0",
"metadata": {},
"outputs": [],
"source": [
"show_json(task_graph.serialize())"
]
},
{
"cell_type": "markdown",
"id": "ae4a8925-b115-41bc-a618-6ad23dd27fa1",
"metadata": {},
"source": [
"Let's disect the format.\n",
"\n",
"The `directed` and `multigraph` properties always have the same values.\n",
"They are included for compatibility with [Networkx](https://networkx.org/) and [JSON Graph Format](http://jsongraphformat.info/).\n",
"\n",
"Note the use of qualified names.\n",
"Those make it easier to identify exactly what types and functions have been used while the `label` is a shortened representation.\n",
"\n",
"All ids are unique across nodes and edges.\n",
"\n",
"#### `nodes`\n",
"An array of node objects.\n",
"The nodes always have an `id`, `label`, `kind`, and `out` property.\n",
"\n",
"- `id` is a unique identifier of the node. (**Do not rely on it having a specific format, this may change at any time!**)\n",
"- `label` is a human-readable name for the node.\n",
"- `kind` indicates what the node represents, there are parameter nodes and function nodes which correspond to parameters and providers in the pipeline, respectively.\n",
"- `out` holds the fully qualified name of the type of object that the node produces. That is, the type of the parameter or the return type of the function.\n",
"\n",
"Depending on their `kind`, nodes have additional properties.\n",
"For function nodes, there is a `function` property which stores the fully qualified name of the function that the provider uses.\n",
"In addition, there are `args` and `kwargs` properties which list *edge* ids for all arguments and keyword arguments of the function.\n",
"\n",
"#### `edges`\n",
"An array of directed edges.\n",
"\n",
"- `id` is a unique identifier of the edge.\n",
"- `source` and `target` refer to the `id` field of a node."
]
},
{
"cell_type": "markdown",
"id": "279948ef-67ac-4434-9725-7b9b380c48ed",
"metadata": {},
"source": [
"### Reproducibility\n",
"\n",
"The JSON format used here was chosen as a simple, future-proof format for task graphs.\n",
"It can, however, only capture part of the actual pipeline.\n",
"For example, it only shows the structure of the graph and contains the names of functions and types.\n",
"But it does not encode the implementation of those functions or types.\n",
"Thus, the graph can only be correctly reconstructed in an environment that contains the same software that was used to write the graph.\n",
"This includes all packages that might be used by the providers.\n",
"\n",
"<div class=\"alert alert-info\">\n",
" \n",
"**Hint:**\n",
"\n",
"Note, for example, that the graphs here refer to functions and types in `__main__`, that is, functions and types defined in this Jupyter notebook.\n",
"These cannot be reliably reconstructed.\n",
"Thus, it is recommended to define all pipeline components in a Python package with a version number.\n",
"\n",
"</div>\n",
"\n",
"<div class=\"alert alert-warning\">\n",
" \n",
"**Warning:**\n",
"\n",
"Python 3.12 type aliases (`type MyType = int`) only allow for limited inspection of the alias.\n",
"In particular, they have no `__qualname__`.\n",
"This means that they can only be fully represented when defined at the top level of a module.\n",
"\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "dfe68a98-5777-4b9b-9c02-029e32c01385",
"metadata": {},
"source": [
"### JSON schema\n",
"\n",
"The schema for the JSON object returned by `TaskGraph.serialize` is available as part of the Sciline package:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "055430c0-eedc-41df-afbb-962f15c20b21",
"metadata": {},
"outputs": [],
"source": [
"from sciline.serialize import json_schema"
]
},
{
"cell_type": "markdown",
"id": "6e3ba1f9-e8c1-4167-8368-7ed7cf8d3f46",
"metadata": {},
"source": [
"The schema is too long to show it here.\n",
"It is available online at https://github.com/scipp/sciline/blob/c7ae8e61883bbbf34c2053fba4a3128d887ea777/src/sciline/serialize/graph_json_schema.json"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
2 changes: 2 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,8 @@ addopts = """
testpaths = "tests"
filterwarnings = [
"error",
# From jsonschema
"ignore:datetime.datetime.utcfromtimestamp:DeprecationWarning",
]

[tool.bandit]
Expand Down
1 change: 1 addition & 0 deletions requirements/basetest.in
Original file line number Diff line number Diff line change
Expand Up @@ -2,5 +2,6 @@
# Do not make an environment from this file, use test.txt instead!

graphviz
jsonschema
numpy
pytest
24 changes: 20 additions & 4 deletions requirements/basetest.txt
Original file line number Diff line number Diff line change
@@ -1,23 +1,39 @@
# SHA1:d40e3cd7a7e8ab8723d8e5496c92dfdab9b79849
# SHA1:f7d11f6aab1600c37d922ffb857a414368b800cd
#
# This file is autogenerated by pip-compile-multi
# To update, run:
#
# pip-compile-multi
#
attrs==23.2.0
# via
# jsonschema
# referencing
exceptiongroup==1.2.0
# via pytest
graphviz==0.20.1
# via -r basetest.in
iniconfig==2.0.0
# via pytest
numpy==1.26.3
jsonschema==4.21.1
# via -r basetest.in
jsonschema-specifications==2023.12.1
# via jsonschema
numpy==1.26.4
# via -r basetest.in
packaging==23.2
# via pytest
pluggy==1.3.0
pluggy==1.4.0
# via pytest
pytest==7.4.4
pytest==8.0.0
# via -r basetest.in
referencing==0.33.0
# via
# jsonschema
# jsonschema-specifications
rpds-py==0.17.1
# via
# jsonschema
# referencing
tomli==2.0.1
# via pytest
12 changes: 6 additions & 6 deletions requirements/ci.txt
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
#
cachetools==5.3.2
# via tox
certifi==2023.11.17
certifi==2024.2.2
# via requests
chardet==5.2.0
# via tox
Expand All @@ -23,7 +23,7 @@ filelock==3.13.1
# virtualenv
gitdb==4.0.11
# via gitpython
gitpython==3.1.40
gitpython==3.1.41
# via -r ci.in
idna==3.6
# via requests
Expand All @@ -32,11 +32,11 @@ packaging==23.2
# -r ci.in
# pyproject-api
# tox
platformdirs==4.1.0
platformdirs==4.2.0
# via
# tox
# virtualenv
pluggy==1.3.0
pluggy==1.4.0
# via tox
pyproject-api==1.6.1
# via tox
Expand All @@ -48,9 +48,9 @@ tomli==2.0.1
# via
# pyproject-api
# tox
tox==4.11.4
tox==4.12.1
# via -r ci.in
urllib3==2.1.0
urllib3==2.2.0
# via requests
virtualenv==20.25.0
# via tox
Loading

0 comments on commit 9b970ad

Please sign in to comment.