Refactor to split off dask and generalize graph builder #12

SimonHeybrock · 2023-07-17T11:38:24Z

This refactors Pipeline, adding build() which returns a generic graph (currently as a dict). This can be turned into a task graph that works with dask, or can be used to make graphviz.Digraph. The internal use of dask.delayed is removed completely.

I am much happier with this split. It opens the door to, e.g., dropping the dask requirement completely, or implementing a custom dask Collection on top of this.

I am not fully happy with the interface yet. I think what we have here now can be suitable as a low-level interface that we can build on top of. I am considering, among other things, splitting the graph off into a class that provides some helpers such as compute(), or facilities to build a dask graph. Not sure exactly where this is going, but it may be useful to review now before such changes.
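
For illustration, a minimal sketch of how a generic dict graph could be turned into a dask-style task dict. This is not the code in this PR. It assumes the `key -> (provider, bound, args)` shape visible in the diff, treats `bound` as opaque, and assumes `args` lists parameters in call order. `to_task_dict`, `get`, `make_a`, and `make_b` are hypothetical names, and the tiny evaluator only stands in for a real scheduler.

```python
from typing import Any, Callable, Dict, Tuple

# Assumed shape, taken from the diff under review: key -> (provider, bound, args).
# `bound` is treated as opaque here; `args` maps parameter names to keys.
Graph = Dict[Any, Tuple[Callable[..., Any], Dict[Any, Any], Dict[str, Any]]]


def to_task_dict(graph: Graph) -> Dict[Any, tuple]:
    """Convert to a dask-style dict: key -> (callable, *argument_keys)."""
    return {
        key: (provider, *args.values())
        for key, (provider, _bound, args) in graph.items()
    }


def get(tasks: Dict[Any, tuple], key: Any) -> Any:
    """Tiny recursive evaluator standing in for a real scheduler (no caching)."""
    func, *arg_keys = tasks[key]
    return func(*(get(tasks, k) for k in arg_keys))


def make_a() -> int:
    return 3


def make_b(a: int) -> float:
    return a / 2


graph: Graph = {
    int: (make_a, {}, {}),
    float: (make_b, {}, {"a": int}),
}
tasks = to_task_dict(graph)
result = get(tasks, float)
```

With dask installed, a dict of this shape is roughly what a scheduler such as `dask.threaded.get` accepts, though dask restricts keys to strings, numbers, and tuples of those, so type keys would likely need converting first.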

jl-wynen

Good idea!
I agree that there should be a class for Graph, in particular since it is currently represented in terms of a relatively complicated tuple. I'd even have a separate graph class that only has the encoding and graph primitives in addition to a task graph class that has the higher level ops like compute.

jl-wynen · 2023-07-17T12:02:17Z

src/sciline/pipeline.py

+        }
+        graph: Graph = {tp: (provider, bound, args)}
+        for arg in args.values():
+            graph.update(self.build(arg))


Potential infinite loop here (or up to max stack depth). @YooSunYoung has a solution based on graphlib.

Do you mind just getting RecursionError?

Yes. It would be nicer to report cycles in the graph more clearly. Maybe even during construction of the pipeline.
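
A minimal sketch of the graphlib-based check mentioned above, using only the standard library. The dependency maps and the `check_acyclic` name are hypothetical; how the pipeline would produce such a map is not shown in this thread.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Any, Dict, Iterable


def check_acyclic(dependencies: Dict[Any, Iterable[Any]]) -> None:
    """Raise graphlib.CycleError if the dependency mapping contains a cycle."""
    # prepare() validates the whole graph up front, without deep recursion.
    TopologicalSorter(dependencies).prepare()


# Hypothetical dependency maps: each key lists the keys it requires.
acyclic = {"c": ["b"], "b": ["a"], "a": []}
cyclic = {"a": ["b"], "b": ["a"]}

check_acyclic(acyclic)  # passes silently

try:
    check_acyclic(cyclic)
    found_cycle = False
except CycleError as err:
    found_cycle = True
    # Second element of args is the cycle; its first and last node are the same.
    cycle = err.args[1]
```

Running such a check when providers are registered, rather than in build(), would report the cycle during construction of the pipeline.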

jl-wynen · 2023-07-17T12:05:29Z

src/sciline/graph.py

+            _format_type(ret),
+        )
+        for ret, (provider, bound, args) in graph.items()
+    }


Not hugely relevant since graphs are small, but could yield tuples here instead of building a temporary dict.
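
A minimal sketch of that suggestion, assuming the `key -> (provider, bound, args)` graph shape from the diff. `iter_nodes` is a hypothetical name, and `_format_type` here is only a stand-in for the helper of the same name in the PR.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple

Graph = Dict[Any, Tuple[Callable[..., Any], Any, Dict[str, Any]]]


def _format_type(tp: Any) -> str:
    # Hypothetical stand-in for the helper of the same name in the diff.
    return getattr(tp, "__name__", str(tp))


def iter_nodes(graph: Graph) -> Iterator[Tuple[str, List[str], str]]:
    """Lazily yield (provider name, argument type names, return type name)."""
    for ret, (provider, _bound, args) in graph.items():
        yield (
            provider.__name__,
            [_format_type(arg) for arg in args.values()],
            _format_type(ret),
        )


def make_a() -> int:
    return 3


def make_b(a: int) -> float:
    return a / 2


graph: Graph = {
    int: (make_a, {}, {}),
    float: (make_b, {}, {"a": int}),
}
nodes = list(iter_nodes(graph))
```

The consumer can loop over `iter_nodes(graph)` directly, so no intermediate dict is materialized.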

src/sciline/graph.py

nvaytet

Some small comments. Looks good.

nvaytet · 2023-07-17T12:07:37Z

src/sciline/graph.py

+from .pipeline import Graph
+
+
+def to_graphviz(graph: Graph) -> Digraph:


I think it would be a good idea to also add a test for this.

In addition, do we want to make it explicit that it is using graphviz (as it currently does), or is that an implementation detail that the user should not care about? In the latter case we would call it something like show_graph, as we have done for transform_coords and plopp.

Sure, the main point of this PR is the other refactor.

tests/pipeline_test.py

SimonHeybrock · 2023-07-17T12:45:18Z

> I'd even have a separate graph class that only has the encoding and graph primitives in addition to a task graph class that has the higher level ops like compute

This is essentially what a dask collection is (see original PR message). I am not sure this is the right interface for us, since we may frequently want to compute multiple intermediate results. Will need more thought before moving on.

jl-wynen · 2023-07-17T12:49:44Z

> > I'd even have a separate graph class that only has the encoding and graph primitives in addition to a task graph class that has the higher level ops like compute
>
> This is essentially what a dask collection is (see original PR message). I am not sure this is the right interface for us, since we may frequently want to compute multiple intermediate results. Will need more thought before moving on.

Not really. A dask collection is more like a node of the graph (with links to parents). So it defines a graph as a linked structure. I was talking about a graph class that has a useful encoding for the graph plus methods like dfs, add_node, etc. And a task graph class that knows how to call these methods to string together computations to get a certain result or make a dask graph.
The reason is that this lets us abstract away the graph from the rest. If this becomes complicated enough, we may want to switch to a library like networkx instead of implementing graph algorithms ourselves.
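
A rough sketch of the split described above, not a proposal for the final interface. All names (`Graph`, `TaskGraph`, `add_node`, `node`, `dfs`, `compute`) and the `key -> (provider, dependencies)` encoding are assumptions made for illustration, and the graph is assumed to be acyclic.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional, Set, Tuple


class Graph:
    """Encoding plus graph primitives only; knows nothing about computation."""

    def __init__(self) -> None:
        self._nodes: Dict[Any, Tuple[Callable[..., Any], List[Any]]] = {}

    def add_node(
        self, key: Any, provider: Callable[..., Any], deps: Iterable[Any] = ()
    ) -> None:
        self._nodes[key] = (provider, list(deps))

    def node(self, key: Any) -> Tuple[Callable[..., Any], List[Any]]:
        return self._nodes[key]

    def dfs(self, key: Any, _seen: Optional[Set[Any]] = None) -> Iterator[Any]:
        """Yield keys in post-order: dependencies before dependents."""
        seen = set() if _seen is None else _seen
        if key in seen:
            return
        seen.add(key)
        for dep in self._nodes[key][1]:
            yield from self.dfs(dep, seen)
        yield key


class TaskGraph:
    """Higher-level ops built purely on top of the Graph primitives."""

    def __init__(self, graph: Graph) -> None:
        self._graph = graph

    def compute(self, key: Any) -> Any:
        # Assumes an acyclic graph, so every dependency is computed first.
        results: Dict[Any, Any] = {}
        for k in self._graph.dfs(key):
            provider, deps = self._graph.node(k)
            results[k] = provider(*(results[d] for d in deps))
        return results[key]


g = Graph()
g.add_node("a", lambda: 2)
g.add_node("b", lambda a: a + 1, ["a"])
g.add_node("c", lambda a, b: a * b, ["a", "b"])

order = list(g.dfs("c"))
value = TaskGraph(g).compute("c")
```

Because `TaskGraph` only calls `node` and `dfs`, the `Graph` internals could later be swapped for a library such as networkx without touching the computation layer.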

SimonHeybrock · 2023-07-17T12:53:44Z

> > > I'd even have a separate graph class that only has the encoding and graph primitives in addition to a task graph class that has the higher level ops like compute
>
> > This is essentially what a dask collection is (see original PR message). I am not sure this is the right interface for us, since we may frequently want to compute multiple intermediate results. Will need more thought before moving on.
>
> Not really. A dask collection is more like a node of the graph (with links to parents). So it defines a graph as a linked structure.

Not really. A dask collection is a graph (given by a dict), and a set of keys.

> I was talking about a graph class that has a useful encoding for the graph plus methods like dfs, add_node, etc. And a task graph class that knows how to call these methods to string together computations to get a certain result or make a dask graph.

My initial comment was on the task graph class, not the graph class

SimonHeybrock · 2023-07-17T13:47:46Z

Do you guys prefer merging this, or do you want to see two more days of follow-up changes in the same PR?

jl-wynen

> Do you guys prefer merging this, or do you want to see two more days of follow-up changes in the same PR?

I'm pretty sure we'll go with something like this. And since no one uses sciline, you can merge it. Let's keep the diffs short.

SimonHeybrock added 6 commits July 14, 2023 13:19

Add Pipeline.get_graph and graph.make_graph

da85516

Rename to to_graphviz

f7693f1

Partial refactor to split graph build from dask, avoiding dask.delayed

c8c4509

Try removing troublesome NewType annotations

e45a5ab

Remove caching and remove Pipeline.get

ecef013

Spelling

b24ac34

jl-wynen reviewed Jul 17, 2023

nvaytet reviewed Jul 17, 2023

SimonHeybrock added 2 commits July 17, 2023 14:21

Minor

42a9b98

Raise CycleError instead of RecursionError

63f3973

jl-wynen approved these changes Jul 17, 2023

SimonHeybrock merged commit ac51c03 into main Jul 18, 2023

SimonHeybrock deleted the graphviz branch July 18, 2023 04:59
