Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor to split off dask and generalize graph builder #12

Merged
merged 8 commits into from
Jul 18, 2023

Conversation

SimonHeybrock
Copy link
Member

This refactors Pipeline, adding build() which returns a generic graph (currently as a dict). This can be turned into a task graph that works with dask, or can be used to make graphviz.Digraph. The internal use of dask.delayed is removed completely.

I think I am way more happy with this split. This opens the doors for, e.g., dropping the dask requirement completely, or implement a custom dask Collection on top of this.

I am not fully happy with the interface yet. I think what we have here now can be suitable as a low-level interface that we can build on top of. I am considering, among other things, splitting the graph into a class that providers some helpers such as compute(), or facilities to build a dask graph. Not sure exactly where this is going, but it may be useful to review now before such changes.

Copy link
Member

@jl-wynen jl-wynen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!
I agree that there should be a class for Graph. In particular since it currently is represented in terms of a relatively complicated tuple. I'd even have a separate graph class that only has the encoding and graph primitives in addition to a task graph class that has the higher level ops like compute.

}
graph: Graph = {tp: (provider, bound, args)}
for arg in args.values():
graph.update(self.build(arg))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential infinite loop here (or up to max stack depth). @YooSunYoung has a solution based on graphlib.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind just getting RecursionError?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It would be nicer to report cycles in the graph more clearly. Maybe even during construction of the pipeline.

_format_type(ret),
)
for ret, (provider, bound, args) in graph.items()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not hugely relevant since graphs are small, but could yield tuples here instead of building a temporary dict.

Copy link
Member

@nvaytet nvaytet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small comments. Looks good.

from .pipeline import Graph


def to_graphviz(graph: Graph) -> Digraph:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a good idea to also add a test for this.

In addition, do we want to make it explicit that it is using graphviz (as it is currently), or is that an implementation detail that the user should not care about? Meaning we would call it something like show_graph, as we have done for the transform_coords and plopp.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, the main point of this PR is the other refactor.

@SimonHeybrock
Copy link
Member Author

I'd even have a separate graph class that only has the encoding and graph primitives in addition to a task graph class that has the higher level ops like compute

This is essentially what a dask collection is (so original PR message). I am not sure this is the right interface for us, since we may frequently want to compute multiple intermediate results. Will need more thought before moving on.

@jl-wynen
Copy link
Member

I'd even have a separate graph class that only has the encoding and graph primitives in addition to a task graph class that has the higher level ops like compute

This is essentially what a dask collection is (so original PR message). I am not sure this is the right interface for us, since we may frequently want to compute multiple intermediate results. Will need more thought before moving on.

Not really. A dask collection is more like a node of the graph (with links to parents). So it defines a graph as a linked structure. I was talking about a graph class that has a useful encoding for the graph plus methods like dfs, add_node, etc. And a task graph class that knows how to call these methods to string together computations to get a certain result or make a dask graph.
The reason is that this lets us abstract away the graph fro m the rest. If this becomes complicated enough, we may want to switch to a library like networkx instead of implementing graph algorithms ourselves.

@SimonHeybrock
Copy link
Member Author

I'd even have a separate graph class that only has the encoding and graph primitives in addition to a task graph class that has the higher level ops like compute

This is essentially what a dask collection is (so original PR message). I am not sure this is the right interface for us, since we may frequently want to compute multiple intermediate results. Will need more thought before moving on.

Not really. A dask collection is more like a node of the graph (with links to parents). So it defines a graph as a linked structure.

Not really. A dask collection is a graph (given by a dict), and a set of keys.

I was talking about a graph class that has a useful encoding for the graph plus methods like dfs, add_node, etc. And a task graph class that knows how to call these methods to string together computations to get a certain result or make a dask graph.

My initial comment was on the task graph class, not the graph class

@SimonHeybrock
Copy link
Member Author

Due you guys prefer merging this, or do you want to see two more days of follow-up changes in the same PR?

Copy link
Member

@jl-wynen jl-wynen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due you guys prefer merging this, or do you want to see two more days of follow-up changes in the same PR?

I'm pretty sure we'll go with something like this. And since no one uses sciline, you can merge it. Let's keep the diffs short.

@SimonHeybrock SimonHeybrock merged commit ac51c03 into main Jul 18, 2023
@SimonHeybrock SimonHeybrock deleted the graphviz branch July 18, 2023 04:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants