Change the DAG to have separate nodes for operations and arrays #337

tomwhite · 2023-12-17T13:15:22Z

Currently there is a one-to-one correspondence between operations and arrays: one operation produces one array, and is represented by one node in the DAG.

However, this won't be true when we support multiple outputs (#69) where one operation can produce multiple arrays.

Similarly, the one-to-one representation can't express optimizations like sibling fusion (found in Apache Beam, and originally in FlumeJava), where operations that share the same inputs but produce different outputs are fused into a single operation with multiple outputs. (Wukong calls this "task clustering".)
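As a rough illustration of what sibling fusion means, here is a minimal sketch. The `Op` record and the `fuse_siblings` helper are hypothetical names made up for this example; they are not part of this PR or of any library, and a real optimizer would also need to check that the sibling operations are actually compatible before fusing them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical operation record: a name, its input arrays, its output arrays."""

    name: str
    inputs: tuple
    outputs: tuple


def fuse_siblings(ops):
    """Merge operations that read exactly the same inputs into one multi-output op."""
    groups = defaultdict(list)
    for op in ops:
        # Siblings are operations with an identical tuple of inputs.
        groups[op.inputs].append(op)

    fused = []
    for inputs, siblings in groups.items():
        if len(siblings) == 1:
            # Nothing to fuse: keep the operation unchanged.
            fused.append(siblings[0])
            continue
        name = "+".join(op.name for op in siblings)
        outputs = tuple(out for op in siblings for out in op.outputs)
        fused.append(Op(name, inputs, outputs))
    return fused


ops = [
    Op("op-mean", ("a",), ("mean",)),
    Op("op-std", ("a",), ("std",)),
    Op("op-add", ("a", "b"), ("sum",)),
]
fused_ops = fuse_siblings(ops)
```

The fused operation produces two arrays, which is exactly the case a one-node-per-array DAG cannot represent.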

So we need to change the DAG representation so that one operation can produce multiple arrays, and this means breaking nodes into two types: operations and arrays.
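To make the two-node-type idea concrete, here is a small standalone sketch. The `Dag` class, its methods, and the `op-divmod` example are all hypothetical and exist only for illustration; the actual data structure used in this PR may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Dag:
    """Hypothetical DAG with two node types; edges run array -> op or op -> array."""

    node_types: dict = field(default_factory=dict)  # node name -> "op" or "array"
    successors: dict = field(default_factory=dict)  # node name -> list of node names

    def _add_node(self, name, node_type):
        # Register the node once, and refuse to reuse a name for the other type.
        existing = self.node_types.setdefault(name, node_type)
        if existing != node_type:
            raise ValueError(f"{name!r} is already registered as an {existing} node")
        self.successors.setdefault(name, [])

    def add_op(self, op_name, inputs, outputs):
        """Add an operation node and connect it to its input and output array nodes."""
        self._add_node(op_name, "op")
        for arr in inputs:
            self._add_node(arr, "array")
            self.successors[arr].append(op_name)
        for arr in outputs:
            self._add_node(arr, "array")
            self.successors[op_name].append(arr)

    def outputs_of(self, op_name):
        """Return the arrays produced by an operation (possibly more than one)."""
        return list(self.successors[op_name])


dag = Dag()
# One operation producing two arrays, which a one-node-per-array DAG cannot express.
dag.add_op("op-divmod", inputs=["a", "b"], outputs=["quotient", "remainder"])
```

Because arrays and operations are distinct nodes, an operation with several outputs is just an op node with several outgoing edges.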

This PR does exactly that. An example DAG (from test_add_with_broadcast) in the current representation

now looks like this:

Note:

- Operations are shown as boxes with rounded corners; arrays are rectangles.
- Colours indicate the following:
  - pink for a blockwise operation
  - green for a rechunk operation
  - orange for a materialized array
  - white for an operation that is not run in its own pipeline, or for an array that is not materialized to disk as a Zarr array (a "virtual array")
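The styling rules above could be written down roughly as follows. This is only a sketch that emits Graphviz DOT node statements as strings; the helper names (`op_attrs`, `array_attrs`, `to_dot_line`) and the node names are hypothetical and are not taken from the PR's code.

```python
# Fill colours for operation kinds, following the note above; any other
# operation (one not run in its own pipeline) falls back to white.
OP_COLOURS = {"blockwise": "pink", "rechunk": "green"}


def op_attrs(kind=None):
    """Graphviz attributes for an operation node: a box with rounded corners."""
    return {
        "shape": "box",
        "style": "rounded,filled",
        "fillcolor": OP_COLOURS.get(kind, "white"),
    }


def array_attrs(materialized):
    """Graphviz attributes for an array node: a plain rectangle."""
    return {
        "shape": "box",
        "style": "filled",
        "fillcolor": "orange" if materialized else "white",
    }


def to_dot_line(name, attrs):
    """Render one DOT node statement from a name and an attribute dict."""
    body = ", ".join(f'{key}="{value}"' for key, value in attrs.items())
    return f'"{name}" [{body}];'


lines = [
    to_dot_line("op-001", op_attrs("blockwise")),
    to_dot_line("op-002", op_attrs("rechunk")),
    to_dot_line("array-001", array_attrs(materialized=True)),
    to_dot_line("array-002", array_attrs(materialized=False)),
]
```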

I wondered about having an option to visualize DAGs as we currently do, but that won't work when there are multiple outputs, so it may be best just to have the more general representation.

tomwhite · 2024-01-02T12:31:44Z

@TomNicholas any thoughts on this one before merging?

TomNicholas · 2024-01-02T19:05:41Z

This looks great! I also asked my colorblind friend if he could distinguish all the colors and he gave it a thumbs up 👍

tomwhite · 2024-01-03T10:10:48Z

> I also asked my colorblind friend if he could distinguish all the colors and he gave it a thumbs up 👍

Thank you!

tomwhite added the core label Dec 17, 2023

tomwhite mentioned this pull request Dec 19, 2023

Optimization tracking issue #339

Open

20 tasks

Change the DAG to have separate nodes for operations and arrays

e063364

tomwhite force-pushed the new-dag branch from d4a6c30 to e063364 Compare December 21, 2023 16:31

tomwhite merged commit f83e556 into main Jan 3, 2024
7 checks passed

tomwhite deleted the new-dag branch January 3, 2024 08:39

tomwhite added a commit that referenced this pull request Feb 12, 2024

Rename array_name to name in runtime following #337

e99b0da

tomwhite added a commit that referenced this pull request Feb 12, 2024

Rename array_name to name in runtime following #337 (#377)

009aa3c
