Dataflow cluster analysis #977

Ellpeck · 2024-09-16T12:03:46Z

We want to calculate clusters on the dataflow graph that determine which parts of the code are highly dependent on each other (i.e., "belong together"). Open question: when can we ignore opposite-facing directed edges, and when do we have to traverse them?

Output should be a set of flowR node IDs that form a cluster, to be evaluated further later on.

Step 1: Simple/naive cluster calculation using reachability analysis.
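A minimal sketch of what the naive calculation in Step 1 could look like, assuming the graph is available as a plain list of node IDs and directed edges. The `NodeId` and `Edge` types and the `findClusters` function are hypothetical and do not use flowR's actual graph API. The sketch sidesteps the open question above by always traversing edges in both directions, so every cluster is simply a connected component.

```typescript
// Hypothetical, simplified types; flowR's real dataflow graph API is not used here.
type NodeId = string;
type Edge = readonly [NodeId, NodeId]; // [from, to]

/** Groups nodes into clusters by reachability, following every edge in both directions. */
function findClusters(nodes: readonly NodeId[], edges: readonly Edge[]): Set<NodeId>[] {
    // Build an undirected adjacency map so opposite-facing edges are traversed too.
    const neighbors = new Map<NodeId, NodeId[]>();
    for (const id of nodes) {
        neighbors.set(id, []);
    }
    for (const [from, to] of edges) {
        neighbors.get(from)?.push(to);
        neighbors.get(to)?.push(from);
    }

    const visited = new Set<NodeId>();
    const clusters: Set<NodeId>[] = [];
    for (const start of nodes) {
        if (visited.has(start)) {
            continue;
        }
        // Depth-first search that collects everything reachable from `start`.
        const cluster = new Set<NodeId>();
        const stack: NodeId[] = [start];
        visited.add(start);
        while (stack.length > 0) {
            const current = stack.pop() as NodeId;
            cluster.add(current);
            for (const next of neighbors.get(current) ?? []) {
                if (!visited.has(next)) {
                    visited.add(next);
                    stack.push(next);
                }
            }
        }
        clusters.push(cluster);
    }
    return clusters;
}

// Usage: nodes 0-2 are connected (one edge faces the "other" way), nodes 3-4 form a second cluster.
const exampleClusters = findClusters(['0', '1', '2', '3', '4'], [['0', '1'], ['2', '1'], ['3', '4']]);
console.log(exampleClusters.map(c => Array.from(c)));
```

Ignoring some edge directions would amount to skipping one of the two `push` calls for the affected edge types, which is the part the open question is about.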

Step 2: The result will likely be one large cluster, because there are shared dependencies on setup steps, reused functions, etc. We can implement a "bottleneck" node calculation that splits clusters at these kinds of nodes and creates separate clusters that each contain the "bottleneck" node. Open question: what constitutes a "bottleneck" node, i.e., when is it reused enough, and when is the cluster around it small enough, to be splittable?
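One possible reading of the split in Step 2, as a rough sketch only: take a candidate node out of the cluster, group the remaining nodes by undirected reachability, and add the candidate back into every resulting group. How a candidate is chosen is exactly the open question, so the sketch takes it as an input. The `splitAtBottleneck` function, the simplified types, and the `minParts` threshold are all assumptions for illustration, not flowR's API or a settled definition of "bottleneck".

```typescript
// Hypothetical, simplified types; flowR's real dataflow graph API is not used here.
type NodeId = string;
type Edge = readonly [NodeId, NodeId]; // [from, to]

/**
 * Splits `cluster` at `candidate`: the candidate is removed, the remaining nodes are
 * grouped by undirected reachability, and the candidate is added to every group.
 * Returns a copy of the unsplit cluster if fewer than `minParts` groups result.
 */
function splitAtBottleneck(
    cluster: ReadonlySet<NodeId>,
    edges: readonly Edge[],
    candidate: NodeId,
    minParts = 2 // assumed threshold for "reused enough"
): Set<NodeId>[] {
    if (!cluster.has(candidate)) {
        return [new Set(cluster)];
    }
    // Adjacency over the cluster without the candidate.
    const neighbors = new Map<NodeId, NodeId[]>();
    cluster.forEach(id => {
        if (id !== candidate) {
            neighbors.set(id, []);
        }
    });
    for (const [from, to] of edges) {
        // Keep only edges that lie inside the cluster and do not touch the candidate.
        if (neighbors.has(from) && neighbors.has(to)) {
            neighbors.get(from)?.push(to);
            neighbors.get(to)?.push(from);
        }
    }

    const visited = new Set<NodeId>();
    const parts: Set<NodeId>[] = [];
    cluster.forEach(start => {
        if (start === candidate || visited.has(start)) {
            return;
        }
        // Every part individually contains the bottleneck node.
        const part = new Set<NodeId>([candidate]);
        const stack: NodeId[] = [start];
        visited.add(start);
        while (stack.length > 0) {
            const current = stack.pop() as NodeId;
            part.add(current);
            for (const next of neighbors.get(current) ?? []) {
                if (!visited.has(next)) {
                    visited.add(next);
                    stack.push(next);
                }
            }
        }
        parts.push(part);
    });
    return parts.length >= minParts ? parts : [new Set(cluster)];
}

// Usage: a shared "setup" node feeds two otherwise unrelated parts of the code.
const big = new Set<NodeId>(['setup', 'a1', 'a2', 'b1']);
const bigEdges: Edge[] = [['setup', 'a1'], ['a1', 'a2'], ['setup', 'b1']];
console.log(splitAtBottleneck(big, bigEdges, 'setup').map(p => Array.from(p)));
```

A size limit on the cluster around the candidate ("small enough") is not modeled here; it could be added as a second hypothetical threshold once the criteria are decided.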

Implementation as a separate "post analysis" module rather than a pipeline step.

Ellpeck added the enhancement (New feature or request) and dataflow (Related to dataflow extraction) labels on Sep 16, 2024

Ellpeck linked a pull request Sep 18, 2024 that will close this issue

Dataflow cluster analysis #985

Merged

EagleoutIce mentioned this issue Sep 23, 2024

The New Query Api (aka [1]007 with the license to kill redundant frontend boilerplate) #1007

Open

5 tasks

EagleoutIce mentioned this issue Oct 10, 2024

[Query API] Expose Clustering #1053

Closed

EagleoutIce closed this as completed in #985 Oct 10, 2024
