This issue was moved to a discussion.

Kedro-Viz flowchart timeline implementation brainstorm #903

Closed
antonymilne opened this issue Jun 10, 2022 · 3 comments

Comments

@antonymilne
Contributor

antonymilne commented Jun 10, 2022

Context

Providing some sort of timeline view in kedro-viz has come up several times in the past:

  • when exploring possibilities for experiment tracking, I believe everyone independently proposed some sort of timeline view as a feature
  • it was worked on in a hackathon
  • it's Milestone 4 on experiment tracking

Just to give a very rough idea of what I'm talking about, here's what was made in the hackathon:
[image: screenshot of the timeline prototype made in the hackathon]

This issue is for rough thoughts on how we might implement a simple timeline view in kedro-viz that keeps track of how the pipeline flowchart changes between kedro runs. It is meant to be an MVP, not a fully fledged timeline view with all the bells and whistles showing everything we might eventually want. The scope here is deliberately very restricted:

The timeline would be a one-dimensional array of points, ordered in time, where each point corresponds to a kedro run command and can be clicked to show the kedro-viz pipeline flowchart as it was when that command was executed. It won't include historic metadata (e.g. what the code was when that node was run) or any experiment tracking stuff, just the pipeline structure.

"one-dimensional" means that, compared to the above picture, I'm not putting anything on the y axis. It's just a single x-axis line with blobs on it.

Things not considered here:

  • motivation for why we might want to do this, what the priority should be, any user research, etc.
  • exactly what a more complete timeline should show (git commits? experiment tracking data on a y-axis?)
  • design work on where the timeline should go, what it should look like, etc.
  • how it would relate to experiment tracking
  • live updates on kedro-viz as the kedro pipeline runs
  • showing on kedro-viz which nodes were actually executed as part of a kedro run
  • triggering runs from kedro-viz

These are all important points, and some of them seem like very obvious next steps for a timeline. But here I'm just jotting down current thoughts while they're fresh in my mind following a recent chat with @tynandebold.

@antonymilne
Contributor Author

antonymilne commented Jun 10, 2022

Proposed backend implementation

I don't think any changes will be required on the framework side; all the implementation is on the kedro-viz side. I think we have everything we need in kedro core already thanks to the magic of the framework plugins system. 🎉 🚀 So long as you have done pip install kedro-viz, the pipeline json files will be created, and if you don't want that then you can disable the kedro-viz hooks in settings.py.
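
For the opt-out, kedro's settings.py supports a DISABLE_HOOKS_FOR_PLUGINS setting that switches off hooks registered automatically by installed plugins. A minimal sketch, assuming the timeline hook would be auto-registered under the plugin name "kedro-viz":

```python
# src/<package_name>/settings.py
# Opt out of every hook that the kedro-viz plugin registers automatically.
# Assumption: the timeline hook ships as an auto-registered plugin hook
# under the distribution name "kedro-viz".
DISABLE_HOOKS_FOR_PLUGINS = ("kedro-viz",)
```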

Here's what we'd need:

  1. Generate pipeline json without launching kedro-viz server
  2. Hook to generate the pipeline json
  3. Configure where the pipeline json is saved
  4. Send list of all pipeline json timestamps to frontend in kedro-viz
  5. Send a requested pipeline json to the frontend in kedro-viz

1. Generate pipeline json without launching kedro-viz server

Estimated difficulty: very easy

Currently kedro viz --save-file does the following in run_server:

  • uses kedro_data_loader.load_data and populate_data to fetch all the relevant kedro project information (catalog, pipelines, etc.) and put it in the right place
  • arranges this according to GraphAPIResponse
  • writes this to a json file
  • launches kedro viz app

As far as I can tell, there's actually no reason for the last step to execute here at all. So let's extract the necessary stages into a new function, pipeline_to_json (name tbd) which is called by run_server. We could then call pipeline_to_json independently to make a pipeline json without needing to launch any server.
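
To make the refactor concrete, here is a rough sketch of the split. It is not the real kedro-viz code: build_response is a hypothetical callable standing in for the existing load_data / populate_data / GraphAPIResponse steps, and launch_app stands in for starting the server. Only the name pipeline_to_json comes from the proposal above.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Optional


def pipeline_to_json(
    build_response: Callable[[], Dict[str, Any]],
    save_file: Path,
) -> Path:
    """Write the flowchart payload to save_file without starting a server.

    build_response is a hypothetical stand-in for the existing steps that
    load the project data and arrange it in the GraphAPIResponse shape.
    """
    payload = build_response()
    save_file.parent.mkdir(parents=True, exist_ok=True)
    save_file.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return save_file


def run_server(
    build_response: Callable[[], Dict[str, Any]],
    save_file: Optional[str] = None,
    launch_app: Callable[[], None] = lambda: None,
) -> None:
    """Simplified stand-in for run_server: saving is a separate, optional step."""
    if save_file is not None:
        pipeline_to_json(build_response, Path(save_file))
    launch_app()
```

With this split, a hook or a script could call pipeline_to_json directly and never touch the server code path.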

The kedro framework Pipeline.to_json is not useful here and should just be ignored. The json format required is provided by GraphAPIResponse.

One slight catch: we should maybe rethink what happens if you run --save-file with a particular pipeline, and how to deal with the bug where nodes aren't present in the __default__ pipeline (will write this up in a separate issue).

2. Hook to generate the pipeline json

Estimated difficulty: pretty easy

We'll need a hook (or maybe hooks) that runs when you do kedro run, calls pipeline_to_json and saves to a file that is labelled by the session_id (basically a timestamp). This should be very straightforward.

Thoughts:

  • should it still save if the pipeline doesn't complete successfully?
  • if pipeline_to_json uses kedro_data_loader.load_data then there is some redundancy/inefficiency here. A kedro session will already be open, and objects like catalog and pipeline are already directly available as hook arguments. The kedro-viz data loader would create its own session to fetch the same objects. The data loader provides a nice backwards-compatible system but probably that's not useful here because you couldn't have nested sessions until very recently in kedro. So this might be a feature that's only available in kedro 0.18.1+.
  • need to consider exactly which pipeline(s) need to be saved here - is just __default__ ok? Depends a bit on the catch mentioned in point 1
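
A rough sketch of what such a hook could look like. Everything here is an assumption for illustration: the class name TimelineHooks, the injected build_response callable (standing in for pipeline_to_json's data gathering) and the one-file-per-session layout. In kedro 0.18 the run_params dict passed to pipeline hooks includes session_id. Saving in before_pipeline_run means the flowchart is recorded even if the run later fails, which is one possible answer to the first question above, not a decision.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict


class TimelineHooks:
    """Hypothetical hook class; nothing like this exists in kedro-viz yet."""

    def __init__(
        self,
        build_response: Callable[[], Dict[str, Any]],
        timeline_dir: Path,
    ) -> None:
        self._build_response = build_response
        self._timeline_dir = Path(timeline_dir)

    # In a real project this method would carry kedro's @hook_impl decorator
    # (from kedro.framework.hooks). It is left off here so the sketch runs
    # without kedro installed. Hook implementations may accept just a subset
    # of the arguments in the hook specification, hence only run_params.
    def before_pipeline_run(self, run_params: Dict[str, Any]) -> None:
        session_id = run_params["session_id"]
        target = self._timeline_dir / f"{session_id}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(self._build_response()), encoding="utf-8")
```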

3. Configure where the pipeline json is saved

Estimated difficulty: very easy to implement, but needs careful thought about right way to do it

Two possible schemes here:

  • store the json in the session store
  • store it as a versioned file (currently my preferred choice: offers remote storage, multi-user experience, easy deletion of old runs without needing to tinker with a database)

Either way, we'll probably want a new viz_timeline.yml config file to specify how this should be done (e.g. the file path location if we go for the second option). If we use the new after_context_created hook (kedro 0.18.1+), we can easily load this up through context.config_loader.

Need to consider:

  • what is the format of this configuration file? What things might be added to it in future? Should it be viz_timeline.yml specifically for configuring this feature or a more general viz.yml that might be extended in future for other kedro-viz features?
  • should we define the location using a versioned JSON dataset catalog entry? A tracking.JSONDataSet?
  • if we have some sort of schema for the config file, should we include a version number to allow for future changes?
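
To illustrate the schema-version idea, here is a sketch of how the loaded config (already parsed from YAML into a dict) could be validated. All key names (schema_version, timeline, filepath, enabled) and the default path are made up for this example; the actual format is exactly what still needs deciding.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Hypothetical: the schema versions this kedro-viz release understands.
SUPPORTED_SCHEMA_VERSIONS = {1}


@dataclass(frozen=True)
class TimelineConfig:
    schema_version: int
    filepath: str
    enabled: bool = True


def parse_timeline_config(raw: Dict[str, Any]) -> TimelineConfig:
    """Validate a dict loaded from a (hypothetical) viz_timeline.yml."""
    version = raw.get("schema_version", 1)
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"Unsupported viz timeline schema version: {version}")
    timeline = raw.get("timeline", {})
    return TimelineConfig(
        schema_version=version,
        # The default location is an illustrative placeholder only.
        filepath=timeline.get("filepath", "data/viz_timeline"),
        enabled=bool(timeline.get("enabled", True)),
    )
```

An explicit version number lets a later release reject or migrate old files instead of silently misreading them.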

4. Send a list of all pipeline json timestamps to frontend in kedro-viz

Estimated difficulty: ok

The Kedro-Viz frontend will need a list of timestamps to show on the timeline. This is very analogous to how experiment tracking gets a set of kedro run timestamps and metrics datasets (which uses both the session store and the versioned dataset files).

Note: if I understand correctly where GraphQL is currently used in kedro-viz, I think this wouldn't be able to automatically insert a new blob on the timeline when a new run is done (you'd need to restart the server, or maybe the --autoreload flag would do it?). This is a bit of a pity but ok.
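
If we go with the file-based option, collecting the timestamps could be as simple as the sketch below. It assumes one <session_id>.json file per run in a single local directory (a layout invented for this example) and that session ids are fixed-width timestamps, so sorting them as strings also sorts them in time.

```python
from pathlib import Path
from typing import List


def list_timeline_session_ids(timeline_dir: Path) -> List[str]:
    """Return the session ids that have a saved pipeline json, oldest first."""
    directory = Path(timeline_dir)
    if not directory.is_dir():
        return []
    # Path.stem drops only the final ".json" suffix, so the dots inside a
    # timestamp-style session id are preserved.
    return sorted(path.stem for path in directory.glob("*.json"))
```

Because the directory is re-read on every call, a new run would show up the next time the frontend asks for the list, provided the frontend polls or is refreshed.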

5. Send a requested pipeline json to the frontend in kedro-viz

Estimated difficulty: ok

The Kedro-Viz UI will request that the server send the pipeline json for a particular session_id. We need an endpoint that will handle this.

Need to understand better exactly how the current endpoints work to figure out the best way to do this, but it doesn't sound too hard: should just be a case of retrieving the relevant pipeline json from the session store or file location and returning it.
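
Independent of how the endpoint itself ends up being wired in, the retrieval part might look like this sketch for the file-based option. The function name, the one-file-per-session layout and the allowed-character pattern are all assumptions; the pattern is there so a session_id taken from a request cannot point outside the timeline directory.

```python
import json
import re
from pathlib import Path
from typing import Any, Dict, Optional

# Assumption: session ids only contain letters, digits, dots and hyphens.
# Rejecting everything else (notably path separators) keeps lookups inside
# the timeline directory.
_SESSION_ID_PATTERN = re.compile(r"[0-9A-Za-z.\-]+")


def load_pipeline_json(timeline_dir: Path, session_id: str) -> Optional[Dict[str, Any]]:
    """Return the saved pipeline json for session_id, or None if unavailable."""
    if not _SESSION_ID_PATTERN.fullmatch(session_id):
        return None
    path = Path(timeline_dir) / f"{session_id}.json"
    if not path.is_file():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

An endpoint handler would then just call this and turn a None result into a "not found" response.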

@antonymilne
Contributor Author

More random thoughts

Do we want to show all the blobs on the timeline? Chances are that changes to the pipeline flowchart are much less frequent than kedro runs. Hence you could have many blobs next to each other that have the same flowchart, which would be pretty boring. Maybe we want some way of grouping together (or hiding) blobs that have the same flowchart. e.g. we could do a diff between sequential pipeline jsons (not sure where in the process this would belong).
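
One cheap way to do the grouping, as a sketch: fingerprint each pipeline json and merge consecutive runs whose fingerprints match. This is a swapped-in shortcut rather than a real diff: it only says whether two flowcharts are identical, and it treats a reordered list of nodes as a change. The function names are invented for this example.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple


def flowchart_fingerprint(pipeline_json: Dict[str, Any]) -> str:
    """Hash a canonical serialisation so equal flowcharts get equal hashes."""
    canonical = json.dumps(pipeline_json, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def group_unchanged_runs(
    runs: List[Tuple[str, Dict[str, Any]]],
) -> List[List[str]]:
    """Group consecutive (session_id, pipeline_json) runs with the same flowchart.

    runs must already be ordered in time. Each inner list is one timeline blob.
    """
    groups: List[List[str]] = []
    previous = None
    for session_id, pipeline_json in runs:
        fingerprint = flowchart_fingerprint(pipeline_json)
        if groups and fingerprint == previous:
            groups[-1].append(session_id)
        else:
            groups.append([session_id])
        previous = fingerprint
    return groups
```

This could run on the backend when building the list of timestamps, or the fingerprint could be stored alongside each file at save time so old jsons never need to be re-read.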

@yetudada
Contributor

So this issue is back on the table because of the conversations around #1218, #1234 and more. I think we need to ground this in a problem question which is, "How might we allow our users to version their pipeline visualisation alongside their experiments?"

@kedro-org kedro-org locked and limited conversation to collaborators Mar 27, 2024
@rashidakanchwala rashidakanchwala converted this issue into discussion #1828 Mar 27, 2024
@github-project-automation github-project-automation bot moved this from Backlog to Done in Kedro-Viz Mar 27, 2024
