
Exploring Metrics on Experiment Tracking - User Testing Synthesis #1627

Closed · NeroOkwa opened this issue Jun 17, 2022 · 14 comments
Labels: Component: Experiment Tracking 🧪 · Design: Research · Type: Parent Issue · Type: User Research Synthesis ✍️

Comments

@NeroOkwa (Contributor) commented Jun 17, 2022

Description

Ability to plot experiment metrics derived from pipeline runs.

This is based on the second high priority issue resulting from the experiment tracking user research, which is:

Visualisation: ability to show plots/comparison graphs/hyperparameters to evaluate metrics trade-offs

What is the problem? 

  • Users want live plot visualisation of model training, and the ability to map hyperparameters directly to model performance

Who are the users of this functionality?

  • Data Scientist, Data Engineer

Why do our users currently have this problem?

  • "Just like to get live plots, or to get like a plot visualisation of the training directly without writing it yourself through file just inside Kedro the route, like just log to training or something else. And then you can just use Kedro instead of passing it to the node and then saving it to file."
  • The current approach of passing metrics to a node is non-intuitive
  • The current approach of mapping hyperparameters to a timestamp, rather than to metric output, makes comparison difficult

What is the impact of solving this problem?

  • "Be able to map hyper parameters to the performance so later on more easily track what was used to produce something vs the timestamp and result were I have to ask myself what did I do at this timestamp what did I do  8 days ago, so tracking hyper parameters"

What could we possibly do?

  • Provide the option to map hyperparameters to metric output (a hedged sketch of metric logging follows below)
  • Integration with Matplotlib (done in current Kedro-Viz release 4.7.0)
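For context, a minimal sketch (assuming Kedro 0.18-era conventions; not taken from this issue) of how metrics can be logged for experiment tracking: a node returns a plain dict of floats and the catalog saves it with a tracking dataset, so each kedro run versions the values for Kedro-Viz to plot. All names below are illustrative.

```python
# Illustrative catalog.yml entry (an assumption, not from this issue):
#   model_metrics:
#     type: tracking.MetricsDataSet
#     filepath: data/09_tracking/model_metrics.json

from sklearn.metrics import mean_absolute_percentage_error, r2_score


def evaluate_model(y_true, y_pred) -> dict:
    """Hypothetical node output: a dict of floats, one snapshot per run."""
    return {
        "mape": mean_absolute_percentage_error(y_true, y_pred),
        "r2": r2_score(y_true, y_pred),
    }
```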
@NeroOkwa NeroOkwa added Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation Component: Experiment Tracking 🧪 Issue/PR that addresses functionality related to experiment tracking labels Jun 17, 2022
@NeroOkwa NeroOkwa self-assigned this Jun 17, 2022
@yetudada (Contributor)

For this issue, it's worth noting that we do have this functionality already.

It's just that:

  • Users are unaware of the feature because it's not documented
  • It's also not in an intuitive place. The functionality was released as the first MVP for experiment tracking because we had already integrated Plotly and it was a simple change. However, it's currently located on the pipeline visualisation tab, not the experiment tracking one. This design also assumes that you know which tracking dataset to click to see the metrics plotted over time, which is a terrible experience if your pipeline has many elements.

@tynandebold tynandebold moved this to Todo in Kedro-Viz Jun 20, 2022
@antonymilne (Contributor)

Note that plotting metrics against parameters and/or kedro runs is a big topic which has been considered by many different tools and also discussed by us before:
https://github.com/quantumblacklabs/private-kedro/issues/1192
#1070 (copy of above issue to public repo but missing some posts)

Just don't want previous discussions or existing solutions from other products to be forgotten about here 🙂

@yetudada yetudada changed the title Experiment Tracking Adoption: Issue 2 - Ability to show plot /hyper parameters for metrics tradeoff. Ability to plot metrics derived from pipeline runs Jun 23, 2022
@yetudada yetudada added this to Roadmap Jun 23, 2022
@yetudada yetudada removed this from Kedro-Viz Jun 23, 2022
@yetudada yetudada moved this to Next in Roadmap Jun 23, 2022
@yetudada yetudada changed the title Ability to plot metrics derived from pipeline runs Ability to plot experiment metrics derived from pipeline runs Jun 23, 2022
@yetudada yetudada moved this from Now - Discovery or Research to Later - Discovery or Research in Roadmap Jul 27, 2022
@tynandebold tynandebold moved this to Todo in Kedro-Viz Aug 1, 2022
@comym comym moved this from Todo to In Progress in Kedro-Viz Aug 1, 2022
@tynandebold (Member) commented Aug 1, 2022

We should be careful with our assumptions here. Some notes about that:

  • As it is right now, the X-axis is the timestamp, and that's impractical. There should be a way for it to be uniform so you don't have clusters of runs and then huge gaps in time.
  • The Y-axis doesn't need to be only between 0 and 1. It can be arbitrarily high or low, and it's very possible you'd want to plot multiple metrics on the same scale, one with a huge range and another that's very small. You could normalize the scales or use a parallel coordinates plot (a sketch of the normalization idea follows below).

Bottom line is that the data aren't always going to be nice, not always between 0 and 1, or play nice together if a user is tracking multiple metrics on one plot.
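To make the normalization suggestion above concrete, a minimal sketch with invented run names and metric values: min-max scale each metric so series with very different ranges can share one y-axis.

```python
import pandas as pd

# Invented data: one metric on a tiny scale, one on a huge scale.
runs = pd.DataFrame(
    {"mape": [0.12, 0.09, 0.11], "loss": [950.0, 1210.0, 700.0]},
    index=["run_1", "run_2", "run_3"],
)

# Min-max scale each column to [0, 1] so both metrics fit on one axis.
normalized = (runs - runs.min()) / (runs.max() - runs.min())
print(normalized)
```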

Do reference @AntonyMilneQB's comment here for more context.

Let's pick @noklam's brain about this, too. He may have some great real-world experience with some other tools in this space that do similar things.

@noklam (Contributor) commented Aug 1, 2022

As I understand it, we are discussing comparison plots across runs here.

As it is right now, the X-axis is the timestamp, and that's impractical. There should be a way for it to be uniform so you don't have clusters of runs and then huge gaps in time.

This feature is available in almost every experiment tracking tool, usually for the X-axis within a single run, but I think it's mostly valid across runs as well. The usual options (sketched in code below) are:

  • Steps (an incremental counter), i.e. 1, 2, 3, 4, etc., which gives you an even interval
  • Relative (I forget exactly what this is)
  • Wall time (CPU time, similar to elapsed time)

See this screenshot from TensorBoard (note the "Horizontal Axis" panel): [screenshot]

See something similar in Weights & Biases, which is really flexible and configurable: [screenshot]
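A small sketch of the three horizontal-axis options listed above, computed from invented run timestamps; reading "relative" as time since the first event is my assumption about how TensorBoard defines it.

```python
from datetime import datetime

# Invented run timestamps, with an 8-day gap to show the clustering problem.
timestamps = [
    datetime(2022, 8, 1, 9, 0),
    datetime(2022, 8, 1, 9, 45),
    datetime(2022, 8, 9, 14, 0),
]

# Steps: an incremental counter, giving evenly spaced points.
steps = list(range(1, len(timestamps) + 1))

# Relative: hours since the first run (assumed meaning of "relative").
relative_hours = [(t - timestamps[0]).total_seconds() / 3600 for t in timestamps]

# Wall time: the raw timestamps, clusters and gaps included.
wall_time = timestamps
```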

The Y-axis doesn't need to be only between 0 and 1. It can be arbitrarily high or low, and it's very possible you'd want to plot multiple metrics on the same scale, one with a huge range and another that's very small. You could normalize the scales or use a parallel coordinates plot.

  • Scaling / Ignoring outlier / Select top N runs
  • Changing Chart type
  • Smoothing

I think it all makes sense, but some of the features would be difficult to implement, and the live plot is mentioned in this issue. The more raw data you keep, the more flexibly you can customize these plots later. Another limiting factor for the live plot is that we only save output at the end of a node's execution. We would need to keep data at a more granular level to support live plots and these chart customizations. That would be a huge change on the backend, though, and doesn't fit well with the node-execution paradigm.

Side note:
AFAIK W&B also runs on a GraphQL API with vega (or vega-lite), which is based on d3.js. In Python there is Altair, which supports vega-lite (a tiny sketch follows). This crazy example shows how customizable it can be, though it's not a common use case.
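Purely for illustration, a tiny Altair sketch (invented dataframe; not how Kedro-Viz or W&B actually build their charts) showing how compact a metrics-over-runs chart is in the vega-lite ecosystem:

```python
import altair as alt
import pandas as pd

# Invented per-run metric values.
df = pd.DataFrame({"run": ["run_1", "run_2", "run_3"], "mape": [0.12, 0.09, 0.11]})

# One metric across runs as a line chart; Altair compiles this to vega-lite.
chart = alt.Chart(df).mark_line(point=True).encode(x="run:N", y="mape:Q")
chart.save("metrics.html")  # a self-contained HTML page rendering the chart
```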

@antonymilne (Contributor)

Just to clarify, I don't think live plotting of metric vs. epoch is in scope here at all (as @noklam says, we can't do anything like that without a lot more work on Kedro core, and it would be quite a paradigm shift). For now we're just concerned with comparing metrics saved as a dataset (i.e. from a node output) in one Kedro run vs. the same dataset(s) in another Kedro run. What does work "live" here is that when you do kedro run, the newest datasets are available in Kedro-Viz straight away, without refreshing or needing to restart the server, thanks to the GraphQL subscription.

@yetudada yetudada assigned comym and Mackay031 and unassigned NeroOkwa and comym Aug 8, 2022
@yetudada yetudada assigned comym and unassigned Mackay031 Aug 22, 2022
@yetudada yetudada moved this from Discovery or Research - Later 🧪 to Discovery or Research - Now ⏳ in Roadmap Aug 22, 2022
@yetudada (Contributor)

Hey everyone! I won't be in the Experiment Tracking review session tomorrow and I just have some thoughts on the current prototype design.

So from what I understand the original problem we're supposed to be solving is: "I'm choosing not to use Kedro-Viz Experiment Tracking because it doesn't allow me to visualise metrics over time."

I may be wrong, but I assumed it would be as simple as saying: "I've done 20 pipeline runs, I was tracking mean_absolute_percentage_error, and I want to see how my mean_absolute_percentage_error changed over time by looking at a plot of the values against time on a chart." Is this view correct or incorrect?

The reason I ask this is because:

  • Kedro-Viz supports this already; it's just that this view is on the Pipeline Visualisation tab and users appear not to know about it. We have to document how to find it. PerformanceAI supported this view too (#A), as individual metrics plotted over the same time frame, and it appears to have been a well-used feature; see the time-series feature (#B).
  • The current prototype seems to solve a subset of this problem (but it's still a different problem), which is: "How do I plot different metrics against each other so I can make a better choice about which experiment to select?" We have seen this before: PerformanceAI had the spider diagram, which was essentially a circular or radar version of the parallel plot (#C), and we know this diagram was not used when users had too many experiments on the chart because it became unreadable.

So at the end of the day, the question becomes which problem are we solving for our users to increase adoption of Kedro-Viz Experiment Tracking? Are our users choosing not to use Kedro-Viz Experiment Tracking because:

  • They think we don't support a way to visualise a metric over time?
  • Or, because we don't make it easy to compare multiple metrics over time and select the best experiment?

I'm inclined to think it's the first problem but I'm also happy to be proven wrong on this. So keeping in mind that I'm also making assumptions throughout this piece, I would propose the following structure for user testing, which would provide more insights into the impact of not delivering on either of those problem statements:

  1. Show the users how to find the metrics plots on the Pipeline Visualisation using demo.kedro.org
  2. Ask for feedback; does this feature support a way for them to visualise metrics over time? And how can it be improved?
  3. Show them the new prototype and ask similar questions. The assumption that users can only compare three experiments must be stated to the users.
  4. Ask the users if they would solely use the new prototype in their work and would no longer need the metrics plots on the Pipeline Visualisation tab, because the success of this design should be that they don't need the first view that we shipped.

Visual References

A: Screenshot 2022-08-22 at 18 54 56 (PerformanceAI metrics plotted over time)
B: Screenshot 2022-08-22 at 18 57 09 (PerformanceAI time-series feature)
C: Screenshot 2022-08-22 at 18 53 13 (PerformanceAI spider diagram)

@antonymilne (Contributor)

One final thought while it occurs to me: you can actually sort of retain the time ordering in the parallel coordinates plot if you colour the lines somehow, e.g. to show the oldest ones fainter than the most recent ones (a quick sketch below). Not super important, because I don't think the time ordering is that important, but at least highlighting the most recent run might be nice.
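A quick sketch of that idea with invented data: draw the parallel-coordinates lines by hand in matplotlib and fade older runs via the alpha channel (runs assumed ordered oldest-first).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented runs, ordered oldest-first.
runs = pd.DataFrame(
    {"mape": [0.12, 0.10, 0.08], "r2": [0.81, 0.85, 0.90]},
    index=["run_1", "run_2", "run_3"],
)
# Min-max scale so both metrics share one vertical axis.
scaled = (runs - runs.min()) / (runs.max() - runs.min())

fig, ax = plt.subplots()
for i, (name, row) in enumerate(scaled.iterrows()):
    alpha = (i + 1) / len(scaled)  # oldest run faintest, newest fully opaque
    ax.plot(runs.columns, row.to_numpy(), marker="o", alpha=alpha, label=name)
ax.set_ylabel("scaled metric value")
ax.legend()
plt.show()
```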

@tynandebold (Member)

As I said in the meeting yesterday, my intuition and instinct around what a user may want for new features here isn't sharp. I defer to @AntonyMilneQB, @noklam, and others who have used things like this in the past while doing real DS/DE/ML work.

What I do think we need is consistency with our hierarchy of information and a viable amount of added value with whatever we develop next. A few things stood out to me during the meeting yesterday:

  • A Parallel Coordinates plot should take precedence over a Time-series one.
  • We should reassess our limit of only allowing three runs to be compared.
  • If we plan to segment our displays into something like Overview, Metrics, and Plots, let's ensure that the default view is the most valuable and most-often used.

I'm excited to hear what our interviewees say when this is shown to them.

Lastly, calling out @noklam here. Please add some thoughts and comments if you have some. I think they're invaluable here!

@antonymilne (Contributor)

While browsing the original issue I came across this from @mkretsch327 (ex-QB data scientist). Basically I think DS (me, Nok, Matt) like the parallel coordinates plot 👍

For a metrics-over-runs view, I've found a parallel coordinate-like plot (essentially a flattened version of the circular metrics plot from PerformanceAI) to be super-useful. A majority of the time I'm looking to see what runs resulted in metrics that are at the extremes of a range (high or low), and that chart ends up providing that information concisely, even for relatively large numbers of metrics.

@yetudada (Contributor) commented Sep 7, 2022

I'm happy with this. I will say that we will prioritise one view to solve the original user problem that was raised. At this point it's either parallel coordinates or time series; it won't be both, because we have other problems to solve once this is completed. And I want to feel certain that if we acted on kedro-org/kedro-viz#1000 we would be doing the right thing.

Admittedly, I am a bit nervous about the parallel plot because we had feedback about the spider diagram when we were evaluating PAI. I highlighted the relevant insight in dark pink.

Screenshot 2022-09-07 at 14 51 51 (click the image to head through to the research)

@comym (Contributor) commented Sep 7, 2022

Let's see what users say.

I see how the spider diagram might be confusing for some (even though it is the same thing as parallel coordinates). It might look cool to some, but the fact that it was circular added too much visual complexity and made it harder to read. This is not an issue specific to this graphic but a universal visual design fact: when you flatten "the same" chart into a horizontal alignment, it becomes much more digestible.

I understand picking one or the other for now for the sake of practicality and moving forward iteratively, but I would not discard either of them, since they are different ways of exploring the data from different angles.

Again, let's ask the right questions and listen to what users say over the sessions. Loads of great insights are coming.

@comym comym changed the title Ability to plot experiment metrics derived from pipeline runs Exploring Metrics on Experiment Tracking - Prepare initial concepts to be validated with users Sep 8, 2022
@comym comym moved this from In Progress to In Review in Kedro-Viz Sep 8, 2022
@yetudada yetudada moved this from In Review to Done in Kedro-Viz Sep 12, 2022
@NeroOkwa NeroOkwa changed the title Exploring Metrics on Experiment Tracking - Prepare initial concepts to be validated with users Exploring Metrics on Experiment Tracking - User Testing Synthesis Oct 19, 2022
@NeroOkwa NeroOkwa moved this from Done to In Progress in Kedro-Viz Oct 19, 2022
@NeroOkwa NeroOkwa assigned NeroOkwa and unassigned comym Oct 19, 2022
@NeroOkwa NeroOkwa moved this from In Progress to In Review in Kedro-Viz Oct 24, 2022
@NeroOkwa NeroOkwa moved this from In Review to Done in Kedro-Viz Nov 8, 2022
@NeroOkwa (Contributor, Author) commented Nov 9, 2022

User Testing Synthesis - Results

Goal and Methodology

The goal of this session was to evaluate the usability and value risk of the feature proposed in #1627 (tracking metrics over time), through a low-fidelity mockup and a high-fidelity prototype.

The research used a qualitative (interviews 🎤, 6 participants) and quantitative (polls 🗳️) approach across the QuantumBlack and open-source user bases.

1 - Experiment Tracking Use Case

Summary: 2/6 users currently use the Kedro experiment tracking feature. Users used experiment tracking to understand their experiments and to find the best one, iterating with different parameters to produce different metrics. This was done using MLflow, Weights & Biases, and Tableau.

  • “I use experiment tracking to understand all my experiments and find the best one. When I’m iterating on a model, I'm probably gonna test different combination of parameters from my kedro pipeline and maybe even different dataset. And for each of those, I'm gonna log different success metrics. I use experiment tracking to log and visualise all those metrics so I don't need to go through each of the individual datasets”

2 - On Plotly Visualisation in Flowchart Mode

Summary: 3/6 users knew of this feature and had used it to plot their metrics. One user mentioned that its location is non-intuitive and difficult for non-users to find.

  • “Yeah I think it will cover my use case for the metrics that we're plotting right now, like regarding one single experiment”.
  • “It's not intuitive for the user to get the plot there. Like me knowing that it's there is fine, but I feel like others would not know it’s there. Also for large pipelines, it's not easy to find that specific node and click in there to see the plot”.

3 - Knowing which Metrics to track

Summary: 3/6 users start with a clear metric to track, defined by the project, while others don't and are more exploratory.

  • “It depends a lot on the project, but while I usually try to have one main metric that defines how well a model is doing or how well my experiments are going, I like seeing many of them at the same time. Yeah maybe this one improved, but this one went down and those kinds of slightly more complex relationships”.

4 - On New Tab Design

Summary: All 6 users preferred the new tab design.

  • “I think they're nice, intuitive. They don't get in the way. I think it's actually very nice to be able to browse through tabs instead of scrolling down to find the plots for example”.

5 - On Plots: Parallel Coordinates & Time Series

Summary: Two users preferred the time-series plot, two preferred the parallel coordinates plot, and two liked and would use both plots for different use cases.

  • When a user is focusing on a single run and a single metric, the time-series plot is preferred. When a user is focused on multiple runs and wants to quickly identify the best-performing run or get an overview, the parallel coordinates plot is preferred.
  • A source of confusion in the parallel coordinates plot was the axes: how to combine different values with different scales.
  • “So maybe both are useful. Because maybe if you want to inspect one particular metric then it's better to have an individual plot(time series), right? If you want to zoom in or something and then maybe you can have parallel if you find a way to have the scale adjusted in a way that you can get a sense of the bigger picture”

6 - On Comparison Mode

Summary: 4/6 users preferred comparison mode with the parallel coordinates plot over the time series. One user found comparison mode and the 'Metrics' tab confusing.

  • “To me right now, it's a bit confusing, the difference between comparison mode and for example the metrics stuff, because in the end if you're not in comparison mode but doing the metrics tab, you're still comparing runs”.
  • “The confusion lies because overview is one run, plots is also one run, but then metrics is all runs and then on top of that, you have run comparison mode”.

7 - Pain Points

Summary: The most common pain point, identified by 4/6 users, was the axes: the ability to change the scales or to format the values as percentages for easy comparison (a small formatting sketch follows after this list).

  • “Even though we were able to very quickly compare metrics over time, by looking at a table, it was very hard to compare them because Kedro-Viz didn't round them for you or turn them into percentages.”
  • The other pain point was the need to filter experiments by a particular parameter:
  • “I want to see how my metrics change based on this value of a hyper parameter. But at the moment I cannot filter in experiment tracking on this value of my run. So then this is where I'm limited with Kedro-Viz”.

Features still missing for User’s Pain Point

Summary: There were general feature requests and requests specific to the plots. The most common general feature, identified by 3/6 users, was filtering (a filtering sketch follows after this list), followed by the ability to change the axes or customise how metric values are displayed.

  • General:
    • FILTERING - Existing ticket #1039
      • "But actually I haven't been able to find a way yet in Kedro-Viz to to filter on this value that I'm tracking”.
      • "Maybe some kind of way to filter runs by metric values, sort of like ones where the error rate was less than the 0.5 or something like that”.
    • CUSTOMISATION
      • “One is if possible to allow the user to choose how they want metrics to be displayed in terms of like formats, like do they want it to be a percentage? Do they want it to have a lot of digits? Just a few digits ? always defaulting to a decimal of five digits can be pretty overwhelming”
  • Specific:
    • Time Series - "Ability to zoom in and out in the graphs to show more of a specific period." "If this goes on and on to the right or to the left, I would expect to be able to shift-scroll to go to the right and to the left. So basically the three scroll functionalities: the normal scroll, the control, and the shift"
    • Parallel Coordinate - AXIS/CUSTOMISATION - “The best way would probably be some metadata that I set up so that I can configure this graph to show for example either I want to show them as percentages or not, or maybe set up on the metric what I want to be the maximum value to show on this graph”
    • Plots Tab - DATA DESCRIPTION - “So basically I would like to have a description or something that I can click so that I can understand the data that is involved here. Because if I'm going to show this to a client for example I don't want to keep switching from documentation to here”.
  • 2/5 users indicated that if these changes are implemented, Kedro experiment tracking would be their preferred and sole experiment tracking tool.
    • “Yeah, so for me at the moment I think all of the features you have already would bring me a lot of value. The only thing is if I will have like a better filtering strategy then I think I will basically do nearly all of what I am interested in doing with visualization and tracking metrics”

Problems we still need to consider for the future

  • Multi-user capability - Existing ticket #1628
    • “It was easier for me to just put the MLflow in the cloud and just report there. And I thought that the native kedro one at least when I tried it was still very coupled, and I didn't know how to make many nodes just respond to one database, and share the database across many data scientists and many projects and stuff like that”.
  • Kedro-Viz Blockers - Existing ticket #987
    • "The reality is it’s harder for people to get access to Viz. I know platform McKinsey has made it easier to host Kedro-Viz but I think you still need to do some kind of workaround".

@yetudada (Contributor) commented Aug 3, 2023

I'll close this 🥳 This theme is complete.

@yetudada yetudada closed this as completed Aug 3, 2023
Projects: Status: Done · Status: Shipped 🚀