Integrate VegaFusion into JupyterChart #3281

Merged: 14 commits, Dec 26, 2023

Conversation

jonmmease
Contributor

Overview

This PR updates JupyterChart to improve the integration with VegaFusion so that interactive data transformations can be performed in the Python kernel rather than the browser. This brings the capabilities of the dedicated VegaFusion Widget Renderer to JupyterChart.

Benefits

Let's start with an example of interactive crossfiltering on a 2 million row flights dataset.

import altair as alt
import pandas as pd
from vega_datasets import data

alt.data_transformers.enable("vegafusion")

# Load data
source = pd.concat([data.flights_2k()] * 1000, axis=0)
# source = pd.concat([data.flights_2k()] * 10, axis=0)
print(f"{len(source)} rows")

# Build crossfiltered chart
brush = alt.selection_interval(encodings=['x'])

# Define the base chart, with the common parts of the
# background and highlights
base = alt.Chart(width=160, height=130).mark_bar().encode(
    x=alt.X(alt.repeat('column')).bin(maxbins=20),
    y='count()'
)

# gray background with selection
background = base.encode(
    color=alt.value('#ddd')
).add_params(brush)

# blue highlights on the transformed data
highlight = base.transform_filter(brush)

# layer the two charts & repeat
chart = alt.layer(
    background,
    highlight,
    data=source
).transform_calculate(
    "time",
    "hours(datum.date)"
).repeat(column=["distance", "delay", "time"])

alt.jupyter.JupyterChart(chart)
2m.mov

With this PR, the full dataset is never sent to the browser. Each time a selection changes, the widget sends a signal to the Python kernel, VegaFusion performs the filtering and aggregation in Python, and the result is pushed back to the browser.

How it's enabled

As before, VegaFusion is enabled in Altair globally using alt.data_transformers.enable("vegafusion"). When enabled, JupyterChart will automatically take advantage of VegaFusion. When not enabled, JupyterChart will continue to function as before. VegaFusion is still an optional dependency.

How it works

This PR takes advantage of the ChartState construct added in VegaFusion 1.5.0 (see vega/vegafusion#426). The ChartState performs the initial spec transformation and provides a watch plan, which specifies the signals and datasets that must be sent from the Vega renderer back to the ChartState in order to preserve chart interactivity.

The ChartState is also responsible for holding references to inline datasets. So unlike VegaFusion's VegaFusionWidget approach, source dataframes do not need to be written to disk 🎉
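
The round trip described above can be pictured with a toy model. This is a minimal pure-Python sketch, not VegaFusion's implementation or API: `ToyChartState`, its `watch_plan` attribute, and the dataset name `filtered_counts` are hypothetical names chosen for illustration. The message fields (`namespace`, `name`, `scope`, `value`) are borrowed from the update format that appears later in this thread.

```python
from collections import Counter


class ToyChartState:
    """Toy stand-in for a kernel-side chart state (hypothetical, for illustration)."""

    def __init__(self, rows, field, bin_width):
        self._rows = rows            # the full dataset stays in the kernel
        self._field = field
        self._bin_width = bin_width
        # Variables the renderer must report back to keep the chart interactive.
        self.watch_plan = [{"namespace": "signal", "name": "brush", "scope": []}]

    def _binned_counts(self, lo, hi):
        """Filter rows to [lo, hi], then count them per bin."""
        counts = Counter()
        for row in self._rows:
            value = row[self._field]
            if lo <= value <= hi:
                bin_start = (value // self._bin_width) * self._bin_width
                counts[bin_start] += 1
        return [{"bin_start": b, "count": c} for b, c in sorted(counts.items())]

    def update(self, updates):
        """Apply renderer-side updates and return the small aggregated results."""
        responses = []
        for msg in updates:
            if msg["namespace"] == "signal" and msg["name"] == "brush":
                lo, hi = msg["value"]
                responses.append({
                    "namespace": "data",
                    "name": "filtered_counts",
                    "scope": [],
                    "value": self._binned_counts(lo, hi),
                })
        return responses


rows = [{"delay": d} for d in (3, 7, 12, 18, 25, 31, 44)]
state = ToyChartState(rows, field="delay", bin_width=10)
reply = state.update(
    [{"namespace": "signal", "name": "brush", "scope": [], "value": [5, 30]}]
)
print(reply[0]["value"])
```

Only the aggregated bins cross the kernel/browser boundary in this model, which is why the size of the source dataset does not affect what the browser has to hold.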

Note on local_tz

One subtlety is that the ChartState requires the browser's local timezone in order to perform its data transformations. Because of this, I added a local_tz traitlet that the widget sets to the browser's local timezone. The Python side registers a callback on this traitlet and builds the ChartState once the timezone is available in Python.
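
To see why the timezone matters, consider an expression such as `hours(datum.date)` from the example above, which Vega evaluates in local time. The sketch below uses only the Python standard library and fixed UTC offsets, which is a simplification (real time zones also have daylight-saving rules); `local_hour` is a hypothetical helper, not part of Altair or VegaFusion.

```python
from datetime import datetime, timedelta, timezone


def local_hour(utc_instant, utc_offset_hours):
    """Hour of day for a UTC instant as seen at a fixed UTC offset."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    return utc_instant.astimezone(local_tz).hour


instant = datetime(2023, 12, 13, 2, 30, tzinfo=timezone.utc)
print(local_hour(instant, 0))    # a kernel that assumes UTC
print(local_hour(instant, -5))   # a browser five hours behind UTC
```

The same instant falls into a different hour bin depending on the offset, so a kernel that guessed the timezone could produce histograms that disagree with what the browser would have computed.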

Selection / Param access

When VegaFusion is enabled, it's still possible to access the Chart's selections and value parameters.

Future of VegaFusionWidget

If these updates are accepted into JupyterChart, I plan on deprecating VegaFusionWidget (and the entire vegafusion-jupyter Python package).

I'll update the Altair docs in a follow-on PR to remove mention of VegaFusionWidget and explain the functionality of JupyterChart when VegaFusion is enabled.

@jonmmease
Contributor Author

@binste, when I run hatch run mypy altair tests locally I'm not seeing any mypy issues. Do you have any ideas on what I might need to do to reproduce the failures in lint / ruff-mypy locally?

@jonmmease
Contributor Author

cc @domoritz, as this is something we've talked about at various times over the past couple of years

@domoritz
Member

Very cool. Super exciting to have out-of-the-box scalability with Altair through this feature.

@mattijn
Contributor

mattijn commented Dec 13, 2023

As always, thanks @jonmmease! To have all of this integrated smoothly within altair is really something! Foundational work!
Is it possible to enter a kind of debug mode? Logging or printing what is being queried within vegafusion upon interacting with a visualisation?

I tried the example Interactive Chart with Aggregation but it is not working with vegafusion enabled:

import altair as alt
from vega_datasets import data
alt.data_transformers.enable("vegafusion")

source = data.movies.url

slider = alt.binding_range(min=0, max=10, step=0.1, name="threshold")
threshold = alt.param(value=5, bind=slider)

chart = alt.layer(
    alt.Chart(source).mark_circle().encode(
        x=alt.X("IMDB_Rating:Q").title("IMDB Rating"),
        y=alt.Y("Rotten_Tomatoes_Rating:Q").title("Rotten Tomatoes Rating")
    ).transform_filter(
        alt.datum["IMDB_Rating"] >= threshold
    ),

    alt.Chart(source).mark_circle().encode(
        x=alt.X("IMDB_Rating:Q").bin(maxbins=10),
        y=alt.Y("Rotten_Tomatoes_Rating:Q").bin(maxbins=10),
        size=alt.Size("count():Q").scale(domain=[0,160])
    ).transform_filter(
        alt.datum["IMDB_Rating"] < threshold
    ),

    alt.Chart().mark_rule(color="gray").encode(
        strokeWidth=alt.StrokeWidth(value=6),
        x=alt.X(datum=alt.expr(threshold.name), type="quantitative")
    )
).add_params(threshold)

alt.JupyterChart(chart)

It initializes correctly, but upon using the slider it errors with:

TraitError: The 'threshold' trait of a Params instance expected an int, not the float 5.3.

The Interval Selection on a Map is also not working with vegafusion enabled, but I don't think that is actually related to this PR.

@jonmmease
Contributor Author

Thanks for the kind words and for trying out the PR @mattijn. I'll take a look!

@jonmmease
Contributor Author

I fixed the param value error in 8689751 and opened vega/vegafusion#434 to track the error you hit in the interval selection on map example.
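
The idea behind that fix, going by the commit title "Use float if initial param value is int", can be sketched as follows. This is an illustration of the approach only, not the code from the commit, and `initial_value_for_trait` is a hypothetical name. The problem it addresses: a slider with `step=0.1` emits floats such as 5.3, but a trait whose type was inferred from the integer initial value 5 rejects them.

```python
def initial_value_for_trait(value):
    """Promote an int initial value to float so later float updates validate.

    bool is left alone because it is a subclass of int in Python.
    """
    if isinstance(value, int) and not isinstance(value, bool):
        return float(value)
    return value


print(initial_value_for_trait(5))
print(initial_value_for_trait(5.3))
print(initial_value_for_trait("a"))
```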

Is it possible to enter a kind of debug mode? Logging or printing what is being queried within vegafusion upon interacting with a visualisation?

This is a good idea. I'll add a verbose flag that logs out the variable values sent between the Python kernel and the widget.

@jonmmease
Contributor Author

In 893bc59 I added a debug flag to JupyterChart. When it's True, the VegaFusion messages are printed, and in JupyterLab they end up in the log pane like this:

Screenshot 2023-12-13 at 5 55 50 PM

In the future we could log more things as well. See if that makes sense @mattijn

@joelostblom
Contributor

Wow, so exciting to have this functionality directly in JupyterChart! I'm in favor of this direction, as I think it would make it more convenient to work with large data and also simplify things by having one less renderer and Python package.

If I remember correctly, in the past we briefly talked about having a global option for enabling JupyterChart as the default for all Altair charts, similar to how we have the data_transformers right now. Would this still be viable after this PR, or would it become difficult now that there is an interaction with the global vegafusion option? In either case, I think the functionality in this PR is more helpful than the potential global option.

@jonmmease
Contributor Author

If I remember correctly, in the past we briefly talked about having a global option for enabling JupyterChart as the default for all Altair charts

When we talked about this previously, I pictured adding a new renderer that would display charts using JupyterChart. Something like

alt.renderers.enable("jupyter")

This would be orthogonal to the VegaFusion data transformer used in this PR, so hypothetically you could do this to enable both:

alt.renderers.enable("jupyter")
alt.data_transformers.enable("vegafusion")

Does that make sense @joelostblom?

@joelostblom
Contributor

Yup that makes sense; great that this works smoothly, I will check out the other PR you put up.

@mattijn
Contributor

mattijn commented Dec 16, 2023

I'm not sure what I'm missing, but if I checkout the latest changes and double check that I'm on the right branch including latest commits:

!git log --oneline -5
ee2aed45 (HEAD -> jonmmease/vegafusion-widget, origin/jonmmease/vegafusion-widget) mypy fixes
893bc597 Add debug property and use this to enable printing VegaFusion messages (these will end up in the JupyterLab console)
8689751a Use float if initial param value is int
11a90569 Fix JupyterChart tests
ae373a9e bump vegafusion in pyproject.toml

I don't get a TraitError, using the spec from #3281 (comment). But the chart is not updating correctly and I don't see the debug logging in the log console.

Screen.Recording.2023-12-16.at.14.19.40.mov

@jonmmease
Contributor Author

I'm not sure what I'm missing

You're not missing anything, I see what you mean now. I'll dig into why this isn't updating

@mattijn
Contributor

mattijn commented Dec 16, 2023

By the way, does this mean that this PR makes it possible to set selections and update/stream data? I've the feeling this PR unlocks more possibilities than just integrating VegaFusion into the Altair JupyterChart. Or is this just wishful thinking?

@jonmmease
Contributor Author

By the way, does this mean that this PR makes it possible to set selections and update/stream data? I've the feeling this PR unlocks more possibilities than just integrating VegaFusion into the Altair JupyterChart. Or is this just wishful thinking?

This PR makes it possible to update and listen to arbitrary signals and datasets in the Vega spec. So there are a lot more things that could be done with this. One example would be to update (though not stream) datasets in place. Streaming data (as in appending to the data that's already displayed) would require some additional work.

One caveat is that it wouldn't (currently) work to combine the VegaFusion integration with other arbitrary updates to the widget's signals and datasets.

@jonmmease
Contributor Author

I updated the Large Datasets documentation to describe using JupyterChart and remove mention of VegaFusionWidget.

@jonmmease
Contributor Author

@mattijn, this is ready for another look. I just released VegaFusion 1.5.1 which fixes the two chart errors you ran into.

@mattijn
Contributor

mattijn commented Dec 22, 2023

After updating vegafusion, I was getting this error:

AttributeError: 'builtins.PyVegaFusionRuntime' object has no attribute 'new_chart_state'

Checking versions of vegafusion:

(stable) D:\Software\altair-viz\altair>conda list vegafusion
# packages in environment at D:\Software\Miniconda3\envs\stable:
#
# Name                    Version                   Build  Channel
vegafusion                1.5.1              pyhd8ed1ab_0    conda-forge
vegafusion-python-embed   1.4.3           py310he2c049f_0    conda-forge

After updating vegafusion-python-embed as well, all works. Maybe we should introduce a check for this that these packages are in sync?

Thanks again @jonmmease!

@jonmmease
Contributor Author

Maybe we should introduce a check for this that these packages are in sync?

Yeah, that's a good idea. I opened #3296 to track this.
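
One possible shape for such a check, as a sketch using only the standard library. It assumes the two packages are meant to carry identical version numbers, which is an assumption here rather than something stated in this thread; `installed_version` and `check_in_sync` are hypothetical helpers, not existing Altair or VegaFusion functions.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_in_sync(packages, lookup=installed_version):
    """Raise ImportError unless all installed packages report the same version."""
    found = {name: lookup(name) for name in packages}
    installed = {name: v for name, v in found.items() if v is not None}
    if len(set(installed.values())) > 1:
        details = ", ".join(f"{name}=={v}" for name, v in installed.items())
        raise ImportError(f"Package versions are out of sync: {details}")
    return found


print(check_in_sync(["vegafusion", "vegafusion-python-embed"]))
```

The `lookup` parameter exists so the comparison can be exercised without installing anything, for example by passing the `get` method of a dict that maps package names to versions.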

I think I'll merge this on Monday if there isn't any more feedback. Thanks all!

Contributor

@binste binste left a comment

The culmination of a lot of great work!! Thank you @jonmmease. This, together with the "jupyter" renderer makes it super easy for users to work with large datasets and explore them interactively. 🥳

Sorry that it took me a while to have a look, wasn't because I am not excited about this PR ;) I should have some more time over the holidays, just ping me when I can help out somewhere.

@jonmmease jonmmease merged commit ebf9da5 into main Dec 26, 2023
20 checks passed
@mattijn
Contributor

mattijn commented Jan 22, 2024

Quick question @jonmmease. You made this statement before:

This PR makes it possible to update and listen to arbitrary signals and datasets in the Vega spec. So there are a lot more things that could be done with this. One example would be to update (though not stream) datasets in place.

I'd like to update a dataset in place. How does this work in practice? For example, in this spec:

import altair as alt
import pandas as pd

list_start = ['C4', 'D4', 'E4', 'E4']
list_update = ['C4', 'D4', 'E4', 'E4', 'D#4', 'C4', 'G4', 'D#5', 'C5', 'G5', 'C4']

df_start = pd.DataFrame({'notes': list_start})
df_update = pd.DataFrame({'notes': list_update})

# Create a chart using Altair
bar_chart = alt.Chart(df_start).mark_bar().encode(
    x='notes:N',
    y='count():O'
)
jchart = alt.JupyterChart(bar_chart)
jchart

How can I update/replace the dataset here in place using the updated list list_update or dataframe df_update?

@jonmmease
Contributor Author

Hi Mattijn, updating a dataset doesn't have a nice API yet, but using the primitives added in this PR you can do something like this:

jchart._py_to_js_updates = [{
    "namespace": "data",
    "scope": [],
    "name": "data-a0e7a86c692327a18bbeb2464725124c",
    "value": df_update.to_dict("records")
}]

Screenshot 2024-01-22 at 5 08 31 PM

Here "data-a0e7a86c692327a18bbeb2464725124c" is the auto-generated dataset name that I found by viewing the spec in the Vega editor.

namespace of "data" is in contrast to "signal", which can be used to update signals.

"scope" of [] means the dataset is defined at the top level of the compiled vega spec (not nested inside a group mark)

We could probably clean this up pretty well and add a jchart.update_data() method, at least for non-compound charts. We'd need to decide how to handle compound charts as well.
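
Until something like that exists, the message can be built with a small helper. This is a sketch only: `make_data_update` is a hypothetical function, not the proposed `update_data()` method, and `_py_to_js_updates` is a private attribute that may change without notice.

```python
def make_data_update(dataset_name, records, scope=()):
    """Build one update message in the format used in the snippet above."""
    return {
        "namespace": "data",     # "signal" would target a signal instead
        "scope": list(scope),    # [] means top level of the compiled Vega spec
        "name": dataset_name,
        "value": list(records),
    }


notes = ["C4", "D4", "E4", "E4"]
update = make_data_update(
    "data-a0e7a86c692327a18bbeb2464725124c",
    [{"notes": n} for n in notes],
)
print(update)

# With a live widget this would be assigned as in the snippet above:
# jchart._py_to_js_updates = [update]
```

The dataset name still has to be looked up in the compiled spec, since it is auto-generated from the data.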

@mattijn
Contributor

mattijn commented Jan 22, 2024

Wow! This will become great!
I connected a piano to a jupyter notebook and now I have the chart updating while playing some notes:

WIOD1ayTs3.mp4

For fun, here is the code I used: https://gist.github.com/mattijn/5d44cd9e261b90c4c2b92bb0d19bc171
