Improve serialization of Pandas DataFrames to ipyvega #346

jdfekete · 2021-06-29T20:41:21Z

With @xtianpoli, we have implemented an improved serialization of Pandas DataFrames for ipyvega.
It is not complete, we need to follow the rules of Altair for column type conversions, but at least, we have a noticeable speedup compared to the current version sending verbose json.

On the python side, for the VegaWidget, we have implemented a new traitlet type for a "Table". It is a dictionary of columns (a columnar representation) of a Pandas DataFrame (and potentially other tables) where we try to avoid copying anything, just point to low-level numpy arrays managed by Pandas, that can be serialized without copy using the buffer protocol.
Additionally, each column can be compressed with either zip or lz4, which boosts the transfer speed of columns.
On the other side, we translate the columnar table format into Vega internal tuples, again avoiding copies when possible.
Note that this serialization is only used by the streaming API since it requires using our traitlet in the VegaWidget, it cannot work inside a vega dataset.

Let us know if you disagree with our design choices.
There are a few possible pitfalls, such as sending multiple DataFrames, not supported (yet). If you see a clean way to do it, let us know.

We also provide some helpers for Altair, but we're not sure how to fully replace the standard Altair method to send data to the browser with ours. It would boil down, when a Altair-genenrated json spec is detected by the notebook, to wrap it with a VegaWidget and call update_dataframe on the Pandas DataFrame immediately after. If you can do that, then Altair would be boosted in a transparent way, able to support much larger datasets.

There are new notebooks to showcase the new capabilities and performances.

We did not use apache-arrow as an intermediary format since it would always make a copy, and since we want to handle large datasets, we want to avoid copying them in the first place.

Looking forwards to your comments, questions, and thoughts.
Best,
Jean-Daniel & Christian

…ting the console. Add python doc.

lgtm-com · 2021-06-29T20:47:47Z

This pull request fixes 6 alerts when merging a814023 into 7cda844 - view on LGTM.com

fixed alerts:

6 for Unused import

domoritz · 2021-06-29T22:52:43Z

Thank you for the pull request. I think this is a good start but I would like to discuss a few options with you.

We did not use apache-arrow as an intermediary format since it would always make a copy, and since we want to handle large datasets, we want to avoid copying them in the first place.

Can you explain where these copies are happening with Arrow? When we use JSON, we already have to make a copy from pandas, no?

There are a few possible pitfalls, such as sending multiple DataFrames, not supported (yet). If you see a clean way to do it, let us know.

What do you mean you don't support multiple dataframes? I think it would be good if we could send data separately from the spec and support multiple datasets.

We also provide some helpers for Altair, but we're not sure how to fully replace the standard Altair method to send data to the browser with ours. It would boil down, when a Altair-genenrated json spec is detected by the notebook, to wrap it with a VegaWidget and call update_dataframe on the Pandas DataFrame immediately after. If you can do that, then Altair would be boosted in a transparent way, able to support much larger datasets.

Better support specifically for Altair is great. Have you adopted the idea of separating data from specs I implemented in https://github.com/streamlit/streamlit/blob/96f17464a13969ecdfe31043ab8e875718eb0d10/lib/streamlit/elements/altair.py#L315?

Additionally, each column can be compressed with either zip or lz4, which boosts the transfer speed of columns.

Does this have much benefit over transparent gzip compression over HTTP? How big is the overhead for compression/decompression and the additional compies we make when we compress data?

jdfekete · 2021-07-01T06:55:11Z

We did not use apache-arrow as an intermediary format since it would always make a copy, and since we want to handle large datasets, we want to avoid copying them in the first place.

Can you explain where these copies are happening with Arrow? When we use JSON, we already have to make a copy from pandas, no?

No. For int and float columns, there is no copy:

import pandas as pd
from vega.widget import VegaWidget
from vega.dataframes.serializers import table_to_json
from vega.dataframes.pandas_adapter import PandasAdapter
w = VegaWidget('whatever')
df = pd.DataFrame({'a': [1,2,3], 'b': [1.5, 2.5, 3.5], 'c': ['a', 'b', 'c']})
adf = PandasAdapter(df)
table_to_json(adf, w)

{'columns': ['a', 'b', 'c'],
 'data': {'a': {'shape': (3,),
   'dtype': 'int32',
   'buffer': <memory at 0x7f5e654b4040>},
  'b': {'shape': (3,),
   'dtype': 'float64',
   'buffer': <memory at 0x7f5e654b4340>},
  'c': {'shape': (3,), 'dtype': 'str', 'buffer': ['a', 'b', 'c']}}}

There are a few possible pitfalls, such as sending multiple DataFrames, not supported (yet). If you see a clean way to do it, let us know.

What do you mean you don't support multiple dataframes? I think it would be good if we could send data separately from the spec and support multiple datasets.

Currently, our proof of concept is based on the streaming API and we only send one dataframe at a time with the update_dataframe method. This can be extended once we agree on the underlying mechanisms.

We also provide some helpers for Altair, but we're not sure how to fully replace the standard Altair method to send data to the browser with ours. It would boil down, when a Altair-genenrated json spec is detected by the notebook, to wrap it with a VegaWidget and call update_dataframe on the Pandas DataFrame immediately after. If you can do that, then Altair would be boosted in a transparent way, able to support much larger datasets.

Better support specifically for Altair is great. Have you adopted the idea of separating data from specs I implemented in https://github.com/streamlit/streamlit/blob/96f17464a13969ecdfe31043ab8e875718eb0d10/lib/streamlit/elements/altair.py#L315?

No. Thanks for pointing to this mechanism, I will see how we can use it with our mechanism. Our examples use a similar but less flexible mechanism.

Additionally, each column can be compressed with either zip or lz4, which boosts the transfer speed of columns.

Does this have much benefit over transparent gzip compression over HTTP? How big is the overhead for compression/decompression and the additional compies we make when we compress data?

Yes it does, see: https://lemire.me/blog/2021/06/30/compressing-json-gzip-vs-zstd/
Basically, using lz4 (used by zstd) instead of gzip is faster to compress, even more to decompress, and more efficient overall. See also the page on zstd: https://facebook.github.io/zstd/

The compression "codecs" (such as lz4 or zlib) should be part of the library and pre-selected for casual users. If you know the distribution profile of your data column, a specific codec can really make a huge difference (See e.g. https://github.com/lemire/FastPFor). Standard zip compression used in HTTP is not efficient and flexible enough to accommodate data characteristics.

lgtm-com · 2021-07-01T13:20:20Z

This pull request fixes 6 alerts when merging 283fe75 into 7cda844 - view on LGTM.com

fixed alerts:

6 for Unused import

domoritz · 2021-07-01T17:39:04Z

Thanks for the notes. I think I would personally still prefer Arrow since it encodes pretty efficiently and is well supported. It will make it a lot easier to maintain ipyvega in the future if we don't roll our own serialization.

domoritz · 2023-01-30T02:16:21Z

Can you merge master to make this pull request ready? In particular, we should not be updating Vega-Lite in this pull request as Altair is still on Vega-Lite 4 for now and I want to coordinate the update with them.

domoritz

Thank you for the pull request. I would love to get this in but it's not quite ready yet.

What is .gitmodules?
Address all the comments in this pull request.

README.md

package.json

pyproject.toml

src/widget.ts

domoritz · 2023-01-31T20:14:24Z

Thank you for making the updates @xtianpoli! Let me know when you are done and want me to make another review.

jdfekete · 2023-01-31T21:24:30Z

All the comments have been addressed.
The PR speeds up the transfer of Pandas dataframes and avoids copying Pandas data in most cases.
Vega/Altair can now be used with reasonable dataframe sizes (100k-10m rows depending of the columns).
We have added more tests, end-to-end tests, especially with Galata, to test that the rendering of the streaming API is identical to the classical API.
Notebooks, when saved with their widget states, show the visualization images when reloaded. Instead of serializing the whole data (which would be too large and hard to compute right when the streaming API is used), the VegaWidget saves the image of the Vega rendering. The notebook is smaller; the result is faster to load, and identical to the appearance of the saved widget with a minor difference in behavior: the user cannot interact with a saved cell because it is not a Vega object; it needs to be recomputed from Python to become a Vega object again.
As promised, ipyvega can now be used with larger dataframes and streaming and progressive applications.

domoritz · 2023-02-12T16:46:57Z

I want to review this but somehow my python setup is botched and now I run into #418 (comment). Stay tuned.

domoritz · 2023-02-12T17:18:30Z

Thank you! I also added you as maintainers of this repo so you can triage issues in the issue tracker.

xtianpoli and others added 24 commits June 7, 2021 19:55

data transfer via ipydatawidgets

01aab0d

fix

162a38e

Fix heatmap for speed

d1989f7

Fix heatmap for speed

8813074

init ipydatatablewidget

28ba8e9

using NDArray instead of DataUnion

44efd02

adding a customized table traitlet

81e90a4

src/widget.ts

94c7077

fix

de25ee6

cleanup

69c689f

adding lz4js decompression

88e5138

fix

c2f33c9

make NumpyAdapter.equals returns always False

5d58cba

Remove most console.log and add %time and progress bars in notebooks

21e9bea

fixing .gitignore

803a0cb

Add stress tests

3badb98

rename update_histogram2d to update_array2d. Remove console.log pollu…

b198c04

…ting the console. Add python doc.

Improve

62eaa47

Add a definition file to configure Altair. More work needed

aa0b328

compression as string

c856d7a

fix (naive approach)

258e0aa

adding an adapter for Progressivis tables

0d6c5d3

Merge branch 'vega:master' into master

5498083

cleanup

a814023

adding a touch mode

283fe75

joelostblom mentioned this pull request Jan 6, 2023

Improve serialization of Pandas DataFrames to ipyvega vega/altair#2471

Closed

3 tasks

xtianpoli and others added 11 commits January 10, 2023 15:59

save widgets rendering as images

c6f6b43

using poetry 1.3.1

e1b7a58

minor improvement

82b6899

serialize static widgets as json

71fc3ef

Regenerate poetry lock file and fix warning in tests

d5d116b

Update packages and try saving a streamed example

bfc96f6

setting rendered images dims

dacf4fb

back to @jupyter-widgets/base=4.1.0 because of jupyterlab

6e5fc40

fix

9a067f3

new fixes+upgraded dependencies

b0493bb

upgrade dev dependency filemanager-webpack-plugin => 7.0.0

e70423f

domoritz requested changes Jan 30, 2023

View reviewed changes

xtianpoli added 4 commits January 30, 2023 09:41

merge upstream

c930fc8

rm .gitmodules+update poetry.loc

baca3f0

using barley.json instead of barley.csv in all cells

ba96399

improvements

3adb15a

jdfekete added 2 commits January 31, 2023 21:46

Improve documentation to function stream

10714a0

Fix docstring

d618580

jdfekete requested a review from domoritz January 31, 2023 21:25

domoritz mentioned this pull request Feb 12, 2023

vega_datasets should be a dev dependency #454

Closed

domoritz merged commit bd93616 into vega:master Feb 12, 2023

mattijn mentioned this pull request Feb 12, 2023

Support for pola.rs DataFrames vega/altair#2868

Closed

3 tasks

domoritz mentioned this pull request Feb 12, 2023

Improve serialization of Pandas DataFrames to ipyvega #345

Closed

joelostblom mentioned this pull request Jul 14, 2023

RFC: Add official Jupyter widget with AnyWidget vega/altair#3106

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve serialization of Pandas DataFrames to ipyvega #346

Improve serialization of Pandas DataFrames to ipyvega #346

jdfekete commented Jun 29, 2021

lgtm-com bot commented Jun 29, 2021

domoritz commented Jun 29, 2021

jdfekete commented Jul 1, 2021 •

edited

Loading

lgtm-com bot commented Jul 1, 2021

domoritz commented Jul 1, 2021

domoritz commented Jan 30, 2023 •

edited

Loading

domoritz left a comment •

edited

Loading

domoritz commented Jan 31, 2023

jdfekete commented Jan 31, 2023

domoritz commented Feb 12, 2023

domoritz commented Feb 12, 2023

Improve serialization of Pandas DataFrames to ipyvega #346

Improve serialization of Pandas DataFrames to ipyvega #346

Conversation

jdfekete commented Jun 29, 2021

lgtm-com bot commented Jun 29, 2021

domoritz commented Jun 29, 2021

jdfekete commented Jul 1, 2021 • edited Loading

lgtm-com bot commented Jul 1, 2021

domoritz commented Jul 1, 2021

domoritz commented Jan 30, 2023 • edited Loading

domoritz left a comment • edited Loading

Choose a reason for hiding this comment

domoritz commented Jan 31, 2023

jdfekete commented Jan 31, 2023

domoritz commented Feb 12, 2023

domoritz commented Feb 12, 2023

jdfekete commented Jul 1, 2021 •

edited

Loading

domoritz commented Jan 30, 2023 •

edited

Loading

domoritz left a comment •

edited

Loading