Improve serialization of Pandas DataFrames to ipyvega #346

Merged
164 commits merged into vega:master on Feb 12, 2023

Conversation

@jdfekete
Collaborator

With @xtianpoli, we have implemented an improved serialization of Pandas DataFrames for ipyvega.
It is not complete yet: we still need to follow Altair's rules for column type conversions, but we already get a noticeable speedup over the current version, which sends verbose JSON.

On the Python side, for the VegaWidget, we have implemented a new traitlet type for a "Table". It is a dictionary of columns (a columnar representation) of a Pandas DataFrame (and potentially other tables). We try to avoid copying anything and instead point to the low-level numpy arrays managed by Pandas, which can be serialized without a copy using the buffer protocol (a rough sketch follows below).
Additionally, each column can be compressed with either zlib or lz4, which boosts the transfer speed of columns.
On the JavaScript side, we translate the columnar table format into Vega internal tuples, again avoiding copies when possible.
Note that this serialization is only used by the streaming API, since it requires using our traitlet in the VegaWidget; it cannot work inside a Vega dataset.
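
To make the idea concrete, here is a minimal sketch of such a columnar dictionary built without copying numeric data; `column_to_dict` and the exact field names are illustrative, not this PR's actual API:

```python
import pandas as pd

def column_to_dict(series: pd.Series) -> dict:
    """Describe one DataFrame column without copying its values."""
    values = series.to_numpy()  # a view onto the Pandas-managed array for numeric dtypes
    if values.dtype.kind in "if":  # int/float columns: expose the raw buffer
        return {
            "shape": values.shape,
            "dtype": values.dtype.name,
            # a memoryview can be serialized over the widget comm via the
            # buffer protocol, so the bytes are never copied on the Python side
            "buffer": memoryview(values),
        }
    # non-numeric columns (e.g. strings) fall back to a plain list
    return {"shape": values.shape, "dtype": "str", "buffer": values.tolist()}

df = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5], "c": ["x", "y", "z"]})
table = {
    "columns": list(df.columns),
    "data": {name: column_to_dict(df[name]) for name in df.columns},
}
```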

Let us know if you disagree with our design choices.
There are a few possible pitfalls; for example, sending multiple DataFrames is not supported (yet). If you see a clean way to do it, let us know.

We also provide some helpers for Altair, but we're not sure how to fully replace the standard Altair method for sending data to the browser with ours. It would boil down to detecting an Altair-generated JSON spec in the notebook, wrapping it in a VegaWidget, and calling update_dataframe on the Pandas DataFrame immediately afterwards. If you can do that, Altair would be boosted transparently and able to support much larger datasets.
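
For instance, the wrapping could look roughly like the following sketch. `update_dataframe` is the method described in this PR; the helper name, the spec surgery, and the dataset name "data" are assumptions:

```python
import altair as alt
import pandas as pd
from vega.widget import VegaWidget

def show_chart(chart: alt.Chart, df: pd.DataFrame) -> VegaWidget:
    spec = chart.to_dict()           # the Altair-generated JSON spec
    spec["data"] = {"name": "data"}  # swap inline values for a named data source
    widget = VegaWidget(spec)        # render the spec without embedded data
    widget.update_dataframe(df)      # then stream the DataFrame columns
    return widget
```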

There are new notebooks that showcase the new capabilities and performance.

We did not use Apache Arrow as an intermediary format since it would always make a copy, and since we want to handle large datasets, we want to avoid copying them in the first place.

Looking forward to your comments, questions, and thoughts.
Best,
Jean-Daniel & Christian

@lgtm-com

lgtm-com bot commented Jun 29, 2021

This pull request fixes 6 alerts when merging a814023 into 7cda844 - view on LGTM.com

fixed alerts:

  • 6 for Unused import

@domoritz
Member

Thank you for the pull request. I think this is a good start but I would like to discuss a few options with you.

> We did not use Apache Arrow as an intermediary format since it would always make a copy, and since we want to handle large datasets, we want to avoid copying them in the first place.

Can you explain where these copies are happening with Arrow? When we use JSON, we already have to make a copy from pandas, no?

> There are a few possible pitfalls; for example, sending multiple DataFrames is not supported (yet). If you see a clean way to do it, let us know.

What do you mean you don't support multiple dataframes? I think it would be good if we could send data separately from the spec and support multiple datasets.

> We also provide some helpers for Altair, but we're not sure how to fully replace the standard Altair method for sending data to the browser with ours. It would boil down to detecting an Altair-generated JSON spec in the notebook, wrapping it in a VegaWidget, and calling update_dataframe on the Pandas DataFrame immediately afterwards. If you can do that, Altair would be boosted transparently and able to support much larger datasets.

Better support specifically for Altair is great. Have you adopted the idea of separating data from specs that I implemented in https://github.com/streamlit/streamlit/blob/96f17464a13969ecdfe31043ab8e875718eb0d10/lib/streamlit/elements/altair.py#L315?

> Additionally, each column can be compressed with either zlib or lz4, which boosts the transfer speed of columns.

Does this have much benefit over transparent gzip compression over HTTP? How big is the overhead for compression/decompression, and how big are the additional copies we make when we compress data?

@jdfekete
Collaborator Author

jdfekete commented Jul 1, 2021

> > We did not use Apache Arrow as an intermediary format since it would always make a copy, and since we want to handle large datasets, we want to avoid copying them in the first place.
>
> Can you explain where these copies are happening with Arrow? When we use JSON, we already have to make a copy from pandas, no?

No. For int and float columns, there is no copy:

```python
import pandas as pd
from vega.widget import VegaWidget
from vega.dataframes.serializers import table_to_json
from vega.dataframes.pandas_adapter import PandasAdapter

w = VegaWidget('whatever')
df = pd.DataFrame({'a': [1, 2, 3], 'b': [1.5, 2.5, 3.5], 'c': ['a', 'b', 'c']})
adf = PandasAdapter(df)  # columnar adapter over the DataFrame
table_to_json(adf, w)
```

The numeric buffers are memoryviews pointing directly into the numpy arrays managed by Pandas; only the string column falls back to a Python list:

```
{'columns': ['a', 'b', 'c'],
 'data': {'a': {'shape': (3,), 'dtype': 'int32', 'buffer': <memory at 0x7f5e654b4040>},
          'b': {'shape': (3,), 'dtype': 'float64', 'buffer': <memory at 0x7f5e654b4340>},
          'c': {'shape': (3,), 'dtype': 'str', 'buffer': ['a', 'b', 'c']}}}
```

> > There are a few possible pitfalls; for example, sending multiple DataFrames is not supported (yet). If you see a clean way to do it, let us know.
>
> What do you mean you don't support multiple dataframes? I think it would be good if we could send data separately from the spec and support multiple datasets.

Currently, our proof of concept is based on the streaming API, and we only send one DataFrame at a time with the update_dataframe method. This can be extended once we agree on the underlying mechanisms.
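
For reference, a minimal usage sketch of that streaming path; the spec below is illustrative, and the exact `update_dataframe` signature may differ:

```python
import pandas as pd
from vega.widget import VegaWidget

# A tiny Vega spec with a named, initially empty dataset.
spec = {
    "$schema": "https://vega.github.io/schema/vega/v5.json",
    "width": 300, "height": 300,
    "data": [{"name": "table"}],
    "scales": [
        {"name": "xs", "type": "linear", "range": "width",
         "domain": {"data": "table", "field": "x"}},
        {"name": "ys", "type": "linear", "range": "height",
         "domain": {"data": "table", "field": "y"}},
    ],
    "marks": [{
        "type": "symbol",
        "from": {"data": "table"},
        "encode": {"update": {
            "x": {"scale": "xs", "field": "x"},
            "y": {"scale": "ys", "field": "y"},
        }},
    }],
}

widget = VegaWidget(spec)    # display this in a notebook cell
df = pd.DataFrame({"x": range(1000), "y": range(1000)})
widget.update_dataframe(df)  # one DataFrame at a time through the columnar path
```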

> > We also provide some helpers for Altair, but we're not sure how to fully replace the standard Altair method for sending data to the browser with ours. It would boil down to detecting an Altair-generated JSON spec in the notebook, wrapping it in a VegaWidget, and calling update_dataframe on the Pandas DataFrame immediately afterwards. If you can do that, Altair would be boosted transparently and able to support much larger datasets.
>
> Better support specifically for Altair is great. Have you adopted the idea of separating data from specs that I implemented in https://github.com/streamlit/streamlit/blob/96f17464a13969ecdfe31043ab8e875718eb0d10/lib/streamlit/elements/altair.py#L315?

No. Thanks for pointing to this mechanism; I will see how we can use it with ours. Our examples use a similar but less flexible approach.

> > Additionally, each column can be compressed with either zlib or lz4, which boosts the transfer speed of columns.
>
> Does this have much benefit over transparent gzip compression over HTTP? How big is the overhead for compression/decompression, and how big are the additional copies we make when we compress data?

Yes, it does; see https://lemire.me/blog/2021/06/30/compressing-json-gzip-vs-zstd/
Basically, lz4 (from the same author as zstd) is faster than gzip to compress, even faster to decompress, and more efficient overall. See also the zstd page: https://facebook.github.io/zstd/

The compression "codecs" (such as lz4 or zlib) should be part of the library and pre-selected for casual users. If you know the distribution profile of your data columns, a specific codec can make a huge difference (see e.g. https://github.com/lemire/FastPFor). The standard gzip compression used in HTTP is not efficient and flexible enough to accommodate the characteristics of the data.
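
For a rough idea of the trade-off, a micro-benchmark along these lines can be run locally (the `lz4` PyPI package provides `lz4.frame`; absolute numbers depend heavily on the data):

```python
import time
import zlib

import lz4.frame  # pip install lz4
import numpy as np

# A 40 MB int32 column as a stand-in for real data.
rng = np.random.default_rng(42)
column = rng.integers(0, 1000, 10_000_000).astype("int32")
raw = column.tobytes()

for name, compress in [("zlib (HTTP gzip)", lambda b: zlib.compress(b, 6)),
                       ("lz4", lz4.frame.compress)]:
    start = time.perf_counter()
    out = compress(raw)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(out) / len(raw):.0%} of original size in {elapsed:.2f} s")
```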

@lgtm-com

lgtm-com bot commented Jul 1, 2021

This pull request fixes 6 alerts when merging 283fe75 into 7cda844 - view on LGTM.com

fixed alerts:

  • 6 for Unused import

@domoritz
Member

domoritz commented Jul 1, 2021

Thanks for the notes. I think I would personally still prefer Arrow since it encodes pretty efficiently and is well supported. It will make it a lot easier to maintain ipyvega in the future if we don't roll our own serialization.

@domoritz
Member

domoritz commented Jan 30, 2023

Can you merge master to make this pull request ready? In particular, we should not be updating Vega-Lite in this pull request as Altair is still on Vega-Lite 4 for now and I want to coordinate the update with them.

@domoritz
Member

Thank you for the pull request. I would love to get this in but it's not quite ready yet.

  • What is .gitmodules?
  • Address all the comments in this pull request.

@domoritz
Member

Thank you for making the updates @xtianpoli! Let me know when you are done and want me to make another review.

@jdfekete
Collaborator Author

All the comments have been addressed.
The PR speeds up the transfer of Pandas DataFrames and avoids copying Pandas data in most cases.
Vega/Altair can now be used with reasonably large DataFrames (100k-10M rows, depending on the columns).
We have added more tests, including end-to-end tests with Galata, to check that the rendering produced by the streaming API is identical to that of the classical API.
Notebooks saved with their widget states now show the visualization images when reloaded. Instead of serializing the whole dataset (which would be too large, and hard to compute correctly when the streaming API is used), the VegaWidget saves an image of the Vega rendering. The notebook is smaller and faster to load, and looks identical to the saved widget, with one minor difference in behavior: the user cannot interact with a saved cell because it is no longer a Vega object; it has to be recomputed from Python to become one again.
As promised, ipyvega can now be used with larger DataFrames and with streaming and progressive applications.

@jdfekete requested a review from domoritz on January 31, 2023
@domoritz
Member

I want to review this, but somehow my Python setup is botched and now I run into #418 (comment). Stay tuned.

@domoritz merged commit bd93616 into vega:master on Feb 12, 2023
@domoritz
Member

Thank you! I also added you as maintainers of this repo so you can triage issues in the issue tracker.
