Sanitize timestamps in arrow tables #3076
Thanks for pushing this @jonmmease! It would be interesting to see if we can slowly phase in the dataframe interchange protocol for all dataframe-alikes (and simultaneously phase out the dependency on pandas). Can you move the Would it also be possible to include a test including a By the way, on Windows the test is not passing. It gives the following error: ArrowInvalid: Cannot locate timezone 'UTC': Timezone database not found at D:\Users\Hoek\Downloads\tzdata See also apache/arrow#31472 and apache/arrow#35637 (comment). That probably requires some thought about how we approach this.
Done in 58f11ba
Done in 9c6ea83
Yeah, that's an interesting possibility. I'm a bit leery of adding a required dependency on pyarrow since it's not available everywhere pandas is (e.g. pyodide). But the pandas project itself is currently debating whether to take on pyarrow as a required dependency anyway, so that might be the future direction the ecosystem goes: pandas-dev/pandas#52509, pandas-dev/pandas#52711
I agree. I think it is unfortunate that, for now, we have to rely on pyarrow to get support for the current implementation of the dataframe interchange protocol. If there is a WASM-friendly option to support the dataframe interchange protocol, I would favor that one.
Given that this is a known upstream pyarrow issue, are you comfortable with this being merged?
Yes, not a blocker from my side. I also noticed that WASM support for pyarrow is actively being worked on, see pyodide/pyodide#2933, which is positive. I haven't had the chance to test this PR, which I would like to do, so please give me a few more days.
Sure thing. Thanks for taking a look. I was thinking that if the test doesn't work on Windows, maybe we should skip it when the current platform is Windows.
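The platform-based skip suggested above could look roughly like the sketch below. It uses the standard library's `unittest` only to stay self-contained; the test class, the `ON_WINDOWS` flag, and the placeholder test body are hypothetical, and the skip logic actually added in this PR is not shown in this thread.

```python
import sys
import unittest
from datetime import datetime, timezone

# Hypothetical flag: True when running on Windows, where the thread reports
# that pyarrow cannot locate the timezone database.
ON_WINDOWS = sys.platform.startswith("win")


class TimestampSanitizationTest(unittest.TestCase):
    @unittest.skipIf(ON_WINDOWS, "timezone database not found on Windows")
    def test_utc_timestamp(self):
        # Placeholder body standing in for the real Arrow timestamp test.
        stamp = datetime(2023, 1, 1, tzinfo=timezone.utc)
        self.assertEqual(stamp.isoformat(), "2023-01-01T00:00:00+00:00")


def run_suite():
    """Run the test case and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(TimestampSanitizationTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

On Windows the test is reported as skipped rather than failed, so the suite still counts as successful on every platform.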
Getting this Can you also have a look at this comment, which states:
Is this something we have to consider here? Docs for duration: https://arrow.apache.org/docs/python/generated/pyarrow.duration.html. Also related,
import altair as alt
import pandas as pd
import pyarrow as pa
from altair.utils.data import to_values
td = pd.timedelta_range(0, periods=3, freq='h')
df = pd.DataFrame(td).reset_index()
df.columns = ['id', 'timedelta']
pa_table = pa.table(df)
values = to_values(pa_table)
values
alt.Chart(pa_table).mark_bar().encode(
    x='timedelta:O',
    y='id:O'
)
See also #974
Thanks for taking a look! For timedelta / duration, there isn't an equivalent Vega type, so I think the best we can do is raise a more informative error. I'll update the sanitize arrow function to do this.
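As a rough illustration of the "more informative error" idea, the sketch below rejects duration values up front. It is not the PR's implementation: `reject_durations`, its plain dict-of-lists input, and the error wording are all hypothetical, and it works on Python `timedelta` objects rather than on Arrow column types.

```python
from datetime import timedelta


def reject_durations(columns):
    """Raise a descriptive error if any column holds timedelta values.

    `columns` is a hypothetical dict mapping column names to lists of
    Python values; it stands in for an Arrow table in this sketch.
    """
    for name, values in columns.items():
        if any(isinstance(v, timedelta) for v in values):
            raise ValueError(
                f"Column {name!r} contains duration (timedelta) values, "
                "which have no equivalent Vega type."
            )
    return columns
```

Failing early with the column name in the message tells the user which field to convert, instead of surfacing a serialization error later.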
Will do, thanks for the references
Changes made. @mattijn could you try out the Windows test skipping logic on your machine?
Quick @jonmmease! Yes, it is working as intended. Nice! Once tests on GA pass as well we can merge this in👍.
Thanks for the review @mattijn, tests are passing. Merging!
…ndas and into Polars. As of now this requires an unreleased change to Vega-Altair (which we pull from Git): vega/altair#3076 The `EncounterLag` chart is still on Pandas because it uses its `wide_to_long` function that doesn't exist in Polars. Also `MolecularCurrentDeltas` is looking funny, fix later
Closes #3050 by converting timestamp columns in Arrow tables to ISO-8601 strings before JSON serialization.
I added a test, though it doesn't look like we had any existing test coverage for the DataFrame interchange protocol support.
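A minimal sketch of the idea of converting timestamps to ISO-8601 strings before JSON serialization, using only the standard library. It assumes the table has already been turned into a plain dict of Python lists (for example via pyarrow's `Table.to_pydict()`); the helper name `timestamps_to_iso` is hypothetical, and the PR's actual code, which operates on Arrow tables, is not shown here.

```python
import json
from datetime import datetime, timezone


def timestamps_to_iso(columns):
    """Return a copy of `columns` with datetime values as ISO-8601 strings.

    `columns` is a hypothetical dict mapping column names to lists of
    Python values; non-datetime values (including None) pass through.
    """
    return {
        name: [v.isoformat() if isinstance(v, datetime) else v for v in values]
        for name, values in columns.items()
    }


columns = {
    "id": [0, 1],
    "when": [datetime(2023, 1, 1, tzinfo=timezone.utc), None],
}

# json.dumps(columns) would raise TypeError because datetime objects are
# not JSON serializable; the sanitized copy serializes cleanly.
payload = json.dumps(timestamps_to_iso(columns))
```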
cc @djouallah