When partitioned, partition might lose the missingness eltype (in Tables.schema) #3298
Comments
I will add it. Note that currently it is assumed that you use
I will open #3299 in a few minutes fixing this:
(note that you should not use
(at least in DataFrames.jl - without checking
Thanks for the tip! The parent was merely an observation / quick fix. The call I think the design relies on this statement in the docs for Tables.partitions:
But, unfortunately, the information gets lost if users pre-partition their DataFrame based on row chunks (in the pre-1.5.0 world) with Tables.rows.
EDIT: Basically, the intermediate call to "Tables.columns" before Arrow serializes the data is what causes all this pain: when DataFrameRows are passed to it, it just materializes them as they are (and the parent schema is lost).
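One way to avoid losing the parent schema, sketched here as an assumption rather than as the fix adopted in this thread, is to partition the row indices and take views of the parent DataFrame instead of partitioning its rows. The column name `a` and the chunk size are made up for illustration.

```julia
using DataFrames, Tables

# Hypothetical data: the column allows missing, but the first chunk has none.
df = DataFrame(a = ["x", "y", missing, "z"])

# Partition the row indices, then view the parent table for each index range.
row_ranges = Iterators.partition(1:nrow(df), 2)
parts = [view(df, rows, :) for rows in row_ranges]

# Each view reports the element types of the parent's columns.
schemas = [Tables.schema(p) for p in parts]
foreach(println, schemas)
```

Because each part is a view, its columns keep the element type of the parent column, so every chunk reports the same schema regardless of which values it happens to contain.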
Problem: If a user partitions a DataFrame with Iterators.partition(Tables.rows(df), 2), the corresponding Tables.schema for each partition will not be type-stable. E.g., if the first partition does not contain any missing values, it will not have the Missing type in its schema, even though the overall vector allows missingness.
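The narrowing itself can be illustrated with plain vectors. This is a library-free sketch of the mechanism only, not the actual Tables.jl or DataFrames.jl code path; `map(identity, ...)` stands in for any step that rebuilds a column from its values.

```julia
# A column that allows missing values; the first two entries happen to be present.
col = Union{Missing, String}["a", "b", missing, "d"]

# Partitioning a vector yields views, which keep the parent's element type.
chunks = collect(Iterators.partition(col, 2))
@show eltype(chunks[1])

# Rebuilding the first chunk from its values infers the type from what is observed.
rebuilt = map(identity, chunks[1])
@show eltype(rebuilt)
```

The view still reports `Union{Missing, String}`, while the rebuilt chunk is a `Vector{String}`: the declared type is replaced by whatever the values in that chunk happen to be.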
Why it's a problem: Arrow.write determines the schema from the first record batch, on the assumption that Tables-compatible sources retain the parent's schema information even for partitions. In effect, missing values could be lost and replaced by empty values of the concrete type (e.g., "" for String).
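To see why taking the schema from the first batch is fragile, the following library-free sketch compares the type seen in the first batch with the type needed to cover all batches. It does not call Arrow.write; the batches are hypothetical stand-ins built from a plain vector.

```julia
col = Union{Missing, String}["a", "b", missing, "d"]

# Hypothetical batches: each chunk is rebuilt from its values,
# so it only carries the types actually observed in that chunk.
batches = [map(identity, chunk) for chunk in Iterators.partition(col, 2)]

# Type taken from the first batch only.
first_batch_type = eltype(first(batches))

# Type that covers every batch.
merged_type = Union{eltype.(batches)...}

@show first_batch_type merged_type
@show missing isa first_batch_type
@show missing isa merged_type
```

A writer that fixes the column type from the first batch has no slot for the `missing` that shows up in the second one, which is the situation described above.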
Example
This might be expected; I'm not sure what the Tables interface requires. I'm opening this issue for visibility. The original issue is apache/arrow-julia#403, because that's where the downstream error happens.
Versioninfo: