feat: read schema from parquet files in datafusion scans #1266
///
/// This will construct a schema derived from the parquet schema of the latest data file,
/// and fields for partition columns from the schema defined in table metadata.
pub async fn physical_arrow_schema(
I don't think we can guarantee this schema is consistent across all parquet files in the table; different writers may have written to the table with different physical types for timestamps. IMO this should be handled in the scan of each Parquet file. That is, we should cast the physical type to microsecond timestamps as needed.
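As a rough illustration of the per-file cast suggested above, the sketch below normalizes raw timestamp values to microseconds based on the unit each file was written with. It assumes nothing about the actual delta-rs or DataFusion code: `PhysicalUnit`, `to_micros`, and `normalize_column` are hypothetical names, and a real scan would operate on Arrow arrays rather than plain slices.

```rust
/// Time units a Parquet writer might have used for a timestamp column.
/// Hypothetical enum for illustration; this is not the arrow crate's `TimeUnit`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PhysicalUnit {
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Convert one raw timestamp value to microseconds since the Unix epoch.
/// Returns `None` if scaling milliseconds up would overflow an `i64`.
pub fn to_micros(value: i64, unit: PhysicalUnit) -> Option<i64> {
    match unit {
        PhysicalUnit::Millisecond => value.checked_mul(1_000),
        PhysicalUnit::Microsecond => Some(value),
        // Floor division, so values before 1970 round toward negative infinity.
        // This rounding choice is an assumption of the sketch.
        PhysicalUnit::Nanosecond => Some(value.div_euclid(1_000)),
    }
}

/// Normalize all values of one column read from a single file.
/// The whole column is rejected (`None`) if any value overflows.
pub fn normalize_column(values: &[i64], unit: PhysicalUnit) -> Option<Vec<i64>> {
    values.iter().map(|v| to_micros(*v, unit)).collect()
}
```

Applying this per file means the scan can expose a single microsecond schema even when individual files disagree on the physical unit.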
In PyArrow, we handle the int96 timestamp issue by passing an argument to the reader to coerce it to microsecond precision. Maybe we could implement something similar upstream?
delta-rs/python/tests/test_table_read.py
Line 32 in 34d43b6
parquet_read_options=ParquetReadOptions(coerce_int96_timestamp_unit="ms")
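For readers unfamiliar with what such a coercion involves, here is a minimal sketch of decoding a Parquet INT96 timestamp into microseconds. It relies on the usual INT96 layout (8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes holding the Julian day number). The function name `int96_to_micros` is hypothetical, and this is not how PyArrow or the Rust parquet crate implement it.

```rust
/// Julian day number of the Unix epoch (1970-01-01).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
/// Number of microseconds in one day.
const MICROS_PER_DAY: i64 = 86_400_000_000;

/// Decode a 12-byte INT96 timestamp into microseconds since the Unix epoch.
/// Sub-microsecond precision is truncated.
pub fn int96_to_micros(bytes: [u8; 12]) -> i64 {
    // First 8 bytes: nanoseconds elapsed within the day (little-endian).
    let mut nanos = [0u8; 8];
    nanos.copy_from_slice(&bytes[..8]);
    // Last 4 bytes: Julian day number (little-endian).
    let mut day = [0u8; 4];
    day.copy_from_slice(&bytes[8..]);

    let nanos_of_day = i64::from_le_bytes(nanos);
    let julian_day = i64::from(u32::from_le_bytes(day));

    (julian_day - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanos_of_day / 1_000
}
```

An INT96 value carries nanosecond precision, so coercing it to a coarser unit at read time is what keeps the resulting column compatible with a microsecond schema.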
There definitely are no guarantees as to the file schema being consistent. DataFusion, however, needs a consistent schema. Once we get into column mappings etc., things might get even more demanding and we may have to roll our own parquet scan, or rather start putting logic into our DeltaScan.
That said, I do believe using the schema from the latest file is an improvement over the current way, which at least for me fails for more or less every Databricks-written table where there are timestamps involved.
Not sure about the best way forward, but I'm happy to keep that logic on a private branch somewhere until we have a more general fix.
Somewhat related, so I already published it as WIP: in #1267 I did some work on the write command. My plan there was to use the same schema to validate writes. But there it would be even more confusing, since we might end up in a situation where writing the "official" schema of the table would not be permissible. Still, it feels very strange to me to have potentially many schemas in the same table.
I guess Spark must allow at least some flexibility in what schema it expects at write time, otherwise how would we end up in this discussion at all :D.
Yeah, we are definitely hitting the limits of DataFusion's scanner. I've created an issue upstream: apache/datafusion#5950
I'm fine with moving this forward; I mostly care that we have a more robust implementation in the future and have at least some momentum towards it.
Finished looking through. Just one other comment.
.files()
.iter()
.max_by_key(|obj| obj.modification_time)
.ok_or(DeltaTableError::Generic("No active file actions to get physical schema. Maybe the current state has not yet been loaded?".into()))?
Does this error propagate to the user? Does this mean that trying to scan an empty table leads to an error? I don't think it should.
It does. At the time I thought we could fail on a scan if no files have been added yet, but you are right: there are several valid scenarios where we have no files in a table and still should be able to do a scan.
Fixed it so we are falling back to the schema from metadata.
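A minimal sketch of the fallback described here, assuming the selection logic boils down to "latest file if any, otherwise table metadata". `AddFile`, `SchemaSource`, and `pick_schema_source` are hypothetical stand-ins and do not mirror the actual types or code in the PR.

```rust
/// Hypothetical stand-in for an add action, holding only the fields used here.
#[derive(Debug, Clone, PartialEq)]
pub struct AddFile {
    pub path: String,
    pub modification_time: i64,
}

/// Where the scan should take its schema from.
#[derive(Debug, PartialEq)]
pub enum SchemaSource<'a> {
    /// Read the physical schema from this (most recently modified) file.
    LatestFile(&'a AddFile),
    /// No active files: use the schema recorded in the table metadata.
    TableMetadata,
}

/// Pick the most recently modified file, or fall back to the table metadata
/// when the table has no active files, instead of returning an error.
pub fn pick_schema_source(files: &[AddFile]) -> SchemaSource<'_> {
    files
        .iter()
        .max_by_key(|f| f.modification_time)
        .map(SchemaSource::LatestFile)
        .unwrap_or(SchemaSource::TableMetadata)
}
```

With this shape, scanning an empty table takes the `TableMetadata` branch rather than surfacing an error to the user.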
Co-authored-by: Will Jones <willjones127@gmail.com>
Description
This PR updates table scans with DataFusion to read the file schema from the parquet file within the latest add action of the table. This works around issues where the schema we derive from metadata does not match the data in the parquet files, e.g. nanosecond vs. microsecond timestamps.
We also update the `Load` command to handle column selections and make it more consistent with the other operations.
Related Issue(s)
closes #441
Documentation