Python: Support writer protocol V2 in write_deltalake #575
Comments
@GraemeCliffe-inspirato Are you still interested in taking the append-only part of this issue? If not, I may be able to look into it. If so, I'll hold off and see if there are other good first issues for me on this :)
@PadenZach I'm still interested in the appendOnly section! Thank you for checking! I have just now got the project building and tests running, so I'm just starting on the issue.
My initial guess on what needs to be done on invariants is something like this:

    # ...inside of write_deltalake
    def iter_batches(data, invariants) -> Iterator[RecordBatch]:
        for batch in data:
            for sql_clause in invariants:
                res = execute(sql_clause, batch)
                if res != True:
                    raise ValueError("Invariant violated: ...")
            yield batch

    invariants = configuration["delta.invariants"]
    data = convertToRecordBatchReader(data)
    batch_iter = iter_batches(data, invariants)
    # ... pass batch_iter to write_dataset

The hard part is we don't have a SQL parser in the deltalake package, so I'm not sure how that would work.

It's probably worth researching what typical invariants are used and allowed by existing engines. The spec is very vague, but it's likely the Spark implementation has a limited set of column types and operations we need to care about.
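To make the "limited set of operations" idea concrete, here is a minimal sketch of the generator above with a tiny hand-rolled clause matcher instead of a SQL parser. Everything in it is an assumption for illustration: a batch is modelled as a plain dict of column name to list of values (a stand-in for a `RecordBatch`), `check_invariant` is a hypothetical helper, and the two supported clause shapes (`col <op> number` and `col IS NOT NULL`) are a guess, not what Spark or any other engine actually allows.

```python
import operator
import re
from typing import Any, Dict, Iterable, Iterator, List

# Hypothetical stand-in for a RecordBatch: column name -> list of values.
Batch = Dict[str, List[Any]]

_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    "!=": operator.ne,
}

# Only two clause shapes are recognised; anything else is rejected.
_COMPARE = re.compile(r"^\s*(\w+)\s*(>=|<=|!=|>|<|=)\s*(-?\d+(?:\.\d+)?)\s*$")
_NOT_NULL = re.compile(r"^\s*(\w+)\s+IS\s+NOT\s+NULL\s*$", re.IGNORECASE)


def check_invariant(clause: str, batch: Batch) -> bool:
    """Return True if every row in the batch satisfies the clause."""
    m = _NOT_NULL.match(clause)
    if m:
        return all(v is not None for v in batch[m.group(1)])
    m = _COMPARE.match(clause)
    if m:
        column, op, literal = m.groups()
        compare = _OPS[op]
        # Nulls are treated as violations here; a real engine may differ.
        return all(
            v is not None and compare(v, float(literal)) for v in batch[column]
        )
    raise NotImplementedError(f"Unsupported invariant: {clause!r}")


def iter_batches(data: Iterable[Batch], invariants: List[str]) -> Iterator[Batch]:
    """Yield batches unchanged, raising on the first violated invariant."""
    for batch in data:
        for clause in invariants:
            if not check_invariant(clause, batch):
                raise ValueError(f"Invariant violated: {clause}")
        yield batch
```

Because the check happens lazily inside the generator, a violation surfaces while the writer is consuming batches, so any files already written for earlier batches would still need to be cleaned up or left uncommitted.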
# Description
Adds support to retrieve invariants from the Delta schema and also a struct `DeltaDataChecker` to use DataFusion to check them and report useful errors. This also hooks it up to the Python bindings, allowing `write_deltalake()` to support Writer Protocol V2.

I looked briefly at the Rust writer, but then realized we don't want to introduce a dependency on DataFusion. We should discuss how we want to design that API. I suspect we'll turn `DeltaDataChecker` into a trait, so we can have a DataFusion one available but also allow other engines to implement it themselves if they don't wish to use DataFusion.

# Related Issue(s)
- closes #592
- closes #575

# Documentation
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#column-invariants
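For anyone following along, here is a rough Python sketch of what "retrieve invariants from the Delta schema" could look like. It assumes, based on my reading of the linked PROTOCOL.md section, that each invariant lives in a column's metadata under the `delta.invariants` key as a JSON-encoded string whose SQL text sits at `expression.expression`. The function name `get_invariants` is hypothetical and this is not the code from the PR.

```python
import json
from typing import Dict, List


def get_invariants(schema_string: str) -> Dict[str, str]:
    """Map dotted column paths to their invariant SQL expressions."""
    found: Dict[str, str] = {}

    def visit(fields: List[dict], prefix: str) -> None:
        for field in fields:
            path = f"{prefix}{field['name']}"
            raw = field.get("metadata", {}).get("delta.invariants")
            if raw is not None:
                # The metadata value is itself a JSON-encoded string.
                found[path] = json.loads(raw)["expression"]["expression"]
            field_type = field["type"]
            # Recurse into nested structs; arrays and maps are skipped here.
            if isinstance(field_type, dict) and field_type.get("type") == "struct":
                visit(field_type["fields"], f"{path}.")

    visit(json.loads(schema_string)["fields"], "")
    return found
```

The resulting mapping is what a checker (DataFusion-backed or otherwise) would then evaluate against each batch before it is written.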
Description
Most Delta tables are now V2 by default, so the writer isn't yet compatible with most systems.
Two features are needed:
- delta.appendOnly: just need to check the Delta config, check the mode parameter, and throw an error if needed. (Python API - delta.appendOnly enforcement #590)
- Column invariants. (enforce_invariant() function #592)
Related Issue(s)
Umbrella issue: #542
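As a rough illustration of the delta.appendOnly check described above, something along these lines might be enough. This is a sketch under assumptions: `check_append_only` is a hypothetical helper name, the table configuration is taken to be a string-to-string mapping, and treating every mode other than `"append"` as an error is a simplification that may be stricter than needed.

```python
from typing import Mapping


def check_append_only(configuration: Mapping[str, str], mode: str) -> None:
    """Raise if the table is append-only and the write mode is not an append."""
    # Table properties are assumed to be stored as strings, e.g. "true" / "false".
    append_only = configuration.get("delta.appendOnly", "false").lower() == "true"
    if append_only and mode != "append":
        raise ValueError(
            "Table has delta.appendOnly set to true, "
            f"so mode must be 'append' (got {mode!r})."
        )
```

Calling this at the top of `write_deltalake()`, before any data files are written, would keep a rejected write from leaving stray files behind.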