
Python: Support writer protocol V2 in write_deltalake #575

Closed
1 of 2 tasks
wjones127 opened this issue Mar 23, 2022 · 3 comments · Fixed by #834
Labels
enhancement (New feature or request)

Comments

@wjones127
Collaborator

wjones127 commented Mar 23, 2022

Description

Most Delta tables now use writer protocol V2 by default, so our writer isn't yet compatible with tables created by most other systems.

Two features are needed for writer protocol V2:

- Enforce the delta.appendOnly table property (reject non-append writes to append-only tables); a rough sketch follows below.
- Enforce column invariants defined in the table schema.
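
A rough sketch of what the appendOnly half could look like (purely illustrative; check_append_only and the mode argument mirror write_deltalake's existing parameters but are not final names):

def check_append_only(configuration: dict, mode: str) -> None:
    # delta.appendOnly is stored as a string table property ("true"/"false").
    append_only = configuration.get("delta.appendOnly", "false").lower() == "true"
    if append_only and mode != "append":
        raise ValueError(
            f"Table is append-only (delta.appendOnly=true); mode='{mode}' is not allowed."
        )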

Related Issue(s)
Umbrella issue: #542

wjones127 added the enhancement (New feature or request) label on Mar 23, 2022
@PadenZach
Contributor

@GraemeCliffe-inspirato Are you still interested in taking the append-only part of this issue? If not, I may be able to look into it. If so, I'll hold off and see if there are other good first issues for me on this :)

@WarSame
Contributor

WarSame commented Apr 18, 2022

@PadenZach I'm still interested in the appendOnly section! Thank you for checking! I've just now gotten the project building and the tests running, so I'm just starting on the issue.

@wjones127
Collaborator Author

My initial guess at what needs to be done for invariants is something like this:

from typing import Iterator
from pyarrow import RecordBatch

# ...inside of write_deltalake
def iter_batches(data, invariants) -> Iterator[RecordBatch]:
    for batch in data:
        # Check every invariant expression before letting the batch through.
        for sql_clause in invariants:
            res = execute(sql_clause, batch)  # execute() is an open question, see below
            if not res:
                raise ValueError(f"Invariant violated: {sql_clause}")
        yield batch

invariants = configuration["delta.invariants"]
data = convertToRecordBatchReader(data)

batch_iter = iter_batches(data, invariants)
# ... pass batch_iter to write_dataset

The hard part is that we don't have a SQL parser in the deltalake package, so I'm not sure how that execute() function would work. One option is to enable the datafusion feature of delta-rs in the Python bindings (which I suspect we'll do eventually anyway) and then implement execute() in Rust using DataFusion. Moving the record batch temporarily into Rust with zero copy should be possible (python-datafusion does this), but it might take a little glue code.
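
As a stopgap, execute() could also be prototyped in pure Python with the datafusion package. This is only a sketch under the assumption that registering each batch as a small in-memory table is acceptable; the exact API (SessionContext vs. ExecutionContext, register_record_batches) varies between datafusion-python releases:

import pyarrow as pa
from datafusion import SessionContext  # older releases expose ExecutionContext instead

def execute(sql_clause: str, batch: pa.RecordBatch) -> bool:
    """Return True if every row in `batch` satisfies the invariant expression."""
    ctx = SessionContext()
    # Register the batch as a single-partition table named "batch".
    ctx.register_record_batches("batch", [[batch]])
    result = ctx.sql(
        f"SELECT COUNT(*) AS violations FROM batch WHERE NOT ({sql_clause})"
    ).collect()
    return result[0].column(0)[0].as_py() == 0

Spinning up a context per batch is obviously wasteful; the Rust/DataFusion route described above avoids that and keeps the SQL handling in one place.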

It's probably worth researching what typical invariants are used and allowed by existing engines. The spec is very vague, but it's likely the Spark implementation has a limited set of column types and operations we need to care about.

wjones127 added a commit that referenced this issue Sep 28, 2022
# Description

Adds support for retrieving invariants from the Delta schema, plus a
`DeltaDataChecker` struct that uses DataFusion to check them and report
useful errors.

This also hooks it up to the Python bindings, allowing
`write_deltalake()` to support Writer Protocol V2.
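
For illustration, the intended end-user experience once this is in (a sketch; the table URI and data are placeholders):

import pyarrow as pa
from deltalake import write_deltalake

data = pa.table({"x": [4, 5, 6]})
# Against a table at writer protocol V2, appends are checked against any
# column invariants and rejected with a descriptive error if they violate them.
write_deltalake("path/to/table", data, mode="append")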

I looked briefly at the Rust writer, but then realized we don't want to
introduce a dependency on DataFusion. We should discuss how we want to
design that API. I suspect we'll turn DeltaDataChecker into a trait, so
we can have a DataFusion one available but also allow other engines to
implement it themselves if they don't wish to use DataFusion.

# Related Issue(s)

- closes #592
- closes #575

# Documentation


https://github.com/delta-io/delta/blob/master/PROTOCOL.md#column-invariants
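
For quick reference, the linked protocol section stores an invariant as JSON under the delta.invariants key of a column's metadata. A small sketch (using a pyarrow field purely for illustration; in a real table this lives in the schemaString of the Delta log's metaData action):

import json
import pyarrow as pa

# The invariant expression "x > 3" serialized the way PROTOCOL.md describes.
invariant = json.dumps({"expression": {"expression": "x > 3"}})
field = pa.field("x", pa.int64(), metadata={"delta.invariants": invariant})
print(pa.schema([field]).field("x").metadata)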