Unexpected behavior in update()
#123
Comments
Good catch @dispanser
Thanks @dispanser for the detailed analysis log :)
@dispanser - re:
Status of write support is very early (we just have the bits you've seen already regarding the transaction log so far), and checkpointing is not implemented at all yet. See #106 for the issue tracking checkpointing.
I believe there's a subtle bug in the `update()` method, in how it decides between loading from a checkpoint and applying the log from json files.
Here's the important bits:
What this is doing is
I believe what we want instead is to use the json deltas if the new checkpoint is the
same as the previous one, and to use the checkpoint if it is newer than what we had
previously loaded.
Note that the final result, an up-to-date delta table, is still achieved, but the
method unnecessarily loads checkpoints or json deltas in either scenario, so it could be
more efficient.
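To make the intended rule concrete, here is a minimal sketch of the decision only. It is not the actual `update()` code: `CheckPoint`, `UpdatePlan`, and `plan_update` are hypothetical names made up for illustration, and I am assuming "newer" can be approximated by "differs from the checkpoint we loaded last time".

```rust
/// Hypothetical stand-in for the checkpoint metadata read from the log.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CheckPoint {
    version: i64,
}

/// Hypothetical description of what an update should do next.
#[derive(Debug, PartialEq, Eq)]
enum UpdatePlan {
    /// Restore state from the checkpoint, then replay json commits after it.
    RestoreCheckpoint { replay_from: i64 },
    /// Keep the current state and replay json commits after the loaded version.
    ApplyJson { replay_from: i64 },
}

/// Decide how to bring the table up to date.
///
/// `loaded_version` is the version already applied in memory,
/// `previous` is the checkpoint seen on the last load (if any),
/// `latest` is the checkpoint currently advertised by the log (if any).
fn plan_update(
    loaded_version: i64,
    previous: Option<CheckPoint>,
    latest: Option<CheckPoint>,
) -> UpdatePlan {
    match latest {
        // A checkpoint we have not loaded before: restore from it.
        Some(cp) if previous != Some(cp) => UpdatePlan::RestoreCheckpoint {
            replay_from: cp.version + 1,
        },
        // Same checkpoint as before, or no checkpoint at all:
        // continue from where we are using the json commits.
        _ => UpdatePlan::ApplyJson {
            replay_from: loaded_version + 1,
        },
    }
}
```

With this rule, a table already at version 12 whose latest checkpoint is still the one at version 10 continues with json from version 13, and a table whose log now advertises a checkpoint at version 20 restores that checkpoint instead of replaying json up to 20.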
I wasn't able to write a self-contained test in Rust, as I'm not sure about
the status of write support (and checkpointing in particular), but I validated
my assumptions by running a Spark shell session to generate commits and a Rust
session (with some `println!` sparkles) side by side, based on the following
"data generator" in Scala:
createCommits(5, "<some delta path>")
Init: loading the table, all good.
Expected behavior: start with our current version and apply new commits in sequence.
Expected behavior: load checkpoint, apply changes from there
Unexpected behavior: we already have read until version 12, but we reload from
checkpoint at version 10, applying more json than necessary (and a checkpoint
that does not help, either).
Unexpected behavior: we reload starting from the previous checkpoint, even though
no new commits were added to the delta log.
Unexpected behavior: despite having a checkpoint at 20, we use json all
the way up for versions 16 .. 20.
I believe the expected behavior can be achieved with a single-character change
in the logical expression. I'd gladly provide a pull request.