Spark 3.3 write to branch snapshot #6651
Why is this no longer valid? I think that we do not want to write to a specific snapshot. Is branch somehow passed as the snapshot ID?
After looking into this a bit more, I think this is incorrect. The `snapshotId` is set when the table is loaded using time travel syntax. I don't think that we want to allow that.
@rdblue Can we add more checks so that if the snapshot ID is the tip of the branch, writing to the branch is supported? If it's the tip of the branch, then the Spark write should be supported.
I believe when we do `spark...save(table);` we are calling `catalog.loadTable(ident)` in DataFrameWriter. When passing `spark..option("branch","..")`, the snapshotId() is getting set.
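The check proposed above could look roughly like the following. This is only a minimal, self-contained sketch: `BranchRefs` and `BranchWriteCheck` are hypothetical stand-ins invented for illustration, not Iceberg or Spark classes, and the real condition in the PR may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for table metadata: maps branch names to head snapshot IDs.
final class BranchRefs {
  private final Map<String, Long> heads = new HashMap<>();

  void setHead(String branch, long snapshotId) {
    heads.put(branch, snapshotId);
  }

  // Returns the head snapshot ID of the branch, or null if the branch is unknown.
  Long head(String branch) {
    return heads.get(branch);
  }
}

// Hypothetical helper, not part of the PR.
final class BranchWriteCheck {
  private BranchWriteCheck() {}

  // A write is allowed when no snapshot is pinned on the loaded table, or when
  // the pinned snapshot is exactly the current head (tip) of the target branch.
  static boolean canWrite(BranchRefs refs, Long pinnedSnapshotId, String branch) {
    if (pinnedSnapshotId == null) {
      return true;
    }
    Long head = branch == null ? null : refs.head(branch);
    return head != null && head.equals(pinnedSnapshotId);
  }
}
```

Under this sketch, a table loaded with an older snapshot of the branch (plain time travel) would still be rejected for writes, while a table loaded at the branch tip would be accepted.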
Looks like this isn't an issue. I reverted this change and ran `TestSparkDataWrite` and everything passes. Let's revert this and run CI. If there are other issues outside of that test class, I'll take a look.
@rdblue @amogh-jahagirdar if the bug fix for reading by snapshot ref (#6717) gets merged, then writing to a branch snapshot will fail, as shown by the tests in TestDeleteFrom.java. That's because of the above condition. I feel we have to tweak the condition if it is going to stay.
Actually it seems the issue is that `catalog.loadTable(table)` interprets the branch option as the branch read option. Both are called "branch", and because we have to load the table before doing the write, it can't differentiate whether the option is meant for a write or not. Couldn't we just have a different config name when doing writes?
@namrathamyske Yeah, just updated to use the name `write-branch` and tests are passing. The issue is that the name 'branch' is used for both read and write options, and when loadTable is performed as part of the write, it treats it as a time travel. We should disambiguate the two. I think we should actually call it something else for the write case. `write-branch` kinda sounds odd to me tbh, maybe we go with `toBranch`. toBranch would be consistent with what's in the API and what's being done in the Flink PR. But we don't necessarily need to have parity there; we should use whatever is the Spark convention for naming and makes sense for users. @aokolnychyi @rdblue any suggestions there?
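To make the name collision concrete, here is a small sketch of why a separate write key avoids the accidental time travel. The key names `branch` and `write-branch` come from this thread; the `OptionRouting` class and its two methods are hypothetical and only model the load step and the write step, not the actual Spark or Iceberg code paths.

```java
import java.util.Map;

// Hypothetical model of how options could be routed; not the PR's implementation.
final class OptionRouting {
  static final String READ_BRANCH = "branch";        // read (time travel) option
  static final String WRITE_BRANCH = "write-branch"; // candidate write option name

  private OptionRouting() {}

  // Models the load step: only the read key turns the load into a time-travel load.
  static boolean loadsAsTimeTravel(Map<String, String> options) {
    return options.containsKey(READ_BRANCH);
  }

  // Models the write step: returns the target branch, or null when none is requested.
  static String writeTarget(Map<String, String> options) {
    return options.get(WRITE_BRANCH);
  }
}
```

With a shared `branch` key the load step pins a snapshot even though the caller only wanted to write; with a distinct key the load step is unaffected and the write step still sees its target.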
But I think we can't disregard calling loadTable with respect to the ref passed. In the future, when we implement session configs for testing `INSERT` and `DELETE` operations, there is a lot of overlap between read and write. Spark logical plans call the `SparkScanBuilder`:
iceberg/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java
Lines 260, 424, and 393 in 32a8ef5
Good point @namrathamyske, I was a bit short-sighted. We actually do want to leverage the statistics for the specific snapshot for writes. These statistics would be used during the scan itself (for example, MERGE INTO branch). So either we 1.) seek a good way to differentiate between a time travel query, where the write shouldn't be applied, and an intentional write on a branch, or 2.) we just relax the check that a snapshot is set, as you did earlier.
@rdblue @amogh-jahagirdar @jackye1995 this is still an open item before this PR can get merged. I would prefer to go with the second option. But let me know otherwise!