feat(python): Add DataFrame.write_iceberg
#15018
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##             main   #15018   +/-   ##
=======================================
  Coverage   79.88%   79.89%
=======================================
  Files        1513     1513
  Lines      203466   203471    +5
  Branches     2892     2893    +1
=======================================
+ Hits       162546   162555    +9
+ Misses      40372    40369    -3
+ Partials      548      547    -1

☔ View full report in Codecov by Sentry.
Waiting for the pyiceberg 0.6.1 release (should be out soon), which will fix the schema issue described above.
@kevinliu-stripe fyi 0.6.1 has been released now :) https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.6.1
Force-pushed from 9bff9b8 to 0801012
Thanks for the PR. A few comments.
@@ -3834,6 +3835,32 @@ def unpack_table_name(name: str) -> tuple[str | None, str | None, str]:
        msg = f"unrecognised connection type {connection!r}"
        raise TypeError(msg)

    def write_iceberg(
        self,
        table: pyiceberg.table.Table,
Why does this method take a Table and not a Pathlike, like other write functions?
Writing to an Iceberg table requires creating the pyiceberg.table.Table object, which in turn requires an Iceberg catalog.
I think it's best to delegate the creation of the catalog and table to the caller and just pass in the table object.
The original implementation took a str/PathLike representing the "warehouse location" for the Iceberg table. In the Iceberg world, writing to a table usually requires an update to the catalog.
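To make the division of responsibility concrete, here is a stdlib-only sketch of the caller-side flow described above. `InMemoryCatalog` and `IcebergTable` are hypothetical stand-ins for pyiceberg's catalog and `pyiceberg.table.Table`; they only model the point being made: the table object already carries its catalog state, so the write function needs nothing but the table.

```python
# Hypothetical stand-ins for pyiceberg's Catalog / Table (not the real API).
from dataclasses import dataclass, field


@dataclass
class IcebergTable:
    name: str
    catalog: "InMemoryCatalog"
    rows: list = field(default_factory=list)

    def append(self, data):
        self.rows.extend(data)
        # Every write also commits metadata back to the catalog.
        self.catalog.commits.append((self.name, "append"))

    def overwrite(self, data):
        self.rows = list(data)
        self.catalog.commits.append((self.name, "overwrite"))


@dataclass
class InMemoryCatalog:
    commits: list = field(default_factory=list)

    def create_table(self, name: str) -> IcebergTable:
        return IcebergTable(name, self)


# The caller creates the catalog and table; a write_iceberg-style
# function would only ever see `table`.
catalog = InMemoryCatalog()
table = catalog.create_table("default.demo")
table.append([{"foo": 1}])
print(table.rows)       # [{'foo': 1}]
print(catalog.commits)  # [('default.demo', 'append')]
```

Because the catalog reference travels inside the table object, no extra path or catalog parameter is needed at the write site.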
        data = self.to_arrow()

        if mode == "append":
            table.append(data)
        else:
            table.overwrite(data)
This logic is so simple that this does not warrant its own method, in my opinion.
The main goal is to implement the top-level write_iceberg function for the DataFrame API. The actual implementation code can be moved somewhere else.
The point is that, if users already have an Iceberg table object, they can just write table.append(df.to_arrow()). It's even shorter than df.write_iceberg(table, mode="append"). So there is not much added value to a write_iceberg method.
If there is some complex logic required to set up the Iceberg table, or if to_arrow is not sufficient to handle all data types correctly, we can consider adding a write_iceberg to save all users some implementation hassle. But right now it doesn't seem warranted.
That is true. PyIceberg is well integrated with Arrow. With an Arrow table and a PyIceberg table object, one can just invoke the PyIceberg write functions .append/.overwrite.
Given the above, is there still value in implementing a simple write_iceberg method?
Looking at the write_delta method, aside from the merge functionality, the function just passes the data as Arrow into the write_deltalake function.
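The thin-delegation pattern described here (DataFrame → Arrow → external writer) can be sketched generically. `write_deltalake_stub` and the `DataFrame` class below are hypothetical stdlib stand-ins for `deltalake.write_deltalake` and `pl.DataFrame`; the shape of the hand-off is the point, not the real APIs.

```python
# Records what the "external writer" was given, keyed by target location.
sink = {}


def write_deltalake_stub(target: str, data, *, mode: str) -> None:
    # Stand-in for deltalake.write_deltalake (hypothetical signature).
    if mode == "append":
        sink.setdefault(target, []).extend(data)
    else:
        sink[target] = list(data)


class DataFrame:
    # Stand-in for pl.DataFrame.
    def __init__(self, rows):
        self._rows = rows

    def to_arrow(self):
        # Real polars returns a pyarrow.Table; a plain list stands in here.
        return list(self._rows)

    def write_delta(self, target: str, *, mode: str) -> None:
        # The method body is just a pass-through, as noted above.
        write_deltalake_stub(target, self.to_arrow(), mode=mode)


df = DataFrame([{"foo": 1}, {"foo": 2}])
df.write_delta("s3://bucket/tbl", mode="append")
print(sink["s3://bucket/tbl"])  # [{'foo': 1}, {'foo': 2}]
```

A write_iceberg following the same pattern would replace the target string with a table object and dispatch to .append/.overwrite.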
"If there is some complex logic required to set up the iceberg table"
There's a scenario where a user might want to write an Iceberg table to a location in blob storage. In that case, the write_iceberg function can take care of creating an in-memory catalog and an Iceberg table object before writing. That was the initial version of this PR (0801012).
Seems like a write_iceberg would be worth it just by virtue of having full support from Polars / parity with Delta.
DataFrame.write_iceberg
df = pl.DataFrame(
    {
        "foo": [1, 2, 3, 4, 5],
        "bar": [6, 7, 8, 9, 10],
        "ham": ["a", "b", "c", "d", "e"],
    }
)
This needs much more extensive testing. Try using the df fixture (it has a mix of many data types), or define your own dataframe with lots of different data types.
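A sketch of the kind of mixed-dtype fixture the reviewer is asking for. Plain Python values stand in for the columns here; in the real test each column would exercise a distinct Polars dtype (including nulls) on the Arrow round trip.

```python
# Hypothetical fixture data; in the real test this would be pl.DataFrame(fixture).
import datetime as dt

fixture = {
    "int_col": [1, 2, None],
    "float_col": [1.5, float("nan"), 3.0],
    "str_col": ["a", "", None],
    "bool_col": [True, False, None],
    "date_col": [dt.date(2024, 1, 1), dt.date(2024, 1, 2), None],
    "datetime_col": [dt.datetime(2024, 1, 1, 12, 0), None, None],
}

# All columns must have equal length to form a valid frame.
lengths = {len(v) for v in fixture.values()}
print(lengths)  # {3}
```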
Thanks! I'll take a look at this fixture
See comments. Needs testing on more different data types before this can be merged.
This is great. Excited to try this out!
Excited to try this one out!
Force-pushed from 1fef815 to 62a923a
Hey @stinodego, thanks for the previous review, could you take another look at this PR? I've changed the test to use the df fixture.
Hey @kevinjqliu, pyiceberg now supports partitions and hidden partitions. The overwrite method supports an overwrite filter parameter, similar to delete_where on the Delta writer.
@mariotaddeucci thanks for the pointer. I want to get this simple implementation in ASAP and add the partitioned write later.
Hey, really looking forward to this feature. Any reason progress has stopped / merging is blocked? @stinodego
Resolves #14610

This PR introduces the write_iceberg function to the DataFrame API. write_iceberg requires a pyiceberg.table.Table object and a mode (either append or overwrite).

Note: partitioned write is currently not supported (blocked on iceberg-python #208).

Constructing the Iceberg catalog and Table object is left outside of this function. The test (test_write_iceberg) provides an example of creating an in-memory catalog and an Iceberg Table object.

Example
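The example itself did not survive the page capture. The following stdlib-only sketch mirrors the mode dispatch this PR describes; `FakeTable` is a hypothetical stand-in for `pyiceberg.table.Table`, and the free function stands in for the DataFrame method.

```python
class FakeTable:
    # Stand-in for pyiceberg.table.Table, exposing the two writer
    # methods the PR dispatches to.
    def __init__(self):
        self.rows = []

    def append(self, data):
        self.rows.extend(data)

    def overwrite(self, data):
        self.rows = list(data)


def write_iceberg(data, table, *, mode: str) -> None:
    # Validate mode, then delegate to the table's writer.
    if mode == "append":
        table.append(data)
    elif mode == "overwrite":
        table.overwrite(data)
    else:
        msg = f"mode must be 'append' or 'overwrite', got {mode!r}"
        raise ValueError(msg)


t = FakeTable()
write_iceberg([1, 2], t, mode="append")
write_iceberg([3], t, mode="append")
print(t.rows)  # [1, 2, 3]
write_iceberg([9], t, mode="overwrite")
print(t.rows)  # [9]
```

In the real method, `data` would be the result of `self.to_arrow()` rather than a plain list.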
Files created