Adding `add_files_overwrite` method #810

enkidulan · 2024-06-11T06:49:42Z

Issue: #809

~~Blockers: #569 (this is what this PR is based on)~~

sungwy

It's very cool to see that others are finding the new add_files API useful 😄

I left a comment - there's an open PR that I think we should base this implementation on.

sungwy · 2024-06-11T13:19:39Z

pyiceberg/table/__init__.py

+        """
+        if self._table.name_mapping() is None:
+            self.set_properties(**{TableProperties.DEFAULT_NAME_MAPPING: self._table.schema().name_mapping.model_dump_json()})
+        with self.update_snapshot(snapshot_properties=snapshot_properties).overwrite() as update_snapshot:


There's an open PR to implement partial deletes that I think we should leverage for this API.

Similar to the proposed implementation of overwrite, instead of calling overwrite() I think we'd want to invoke self.delete() with the overwrite_filter to delete all or partial data and rewrite them, and then add the data files.

I see, thanks @syun64 . I'll rebase my branch then and use self.delete(). However, we are also interested in keeping the previous data, so we keep history not changed even on overwriting.

@syun64 I've rebased on Fokko:fd-add-ability-to-delete-full-data-files branch and used delete function in add_files_overwrite. The tests I added are passing, but it made it hard to review this PR, probably makes sense to put it on pause until fokko's PR is merged. I'll still be pushing more updates to tests.

As for add_files_overwrite, at this point, I'm interested only in overwriting all, so I'm not adding the overwrite_filter argument like the overwrite method has. But I can add overwrite_filter argument if you think it makes sense.

I think you will still be able to achieve that by setting the filter as ALWAYS_TRUE

My opinion on this is that when we introduce a new API, we should try to make it serve a more generic use case (overwriting based on a filter) versus a very specific use case (overwriting all data), especially given that there's already a great similar API pattern in overwrite. I think mimicking the inputs there will make the API feel more consistent for other users as well.

Another option we can take to support your specific use case, if the community decides that we don't want to add a new API for add_files_overwrite, is to just build your own transaction like below:

with tbl.transaction() as txn: txn.delete(delete_filter=ALWAYS_TRUE) txn.add_files(file_paths=file_paths)

This will achieve your desired outcome in a single atomic transaction to the iceberg table without needing to add a new API to the repository

enkidulan · 2024-07-25T14:52:05Z

@sungwy #569 was merged last week, so I've updated this PR. Can you glance over it and say if adding add_files_overwrite makes sense to you?

sungwy · 2024-07-31T19:25:03Z

Hi @enkidulan - thank you very much for putting in the time to write up this PR. I'm very appreciative of the work and the level of interest you have in the new API add_files

I gave this a bit more thought, and I think with any new Public APIs we introduce to the repository, we carefully consider whether the value added by adding a new API outweighs the cost of maintaining it. A new public API means we have a new public function that we need to support for our community to ensure that the feature remains backward compatible.

While we could argue that the feature is simple enough since it's an extension of two already public facing APIs, I believe that that also leads to the counterargument, that if the function is so simple and can be achieved as a combination of existing public APIs and have no difference in the functionality I question why we would want to create the work of maintaining a public API.

       with self.transaction() as tx:
          tx.delete(delete_filter=overwrite_filter)
          tx.add_files(file_paths=file_paths, snapshot_properties=snapshot_properties)

What do you think? Are you able to achieve your desired functional outcome without the proposed API?

enkidulan · 2024-07-31T20:30:01Z

Thanks for getting back to me @sungwy . I agree with your point of view and respect the decision. I started working on this feature before #569 was introduced, and back then add_files_overwrite made much more sense.
I updated this PR after #569 was merged because it seemed like a logical extension for the provided shortcuts. There are append and overwrite methods, so it felt like having add_files_overwrite shortcut in addition to add_files method would make sense, too.
I'm closing the PR and the related issue, as the features provided in #569 are sufficient for what I need.

enkidulan mentioned this pull request Jun 11, 2024

Add overwrite support for add_files #809

Closed

sungwy reviewed Jun 11, 2024

View reviewed changes

enkidulan force-pushed the add-files-overwrite branch from 2d17fc4 to 09193eb Compare June 12, 2024 01:36

enkidulan force-pushed the add-files-overwrite branch 2 times, most recently from bb9a4d1 to 6201cb9 Compare July 24, 2024 05:45

Adding add_files_overwrite method

a9d8a1a

enkidulan force-pushed the add-files-overwrite branch from 6201cb9 to a9d8a1a Compare July 25, 2024 05:36

Adding add_files_overwrite method

8927424

enkidulan marked this pull request as ready for review July 25, 2024 14:48

enkidulan closed this Jul 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding `add_files_overwrite` method #810

Adding `add_files_overwrite` method #810

enkidulan commented Jun 11, 2024 •

edited

Loading

sungwy left a comment

sungwy Jun 11, 2024

enkidulan Jun 11, 2024

enkidulan Jun 12, 2024

sungwy Jun 13, 2024

enkidulan commented Jul 25, 2024

sungwy commented Jul 31, 2024

enkidulan commented Jul 31, 2024

Adding add_files_overwrite method #810

Adding add_files_overwrite method #810

Conversation

enkidulan commented Jun 11, 2024 • edited Loading

sungwy left a comment

Choose a reason for hiding this comment

sungwy Jun 11, 2024

Choose a reason for hiding this comment

enkidulan Jun 11, 2024

Choose a reason for hiding this comment

enkidulan Jun 12, 2024

Choose a reason for hiding this comment

sungwy Jun 13, 2024

Choose a reason for hiding this comment

enkidulan commented Jul 25, 2024

sungwy commented Jul 31, 2024

enkidulan commented Jul 31, 2024

Adding `add_files_overwrite` method #810

Adding `add_files_overwrite` method #810

enkidulan commented Jun 11, 2024 •

edited

Loading