Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding add_files_overwrite method #810

Closed
wants to merge 2 commits into from

Conversation

enkidulan
Copy link
Contributor

@enkidulan enkidulan commented Jun 11, 2024

Issue: #809

Blockers: #569 (this is what this PR is based on)

Copy link
Collaborator

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very cool to see that others are finding the new add_files API useful 😄

I left a comment - there's an open PR that I think we should base this implementation on.

"""
if self._table.name_mapping() is None:
self.set_properties(**{TableProperties.DEFAULT_NAME_MAPPING: self._table.schema().name_mapping.model_dump_json()})
with self.update_snapshot(snapshot_properties=snapshot_properties).overwrite() as update_snapshot:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an open PR to implement partial deletes that I think we should leverage for this API.

Similar to the proposed implementation of overwrite, instead of calling overwrite() I think we'd want to invoke self.delete() with the overwrite_filter to delete all or partial data and rewrite them, and then add the data files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks @syun64 . I'll rebase my branch then and use self.delete(). However, we are also interested in keeping the previous data, so we keep history not changed even on overwriting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@syun64 I've rebased on Fokko:fd-add-ability-to-delete-full-data-files branch and used delete function in add_files_overwrite. The tests I added are passing, but it made it hard to review this PR, probably makes sense to put it on pause until fokko's PR is merged. I'll still be pushing more updates to tests.

As for add_files_overwrite, at this point, I'm interested only in overwriting all, so I'm not adding the overwrite_filter argument like the overwrite method has. But I can add overwrite_filter argument if you think it makes sense.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you will still be able to achieve that by setting the filter as ALWAYS_TRUE

My opinion on this is that when we introduce a new API, we should try to make it serve a more generic use case (overwriting based on a filter) versus a very specific use case (overwriting all data), especially given that there's already a great similar API pattern in overwrite. I think mimicking the inputs there will make the API feel more consistent for other users as well.

Another option we can take to support your specific use case, if the community decides that we don't want to add a new API for add_files_overwrite, is to just build your own transaction like below:

with tbl.transaction() as txn:
    txn.delete(delete_filter=ALWAYS_TRUE)
    txn.add_files(file_paths=file_paths)

This will achieve your desired outcome in a single atomic transaction to the iceberg table without needing to add a new API to the repository

@enkidulan enkidulan force-pushed the add-files-overwrite branch from 2d17fc4 to 09193eb Compare June 12, 2024 01:36
@enkidulan enkidulan force-pushed the add-files-overwrite branch 2 times, most recently from bb9a4d1 to 6201cb9 Compare July 24, 2024 05:45
@enkidulan enkidulan force-pushed the add-files-overwrite branch from 6201cb9 to a9d8a1a Compare July 25, 2024 05:36
@enkidulan enkidulan marked this pull request as ready for review July 25, 2024 14:48
@enkidulan
Copy link
Contributor Author

@sungwy #569 was merged last week, so I've updated this PR. Can you glance over it and say if adding add_files_overwrite makes sense to you?

@sungwy
Copy link
Collaborator

sungwy commented Jul 31, 2024

Hi @enkidulan - thank you very much for putting in the time to write up this PR. I'm very appreciative of the work and the level of interest you have in the new API add_files

I gave this a bit more thought, and I think with any new Public APIs we introduce to the repository, we carefully consider whether the value added by adding a new API outweighs the cost of maintaining it. A new public API means we have a new public function that we need to support for our community to ensure that the feature remains backward compatible.

While we could argue that the feature is simple enough since it's an extension of two already public facing APIs, I believe that that also leads to the counterargument, that if the function is so simple and can be achieved as a combination of existing public APIs and have no difference in the functionality I question why we would want to create the work of maintaining a public API.

       with self.transaction() as tx:
          tx.delete(delete_filter=overwrite_filter)
          tx.add_files(file_paths=file_paths, snapshot_properties=snapshot_properties)

What do you think? Are you able to achieve your desired functional outcome without the proposed API?

@enkidulan
Copy link
Contributor Author

Thanks for getting back to me @sungwy . I agree with your point of view and respect the decision. I started working on this feature before #569 was introduced, and back then add_files_overwrite made much more sense.
I updated this PR after #569 was merged because it seemed like a logical extension for the provided shortcuts. There are append and overwrite methods, so it felt like having add_files_overwrite shortcut in addition to add_files method would make sense, too.
I'm closing the PR and the related issue, as the features provided in #569 are sufficient for what I need.

@enkidulan enkidulan closed this Jul 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants