Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark: Add RewriteTablePath action interface #10920

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

laithalzyoud
Copy link

@laithalzyoud laithalzyoud commented Aug 12, 2024

A follow up PR on #10024

@laithalzyoud laithalzyoud changed the title Add CopyTable action interface Spark: Add CopyTable action interface Aug 12, 2024
@laithalzyoud
Copy link
Author

Does everyone agree that CopyTable is a good name for this action? Since it's just rewriting metadata, manifest and position delete files with the new location prefix and not actually copying the table, maybe it's better to name it something more specific like RewriteTableLocation or something similar

@manuzhang
Copy link
Contributor

Since it's just rewriting metadata, manifest and position delete files with the new location prefix and not actually copying the table

Isn't an implementation of this interface actually copying the table?

@laithalzyoud
Copy link
Author

Since it's just rewriting metadata, manifest and position delete files with the new location prefix and not actually copying the table

Isn't an implementation of this interface actually copying the table?

There is an explanation in the original PR that explains the functionality of this action

@flyrain
Copy link
Contributor

flyrain commented Aug 12, 2024

cc @szehon-ho @huaxingao

@flyrain
Copy link
Contributor

flyrain commented Aug 12, 2024

Isn't an implementation of this interface actually copying the table?

NO. I think it makes sense to go with a name like RewriteFileLocation, or RewriteFilePaths

@manuzhang
Copy link
Contributor

I'm a bit lost in the context. If I run this action without copying the table (e.g. rewrite s3 to gcs), isn't it in a corrupt status?

@laithalzyoud
Copy link
Author

I'm a bit lost in the context. If I run this action without copying the table (e.g. rewrite s3 to gcs), isn't it in a corrupt status?

@manuzhang Not really, since the rewritten files are stored in a staging directory, when you copy the new table to the new location you copy all the data files + the rewritten metadata files from the staging location and ignore the old metadata files. So the table in the old location will not be corrupted 👍

@laithalzyoud laithalzyoud changed the title Spark: Add CopyTable action interface Spark: Add RewriteTableLocaiton action interface Aug 13, 2024
@laithalzyoud laithalzyoud changed the title Spark: Add RewriteTableLocaiton action interface Spark: Add RewriteTableLocation action interface Aug 13, 2024
@@ -70,4 +70,10 @@ default RewritePositionDeleteFiles rewritePositionDeletes(Table table) {
throw new UnsupportedOperationException(
this.getClass().getName() + " does not implement rewritePositionDeletes");
}

/** Instantiates an action to copy table. */
default CopyTable copyTable(Table table) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to rename here as well


import org.apache.iceberg.Table;

public interface RewriteTableLocation extends Action<RewriteTableLocation, RewriteTableLocation.Result> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd try to avoid using the name TableLocation, which is a dedicated name for the location of the table as a table property. Given the action is to rewrite the file paths within all metadata files, is rewriteFilePath more suitable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more specific, it's rewriteMetadataFilePath?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly in metadata files. I think it has to touch position/eq delete files as well as they have the absolute file paths as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@flyrain Yes it does modify position files as well. RewriteFilePath implies that the method rewrites a single file. Maybe RewriteTablePath?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the path rewritten could be a directory as well. I think RewriteTablePath is fine, other alternative cuold be RewriteAbsolutePath or Just RewritePath.

@laithalzyoud laithalzyoud changed the title Spark: Add RewriteTableLocation action interface Spark: Add RewriteTablePath action interface Aug 18, 2024
@flyrain
Copy link
Contributor

flyrain commented Aug 21, 2024

@amogh-jahagirdar @nastra, @anuragmantri, you may be interested as well.

* "00001-8893aa9e-f92e-4443-80e7-cfa42238a654.metadata.json".
* @return this for method chaining
*/
RewriteTablePath endVersion(String endVersion);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

endVersion is not clear. What do you think about copyVersion?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use copyVersion if only a single version will be rewritten by the script, however the original implementation included functionality to rewrite all version between lastCopiedVersion and endVersion

RewriteTablePath stagingLocation(String stagingLocation);

/**
* Set the target table. It is optional if the start version is provided.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is start version here? Could you clarify where start version is provided?

Copy link
Author

@laithalzyoud laithalzyoud Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Start version is the same as lastCopiedVersion, maybe we should name them as startVersion and endVersion, or remove this and always rewrite the paths for the full table. @flyrain Do you remember why you added such a feature?

}

@Override
public String dataFileListLocation() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Should we clarify more by renaming them targetDataLocation, targetMetadataLocation and copiedVersion ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the location for a .txt list of files (data/metadata) to copy, so I don't think the suggested names are suitable for them

*/
package org.apache.iceberg.actions;

public class BaseRewriteTablePathActionResult implements RewriteTablePath.Result {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other result classes use Immutables so we might want to do the same for this one. You can take a look at BaseDeleteOrphanFiles for example

Laith Alzyoud added 2 commits September 16, 2024 16:56
…into add-copy-table-action-interface

# Conflicts:
#	api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
…Path.java


Fix typo

Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants