Spark: Add RewriteTablePath action interface #10920
Does everyone agree that CopyTable is a good name for this action? Since it's just rewriting metadata, manifest and position delete files with the new location prefix and not actually copying the table, maybe it's better to name it something more specific like |
Isn't an implementation of this interface actually copying the table?
The original PR explains the functionality of this action.
NO. I think it makes sense to go with a name like |
I'm a bit lost in the context. If I run this action without copying the table (e.g. rewrite |
@manuzhang Not really. The rewritten files are stored in a staging directory, so when you copy the table to the new location you copy all the data files plus the rewritten metadata files from the staging location and ignore the old metadata files. The table in the old location will not be corrupted 👍
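The rewrite described in this thread amounts to replacing a source location prefix with a target prefix in every absolute path the metadata records. A minimal sketch of that idea follows. It is not the implementation from this PR: the class name, the method names, and the example bucket paths are hypothetical, and it assumes both prefixes are given without a trailing slash.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only; not part of the Iceberg API.
final class PathPrefixRewriteSketch {

  private PathPrefixRewriteSketch() {}

  /** True if path equals prefix or lies under it (prefix has no trailing slash). */
  public static boolean isUnder(String path, String prefix) {
    return path.equals(prefix) || path.startsWith(prefix + "/");
  }

  /** Replaces sourcePrefix with targetPrefix at the start of an absolute path. */
  public static String rewritePath(String path, String sourcePrefix, String targetPrefix) {
    if (!isUnder(path, sourcePrefix)) {
      throw new IllegalArgumentException(
          "Path " + path + " is not under source prefix " + sourcePrefix);
    }
    return targetPrefix + path.substring(sourcePrefix.length());
  }

  /** Rewrites every path in the list; the input list is left unchanged. */
  public static List<String> rewriteAll(
      List<String> paths, String sourcePrefix, String targetPrefix) {
    return paths.stream()
        .map(p -> rewritePath(p, sourcePrefix, targetPrefix))
        .collect(Collectors.toList());
  }
}
```

Because the rewritten copies would be written to a separate staging location, the files under the source prefix are never modified, which is why the original table stays readable.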
@@ -70,4 +70,10 @@ default RewritePositionDeleteFiles rewritePositionDeletes(Table table) {
    throw new UnsupportedOperationException(
        this.getClass().getName() + " does not implement rewritePositionDeletes");
  }

  /** Instantiates an action to copy table. */
  default CopyTable copyTable(Table table) {
We need to rename here as well
import org.apache.iceberg.Table;

public interface RewriteTableLocation extends Action<RewriteTableLocation, RewriteTableLocation.Result> {
I'd try to avoid using the name TableLocation, which is a dedicated name for the location of the table as a table property. Given the action is to rewrite the file paths within all metadata files, is rewriteFilePath more suitable?
To be more specific, it's rewriteMetadataFilePath?
Mostly in metadata files. I think it also has to touch position/eq delete files, since they contain absolute file paths too.
@flyrain Yes, it does modify position files as well. RewriteFilePath implies that the method rewrites a single file. Maybe RewriteTablePath?
Yeah, the path rewritten could be a directory as well. I think RewriteTablePath is fine; other alternatives could be RewriteAbsolutePath or just RewritePath.
@amogh-jahagirdar, @nastra, @anuragmantri, you may be interested as well.
   * "00001-8893aa9e-f92e-4443-80e7-cfa42238a654.metadata.json".
   * @return this for method chaining
   */
  RewriteTablePath endVersion(String endVersion);
endVersion is not clear. What do you think about copyVersion?
We can use copyVersion if only a single version will be rewritten by the script. However, the original implementation included functionality to rewrite all versions between lastCopiedVersion and endVersion.
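To make the range semantics under discussion concrete, here is a small sketch of selecting the metadata versions between a start version and an end version from an ordered list. It is an illustration under stated assumptions, not the behavior of this PR: the class and method names are hypothetical, the start version is treated as exclusive (on the assumption that it was already copied), the end version is treated as inclusive, and a null start means "from the first version".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Iceberg API.
final class VersionRangeSketch {

  private VersionRangeSketch() {}

  /**
   * Returns the versions after startVersion (exclusive, assumed already copied)
   * up to and including endVersion. A null startVersion selects from the beginning.
   */
  public static List<String> versionsToRewrite(
      List<String> orderedVersions, String startVersion, String endVersion) {
    int from = 0;
    if (startVersion != null) {
      int startIdx = orderedVersions.indexOf(startVersion);
      if (startIdx < 0) {
        throw new IllegalArgumentException("Unknown start version: " + startVersion);
      }
      from = startIdx + 1; // skip the version that was already copied
    }
    int endIdx = orderedVersions.indexOf(endVersion);
    if (endIdx < 0) {
      throw new IllegalArgumentException("Unknown end version: " + endVersion);
    }
    if (from > endIdx + 1) {
      throw new IllegalArgumentException("End version precedes start version");
    }
    // Copy the sublist so callers do not hold a view of the input list.
    return new ArrayList<>(orderedVersions.subList(from, endIdx + 1));
  }
}
```

With start equal to end, the sketch returns an empty list, which corresponds to "nothing new to rewrite since the last copy".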
  RewriteTablePath stagingLocation(String stagingLocation);

  /**
   * Set the target table. It is optional if the start version is provided.
What is the start version here? Could you clarify where the start version is provided?
Start version is the same as lastCopiedVersion. Maybe we should name them startVersion and endVersion, or remove this and always rewrite the paths for the full table. @flyrain Do you remember why you added such a feature?
  }

  @Override
  public String dataFileListLocation() {
Nit: Should we clarify more by renaming them targetDataLocation, targetMetadataLocation and copiedVersion?
These are the locations of a .txt list of files (data/metadata) to copy, so I don't think the suggested names are suitable for them.
 */
package org.apache.iceberg.actions;

public class BaseRewriteTablePathActionResult implements RewriteTablePath.Result {
Other result classes use Immutables, so we might want to do the same for this one. You can take a look at BaseDeleteOrphanFiles for an example.
…into add-copy-table-action-interface # Conflicts: # api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
core/src/main/java/org/apache/iceberg/actions/BaseRewriteTablePath.java
…Path.java Fix typo Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>
A follow-up PR to #10024.