Support SHALLOW CLONE of Iceberg Tables #1522

Closed
wants to merge 4 commits into from

jackierwzhang
Contributor

Overview

As a follow-up to the SHALLOW CLONE support for Delta Lake, it would be great to enable SHALLOW CLONE on Iceberg tables as well. This is a CLONVERT (CLONE + CONVERT) operation: in a single transaction, we create a Delta catalog table whose files point to the original Iceberg table.

Motivation

  1. It allows users to quickly experiment with Delta Lake without modifying the original Iceberg table's data.
  2. It simplifies the user flow by combining a Delta catalog table creation with an Iceberg conversion.

Further details

Similar to SHALLOW CLONE, it will work as follows:

  1. Clone an Iceberg catalog table (after the setup here)
CREATE TABLE [IF NOT EXISTS] delta SHALLOW CLONE iceberg.db.table [TBLPROPERTIES clause] [LOCATION path]
  2. Clone a path-based Iceberg table
CREATE TABLE [IF NOT EXISTS] delta SHALLOW CLONE iceberg.`/path/to/iceberg/table` [TBLPROPERTIES clause] [LOCATION path]
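
For illustration, the two forms above might be invoked as follows. This is only a sketch of the proposed syntax: the target table names, the `iceberg_catalog` catalog name, the paths, and the property value are all hypothetical, and the first statement assumes an Iceberg catalog has already been registered in the Spark session under that name.

```sql
-- Form 1: clone an Iceberg catalog table.
-- Assumes a catalog named `iceberg_catalog` (hypothetical) is configured.
CREATE TABLE IF NOT EXISTS delta_events
SHALLOW CLONE iceberg_catalog.db.events
TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 30 days');

-- Form 2: clone a path-based Iceberg table.
-- Both paths are placeholders; LOCATION sets where the new Delta table lives.
CREATE TABLE IF NOT EXISTS delta_events_by_path
SHALLOW CLONE iceberg.`/data/warehouse/db/events`
LOCATION '/data/delta/events_clone';
```

In both cases the new Delta table only references the Iceberg table's existing data files, so the source table's data is not copied or modified.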

How was this patch tested?

New unit tests.

Does this PR introduce any user-facing changes?

No.

// The source relation can be an Iceberg table in form of `catalog.db.table` so we visit
// a multipart identifier instead of TableIdentifier (which does not support 3L namespace)
// in Spark 3.3.
val sourceRelation = new UnresolvedRelation(visitMultipartIdentifier(ctx.source))
Collaborator

Is this something that can be fixed once Delta upgrades to Spark 3.4? If yes, add a comment.

Contributor Author

Will add.

// the existing files and the newly added files.
val cloneSourceTable = sourceTbl match {
case source: CloneIcebergSource =>
// Reuse the existing schema so that the physical name of columns are consistent between
Collaborator

Sorry, this is not clear. Why do we have to use the schema of the existing table when replacing it? Isn't shallow clone just referencing the existing data files in the Iceberg table? What does "existing files" mean here?

Contributor Author

Yes, but if you run REPLACE on an existing Delta table, then, since Iceberg uses column mapping, we have to make sure the column mapping metadata matches during conversion.
