Down at the code level (the root cause of why REST differs from Hadoop/Hive): two methods seemingly do similar things when resetting the main branch, but one clears the snapshotLog history while the other does not. The Hadoop/Hive catalogs only execute TableMetadata$Builder::resetMainBranch on replace-table calls, while the REST catalog (if the server depends on the TableMetadata class from the open source release, as the reference implementation of RESTCatalogAdapter over JdbcCatalog does) additionally runs TableMetadata$Builder::removeRef on replace-table calls.
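To make the comment above concrete, here is a minimal, hypothetical sketch of the two code paths. The class below is not Iceberg's real TableMetadata$Builder; it only models the observed difference described in this issue: resetMainBranch leaves the snapshot log intact, while removeRef clears it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the two builder methods; names mirror TableMetadata$Builder
// but the bodies are simplified to show only the snapshot-log behavior.
class MetadataBuilderSketch {
  final List<String> snapshotLog = new ArrayList<>(List.of("snap-1", "snap-2"));
  String mainBranchRef = "snap-2";

  // Hive/Hadoop replace path: reset the main branch, keep history.
  void resetMainBranch() {
    mainBranchRef = null; // the replace transaction reassigns the ref later
    // snapshotLog is left untouched
  }

  // REST reference-implementation path: RemoveSnapshotRef -> removeRef,
  // which (per TableMetadata.java#L1250) also clears the snapshot log.
  void removeRef(String ref) {
    if (ref.equals(mainBranchRef)) {
      mainBranchRef = null;
      snapshotLog.clear(); // <-- the discrepancy: history is lost here
    }
  }
}

public class Demo {
  public static void main(String[] args) {
    MetadataBuilderSketch hive = new MetadataBuilderSketch();
    hive.resetMainBranch();
    System.out.println("after resetMainBranch: " + hive.snapshotLog.size()); // 2

    MetadataBuilderSketch rest = new MetadataBuilderSketch();
    rest.removeRef("snap-2");
    System.out.println("after removeRef: " + rest.snapshotLog.size()); // 0
  }
}
```

This is why the same CREATE OR REPLACE command preserves history under one catalog and drops it under the other, even though both paths "reset" the main branch.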
Query engine
Spark
Question
Background
This is another Hive/Hadoop vs. REST Catalog behavior discrepancy discovered while enabling integration tests on the REST catalog. The assumption here is that all existing Spark integration tests should pass as-is on the REST Catalog, just as they pass with the Hive & Hadoop Catalogs, because conceptually Spark expects the same behavior from Iceberg no matter which catalog type is used. Reference issue: #11079
What
https://github.com/apache/iceberg/blob/main/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMetadataTables.java#L813
This integration test passes on the Hive & Hadoop Catalogs, but does not pass on the REST Catalog reference implementation (RESTCatalogAdapter over JdbcCatalog).
Why
The root cause is that the REST Catalog reference implementation, compared to the Hive & Hadoop Catalogs, runs extra logic on the server side when responding to a CREATE OR REPLACE ${table} Spark command. A CREATE OR REPLACE ${table} command triggers a RemoveSnapshotRef change (an implementation of MetadataUpdate) that is sent, within an UpdateTableRequest, to the REST Catalog server. When the reference implementation server processes this change, it runs the removeRef method while building the replacement metadata, and within that method the snapshot log is cleared: https://github.com/apache/iceberg/blob/113c6e7/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1250. The Hive and Hadoop Catalogs, however, do not run that method when building replacement metadata. As a result, running CREATE OR REPLACE ${table} against the REST Catalog fails the test's validation that the snapshot log is kept intact after the CREATE OR REPLACE command.

Questions

1. Should snapshot history be kept intact after a table replacement call, based on the spec definition, no matter which catalog type is used?
2. If snapshot-log should get cleared on table replacement, does the snapshots list itself need to be cleared as well?
3. Should the answers to the two questions above vary by catalog type (i.e., are these catalog implementation details rather than a spec issue)? Under that reading, the Hive & Hadoop catalogs are doing the right thing by not clearing snapshot-log or snapshots on table replacement, and since REST catalog implementations are not controlled by this repo, each may freely choose whether or not to clear snapshot-log and snapshots on replacement. In that case, should we change the REST Catalog reference implementation (RESTCatalogAdapter over JdbcCatalog) so that its behavior matches the Hive & Hadoop Catalogs (snapshot history is not cleared on replacement)? Or should we change the integration test so that it does not check that snapshot history is intact after table replacement?