I'm working on replacing HivePageSource, HiveColumnHandle, and HivePartitionKey with IcebergPageSource, IcebergColumnHandle, and IcebergPartitionKey in the Iceberg connector. This looks like nontrivial work, so I'd like to create this issue to track it and gather comments early. Some benefits I can think of:
Explanation: It seems that the current implementation doesn't work when the partition spec has evolved. This is because:
IcebergMetadata::getColumnHandles() uses the latest partition spec to determine whether a column is a partitioning column.
IcebergSplitSource::getPartitionKeys() uses the partition spec associated with each scan task to make the same determination.
Their results are not identical when the partition spec has evolved, which causes the Iceberg connector's call to buildColumnMappings() to fail.
In the new implementation, I would not include a columnType field in IcebergColumnHandle, so that IcebergSplitSource::getPartitionKeys() can be the single source of truth.
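To make the idea concrete, here is a minimal sketch of deciding partition columns per split instead of storing that decision on the column handle. Every name in it (PartitionSpecSketch, ColumnHandle, SplitSpec, partitionColumns) is hypothetical and only for illustration; it is not the connector's actual code, and a real partition spec carries more than a set of source column IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these types are the connector's real classes.
final class PartitionSpecSketch {
    // A column handle that deliberately carries no "partition vs. regular" flag.
    static final class ColumnHandle {
        final int id;
        final String name;

        ColumnHandle(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Stand-in for the partition spec attached to one scan task:
    // only the set of source column IDs that the spec partitions on.
    static final class SplitSpec {
        final Set<Integer> partitionSourceIds;

        SplitSpec(Integer... ids) {
            this.partitionSourceIds = new HashSet<>(Arrays.asList(ids));
        }
    }

    // Decide, for one split, which requested columns are partition columns.
    // The answer depends on the split's own spec, not on the latest table spec.
    static List<String> partitionColumns(SplitSpec spec, List<ColumnHandle> columns) {
        List<String> result = new ArrayList<>();
        for (ColumnHandle column : columns) {
            if (spec.partitionSourceIds.contains(column.id)) {
                result.add(column.name);
            }
        }
        return result;
    }
}
```

With this shape, two splits written under different specs can give different answers for the same column handle, and there is no second answer stored elsewhere that could disagree.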
Make schema evolution semantics conform to Iceberg's spec
Iceberg's schema evolution is ID-based. I would include an icebergId field in IcebergColumnHandle and use it as the column "key" instead of the column name. Moreover, we can remove the nameToId map from IcebergSplit once IcebergColumnHandle carries Iceberg IDs.
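A rough sketch of what keying a column by its Iceberg ID could look like is below. The class IdBasedLookupSketch, its nested ColumnHandle, and the positionInFile helper are assumptions made for illustration; only the icebergId field name comes from the proposal above.

```java
import java.util.Map;

// Hypothetical sketch; not the proposed IcebergColumnHandle itself.
final class IdBasedLookupSketch {
    static final class ColumnHandle {
        final int icebergId;
        final String name;

        ColumnHandle(int icebergId, String name) {
            this.icebergId = icebergId;
            this.name = name;
        }

        // Identity is the Iceberg field ID; the name is only a display label.
        @Override
        public boolean equals(Object other) {
            return other instanceof ColumnHandle
                    && ((ColumnHandle) other).icebergId == icebergId;
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(icebergId);
        }
    }

    // Resolve a column against one data file by field ID.
    // Returns null when the file predates the column (no such field ID).
    static Integer positionInFile(ColumnHandle column, Map<Integer, Integer> fileFieldIdToPosition) {
        return fileFieldIdToPosition.get(column.icebergId);
    }
}
```

Because the lookup never touches the name, a renamed column still resolves to the same field in older data files, and no per-split name-to-ID map is needed.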
Simplify logic
Some logic in these Hivexxx classes and related functions is necessary for Hive but not for Iceberg, e.g., the logic that handles bucket evolution. It can be removed to reduce complexity.
Open opportunities to make Iceberg-specific improvements