Remove HivePageSource, HiveColumnHandle and HivePartitionKey from Iceberg Connector #1655

lxynov · 2019-10-03T01:22:13Z

I'm working to replace HivePageSource, HiveColumnHandle and HivePartitionKey with IcebergPageSource, IcebergColumnHandle and IcebergPartitionKey in the Iceberg Connector. This looks like nontrivial work, so I'm creating this issue to track it and gather comments early.

Some benefits I can think of:

Fix bug in case of partition spec evolution
- Explanation: The current implementation doesn't seem to work in case of partition spec evolution. This is because
  - IcebergMetadata::getColumnHandles() uses the latest partition spec to determine whether a column is a partitioning column.
  - IcebergSplitSource::getPartitionKeys() uses the partition spec associated with each scan task to make the same determination.
  - Their results can differ once the partition spec has evolved, which makes the Iceberg Connector's call to buildColumnMappings() fail.
- In the new implementation, I would not include a columnType field in IcebergColumnHandle, so that IcebergSplitSource::getPartitionKeys() is the single source of truth.
Make schema evolution semantics conform to Iceberg's spec
- Iceberg's schema evolution is id-based. I'd include an icebergId field in IcebergColumnHandle and use it as the column "key" instead of the column name. Moreover, we can remove the nameToId map from IcebergSplit once we have Iceberg IDs in IcebergColumnHandle.
Simplify logic
- Some logic in these Hivexxx classes and their related functions is necessary for Hive but not for Iceberg, e.g., the logic to handle bucket evolution. It can be removed to reduce complexity.
Open opportunities to make Iceberg-specific improvements
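
To illustrate the partition spec evolution problem from the first benefit, here is a minimal sketch. It does not use the real Iceberg or Presto classes: a partition spec is modeled as a plain set of column names, and the class name, method name, and the columns region and event_date are all hypothetical. It only shows how the "latest spec" answer and the "per scan task spec" answer can disagree for the same column.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a partition spec is reduced to the set of column
// names it partitions by. This is not the Iceberg PartitionSpec API.
final class PartitionSpecEvolutionSketch
{
    // Returns true if the given spec treats the column as a partitioning column.
    static boolean isPartitionColumn(Set<String> specColumns, String column)
    {
        return specColumns.contains(column);
    }

    public static void main(String[] args)
    {
        // Spec in effect when an older data file was written.
        Set<String> specOfOldFile = new HashSet<>(Arrays.asList("region"));
        // Latest spec, after event_date was added as a partitioning column.
        Set<String> latestSpec = new HashSet<>(Arrays.asList("region", "event_date"));

        // Metadata-level view (latest spec) versus split-level view
        // (the spec attached to the scan task for the old file).
        boolean perMetadata = isPartitionColumn(latestSpec, "event_date");
        boolean perSplit = isPartitionColumn(specOfOldFile, "event_date");

        // The two views disagree for the old file.
        System.out.println(perMetadata + " vs " + perSplit);
    }
}
```

If the column handle carries no partition flag at all, only the per-split answer exists and this disagreement cannot arise.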
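
The id-based lookup from the second benefit could look roughly like the sketch below. This is an assumption about the shape of the new handle, not the merged implementation: the nested ColumnHandle class, the nameInFile helper, and the example columns are hypothetical, and the only detail taken from the proposal is that the handle carries an icebergId and uses it as the key.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch, not the actual Presto class: a column handle whose
// identity is the Iceberg field ID rather than the column name.
final class IcebergColumnHandleSketch
{
    static final class ColumnHandle
    {
        private final int icebergId; // stable across renames
        private final String name;   // current name; may change over time

        ColumnHandle(int icebergId, String name)
        {
            this.icebergId = icebergId;
            this.name = Objects.requireNonNull(name, "name is null");
        }

        int getIcebergId() { return icebergId; }
        String getName() { return name; }

        // Equality deliberately ignores the name, so a renamed column
        // still compares equal to its earlier handle.
        @Override
        public boolean equals(Object other)
        {
            return other instanceof ColumnHandle
                    && ((ColumnHandle) other).icebergId == icebergId;
        }

        @Override
        public int hashCode()
        {
            return Integer.hashCode(icebergId);
        }
    }

    // Looks up the name a data file used for this column, given that file's
    // schema as a map from field ID to name. Returns null if the file was
    // written before the column existed.
    static String nameInFile(ColumnHandle handle, Map<Integer, String> fileSchemaById)
    {
        return fileSchemaById.get(handle.getIcebergId());
    }

    public static void main(String[] args)
    {
        // Schema of an old data file, written before field 2 was renamed.
        Map<Integer, String> oldFileSchema = new LinkedHashMap<>();
        oldFileSchema.put(1, "id");
        oldFileSchema.put(2, "ts");

        // The current table schema calls field 2 "event_time".
        ColumnHandle current = new ColumnHandle(2, "event_time");

        // Resolution goes through the ID, so the old name is found.
        System.out.println(nameInFile(current, oldFileSchema));
    }
}
```

Because resolution goes through the ID, a separate name-to-ID map on the split is no longer needed for this lookup.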

lxynov mentioned this issue Oct 5, 2019

Introduce IcebergPageSource and IcebergColumnHandle #1675

Merged

lxynov closed this as completed Oct 24, 2019
