Remove HivePageSource, HiveColumnHandle and HivePartitionKey from Iceberg Connector #1655

Closed
lxynov opened this issue Oct 3, 2019 · 0 comments

lxynov commented Oct 3, 2019

I'm working on replacing HivePageSource, HiveColumnHandle and HivePartitionKey with IcebergPageSource, IcebergColumnHandle and IcebergPartitionKey in the Iceberg connector. This looks like nontrivial work, so I'm creating this issue to track it and gather feedback early.

Some benefits I can think of:

  • Fix bug in case of partition spec evolution
    • Explanation: The current implementation does not seem to work when the partition spec has evolved. This is because
      • IcebergMetadata::getColumnHandles() uses the latest partition spec to determine whether a column is a partitioning column.
      • IcebergSplitSource::getPartitionKeys() uses the partition spec associated with each scan task to make the same determination.
      • The two results can differ after partition spec evolution, which causes the Iceberg connector's call to buildColumnMappings() to fail.
    • In the new implementation, I would omit the columnType field from IcebergColumnHandle, so that IcebergSplitSource::getPartitionKeys() is the single source of truth.
  • Make schema evolution semantics conform to Iceberg's spec
    • Iceberg's schema evolution is ID-based. I would include an icebergId field in IcebergColumnHandle and use it as the column "key" instead of the column name. Moreover, we can remove the nameToId map from IcebergSplit once IcebergColumnHandle carries Iceberg IDs.
  • Simplify logic
    • Some logic in these Hivexxx classes and their related functions is necessary for Hive but not for Iceberg, e.g., the logic that handles bucket evolution. It can be removed to reduce complexity.
  • Open opportunities to make Iceberg-specific improvements
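
To illustrate the ID-based column key proposed above, here is a minimal sketch. It is not the connector's actual code: ColumnHandleSketch, IcebergColumnKeySketch and resolveInFile are hypothetical names, and a data file's schema is modeled as a plain map from Iceberg field ID to the column name stored in that file. The point is only that a lookup by ID still finds a column after a rename, while a lookup by name does not.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the proposed IcebergColumnHandle.
// Only the two fields relevant to the lookup are modeled here.
final class ColumnHandleSketch {
    private final int icebergId; // field ID assigned by Iceberg; stable across renames
    private final String name;   // current column name in the table schema

    ColumnHandleSketch(int icebergId, String name) {
        this.icebergId = icebergId;
        this.name = name;
    }

    int getIcebergId() {
        return icebergId;
    }

    String getName() {
        return name;
    }
}

final class IcebergColumnKeySketch {
    // Resolve a requested column against a data file's schema by Iceberg ID.
    // Returns the column name as written in the file, or null if the file
    // has no field with that ID (e.g. it was written before the column existed).
    static String resolveInFile(ColumnHandleSketch column, Map<Integer, String> fileSchemaById) {
        return fileSchemaById.get(column.getIcebergId());
    }

    public static void main(String[] args) {
        // Schema of an old data file, written before column 2 was renamed.
        Map<Integer, String> oldFileById = new HashMap<>();
        oldFileById.put(1, "id");
        oldFileById.put(2, "ts");

        // The same file schema, indexed by name instead of ID.
        Map<String, Integer> oldFileByName = new HashMap<>();
        oldFileByName.put("id", 1);
        oldFileByName.put("ts", 2);

        // In the current table schema, column 2 is now called "event_time".
        ColumnHandleSketch renamed = new ColumnHandleSketch(2, "event_time");

        // ID-based lookup finds the column under its old name.
        System.out.println(resolveInFile(renamed, oldFileById)); // prints "ts"

        // Name-based lookup misses it, because the file only knows "ts".
        System.out.println(oldFileByName.get(renamed.getName())); // prints "null"
    }
}
```

With the ID carried on the handle itself, a separate name-to-ID map on the split would no longer be needed for this lookup.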