Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement TableProviderFactory for a IcebergTableFactory #586

Closed
matthewmturner opened this issue Aug 26, 2024 · 8 comments · Fixed by #600
Closed

Implement TableProviderFactory for a IcebergTableFactory #586

matthewmturner opened this issue Aug 26, 2024 · 8 comments · Fixed by #600
Assignees
Labels
good first issue Good for newcomers
Milestone

Comments

@matthewmturner
Copy link

matthewmturner commented Aug 26, 2024

I would like to be able to register iceberg tables with datafusion like so:

CREATE EXTERNAL TABLE my_table STORED AS ICEBERGTABLE LOCATION '/path/to/table';

If Iceberg provided a TableProviderFactory then we could register and use that (This is how i currently register deltalake tables).

@liurenjie1024
Copy link
Contributor

I think this is feasible because we have supported loading table from file systems directly:

pub struct StaticTable(Table);

And we can also utilize current table provider implementation.

@liurenjie1024 liurenjie1024 added the good first issue Good for newcomers label Aug 30, 2024
@liurenjie1024 liurenjie1024 added this to the 0.4.0 Release milestone Aug 30, 2024
@yukkit
Copy link
Contributor

yukkit commented Aug 31, 2024

Please assign it to me

@yukkit
Copy link
Contributor

yukkit commented Aug 31, 2024

Here are a few questions that need clarification:

  1. Is there a need to support specifying a version? @matthewmturner
  2. Where should the parameters for object storage, such as access-key-id and secret-access-key for S3, be obtained from? @liurenjie1024
    • StaticTable relies on FileIO.
    • For those using DataFusion, they can pre-construct an AmazonS3 ObjectStore and register it in RuntimeEnv. In the implementation of TableProviderFactory, the ObjectStore can be retrieved from the RuntimeEnv of the Session based on the location. However, FileIO cannot use ObjectStore.

Proposed Solution

For the first question:

  1. By default, use the latest version.
  2. [Optional] Support specifying a version, for example: CREATE EXTERNAL TABLE my_table STORED AS ICEBERGTABLE LOCATION '/path/to/table' OPTIONS ('version' '1');

For the second question:

  1. Specify the relevant parameters in the OPTIONS block of the DDL statement.
  2. Refactor FileIO so that it also can rely on ObjectStore as its storage backend.
  3. Register the FileIO used to read the location via a global variable.

@matthewmturner
Copy link
Author

@yukkit I am actually only familiar with the ObjectStore abstraction and not FileIO - ill need to look into that more. Prior to knowing that I had naively expected to leverage ObjectStore auth capabilities.

Regarding the version - im less familiar with iceberg so will defer to the iceberg teams recommendation.

@liurenjie1024
Copy link
Contributor

Hi, @yukkit Thanks for your interest and welcome to contribute!

For the first question:

  1. By default, use the latest version.
  2. [Optional] Support specifying a version, for example: CREATE EXTERNAL TABLE my_table STORED AS ICEBERGTABLE LOCATION '/path/to/table' OPTIONS ('version' '1');

As with this problem, how about we ask user to point the path to a table metadata file directly? The path + version approach is not clear about what this version means. For example, if the user is trying to reading a table managed by sql/hive catalogs, metadata file names are typically suffixed by uuids. Also version maybe confusing to users familiar with iceberg, since that typicall means snapshot version.

For the second question:

  1. Specify the relevant parameters in the OPTIONS block of the DDL statement.
  2. Refactor FileIO so that it also can rely on ObjectStore as its storage backend.
  3. Register the FileIO used to read the location via a global variable.

I'm +1 for option 1. While 2 and 3 are worth dicussiong options for improving FileIO, this use case doesn't seem a solid motivation. Also we don't need to ask user to provide credentials in config block, since currently opendal s3 operators could load for env: https://docs.rs/opendal/latest/opendal/services/struct.S3Config.html#structfield.disable_config_load . cc @Xuanwo to confirm.

@yukkit
Copy link
Contributor

yukkit commented Sep 2, 2024

@liurenjie1024 Thank you so much for your feedback! I agree with your points, and I will start working on it based on this approach.

@mkarbo
Copy link

mkarbo commented Oct 30, 2024

Will this enable insert into as well for iceberg?

@yukkit
Copy link
Contributor

yukkit commented Oct 31, 2024

Will this enable insert into as well for iceberg?

@mkarbo This feature is implemented solely to enable reading Iceberg data as an external table in DataFusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants