Add Iceberg Glue and Hadoop catalog support #9646
Marking this as a draft for now; I will also publish the Hadoop catalog part here. I think that after the 2 PRs linked above are merged, 1 more PR is needed for refactoring the catalog factory, and then 2 more PRs for Glue and Hadoop sound like the right division. Please let me know what you think.
Thanks, @jackye1995, for this PR. We are looking forward to this feature. Any high-level timeline would help. Thanks again.
@haldes working on this right now, will publish in a few hours.
Published #10151 for anyone interested. That patch can be used directly while we are reviewing; I have run all tests and the Glue tests against it.
@jackye1995 thanks. Is Hadoop catalog support also coming along with this? Any updates on that would be helpful.
I am noticing that people are starting to abuse HadoopCatalog. HadoopCatalog was developed in Iceberg mostly for testing purposes and has never been recommended for production use because it has a lot of limitations. I am discussing with the community what the best way to proceed is. Maybe it's not a good idea to continue adding support for it; let's see.
@jackye1995 I'm definitely biased in this, but we have a real use case that requires our data (including catalog / metastore) to be self-contained on our datalake and not depend on an external service of any kind. We chose Hadoop catalog because it seemed to be the simplest option, but if there are any options that we may have overlooked, I'd love to hear about them. Having said that, I would also argue that I prefer having the Hadoop catalog option implemented, with documentation that states "not recommended for prod - use at your own risk". That at least gives us a jumping-off point to hopefully improve both Iceberg and Trino to a point where maybe Hadoop catalog becomes viable for prod. Just my 2¢...
@jackye1995 I agree with @cccs-tom, as we also have a use case in which we will need the Hadoop catalog. Unfortunately, we are in a situation where we will not be able to access an external HMS, and this feature will really help us.
@jackye1995 With the majority of the functionality supported and the community already using Iceberg in Hadoop catalog mode, there is no point in restricting support only in the Trino connector. I would say Trino should be open to supporting the Hadoop catalog (or any Iceberg-supported catalog), with Iceberg calling out the limitations of each catalog in its documentation. I would love to hear about any concerns that are specific to adding it to Trino.
For context, the main conversation happened in the ASF Slack Iceberg channel, link: https://the-asf.slack.com/archives/CF01LKV9S/p1638208838030200
I think, as @findepi said, we cannot really stop people if there is strong demand for the Hadoop catalog, so adding support and marking it as not recommended would likely be the best way forward.
@RameshByndoor @haldes @cccs-tom we had a discussion on this topic in the Iceberg community sync yesterday, and I think we reached some agreement: (1) we will continue to add support for the Hadoop catalog in Trino. We cannot stop people from adding it anyway. I was thinking more about whether we could build support on top of the FileMetastore already available in Trino, but I think there are still some key semantic differences between the two. @findepi please let me know when we can make some progress on #10151; after that, Hadoop catalog support will be a short PR that I already have locally.
That sounds amazing. Thanks again for all your hard work @jackye1995.
Thanks, @jackye1995
Superseded by #10151, plus a potential future PR for Hadoop.
Draft based on #9584 and #9614 for adding Glue and Hadoop catalog support.
Refactoring step is specified in https://docs.google.com/document/d/1GCOzfMCofQ87ZLyi1_3yL4P3Jn3D2lKgztek9KfE1Mo/edit#heading=h.eei1d7r863wy