
Conduct market research on c̶a̶t̶a̶l̶o̶g̶s̶ metastores #141

Open
astrojuanlu opened this issue Jun 18, 2024 · 8 comments

Comments

@astrojuanlu
Member

At the moment, each platform offers full read and write capabilities only for its own catalog, and read-only capabilities for its competitors' catalogs:

image

(source)

And what's more important: data catalogs aren't new, but we're seeing catalogs created for different use cases and business needs: technical, business, and operational (source).

These are just some of the open source ones [1] that have been in the news recently. But there's also Apache Nessie, the Hive Metastore, the Iceberg REST Catalog, and probably others I'm missing. Then there are the commercial, vendor-driven ones.

And then we have... the Kedro Catalog!

We've occasionally been asked "how does the Kedro Catalog compare to the Unity Catalog?" The answer is that they're complementary, but this is not immediately clear to users (see kedro-org/kedro-plugins#542).

It's very clear that this is going to be a hot topic of discussion in the data engineering space in the coming months, so we should have a good answer to how Kedro interacts with all of these.

Footnotes

  1. counting Polaris as open source

@datajoely

So I think it would be fair to say most mature enterprises will start to align around these table catalogs mostly for governance reasons, but also the other benefits afforded by the open table formats they use.

Confusingly this is all slightly different to a category sometimes called Enterprise Data Catalogs which cover players like Alation, Amundsen, Atlan. These will try and position themselves as a 'catalog of catalogs':
image

UC/Polaris open source may change their market position

From a UX point of view I think it would be nice for a user to configure their connections settings.py in [how we manage env 🤷] and then suddenly all datasets available to those credentials become available in our catalog with minimal boilerplate:

my_dataset:
   type: UnityCatalogDataset
   # no credentials or other stuff needed
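A minimal sketch of how that could work, under loud assumptions: `UnityCatalogDataset` does not exist yet, `UC_HOST` and `UC_TOKEN` are made-up variable names, and `fake_list_tables` stands in for whatever call would actually ask the metastore which tables the credentials can see. The idea is only that the connection is resolved once and the catalog entries are generated from it, so no entry carries credentials.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Mapping, Optional


@dataclass(frozen=True)
class Connection:
    """Connection settings resolved once, outside the catalog entries."""

    host: str
    token: str


def connection_from_env(env: Optional[Mapping[str, str]] = None) -> Connection:
    # UC_HOST and UC_TOKEN are hypothetical names chosen for this sketch.
    env = os.environ if env is None else env
    return Connection(host=env["UC_HOST"], token=env["UC_TOKEN"])


def discover_entries(
    conn: Connection,
    list_tables: Callable[[Connection], Iterable[str]],
) -> Dict[str, dict]:
    """Build one catalog entry per table visible to the given connection."""
    entries: Dict[str, dict] = {}
    for full_name in list_tables(conn):
        # "main.sales.orders" becomes the dataset name "main_sales_orders".
        dataset_name = full_name.replace(".", "_")
        # The entry holds no credentials: only the (hypothetical) type and table.
        entries[dataset_name] = {"type": "UnityCatalogDataset", "table": full_name}
    return entries


def fake_list_tables(conn: Connection) -> Iterable[str]:
    # Stand-in for a real metastore lookup; returns fixed, made-up table names.
    return ["main.sales.orders", "main.sales.customers"]
```

Calling `discover_entries(connection_from_env(), fake_list_tables)` would return a dictionary shaped like the YAML above, one key per table.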

@astrojuanlu
Member Author

> So I think it would be fair to say most mature enterprises will start to align around these table catalogs mostly for governance reasons, but also the other benefits afforded by the open table formats they use.
>
> Confusingly this is all slightly different to a category sometimes called Enterprise Data Catalogs which cover players like Alation, Amundsen, Atlan. These will try and position themselves as a 'catalog of catalogs':

Definitely. Someone (perhaps us) needs to bring some clarity to the terminology. I really like this exercise on orchestrators (infrastructure, scheduling, asset), for example.

> From a UX point of view I think it would be nice for a user to configure their connections [...] and then suddenly all datasets available to those credentials become available in our catalog with minimal boilerplate:

We've traditionally coupled the location of the data with its in-memory type, which gave us a simple design that nonetheless has significant flaws (kedro-org/kedro#1936 (comment), kedro-org/kedro#770 (comment)). If anything, we should do a better job of separating the two: node functions are tightly tied to the in-memory representation of the data (a node that expects a pd.DataFrame will likely not work with anything else), whereas the transport/location is a matter of configuration (it could be Unity Catalog in one case and blob storage, plain disk storage, an MLflow run, or memory in another).

But this is a topic for kedro-org/kedro#1936.
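To make the separation of in-memory type and location concrete, here is a rough sketch using only the standard library. None of these names (`Transport`, `Codec`, `Dataset`, `LocalFile`, `InMemory`, `JsonRecords`) are Kedro APIs; they are invented for illustration, and JSON records stand in for a DataFrame so the example needs no third-party packages.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Protocol


class Transport(Protocol):
    """Where the bytes live: disk, blob storage, a metastore, memory..."""

    def read_bytes(self) -> bytes: ...

    def write_bytes(self, payload: bytes) -> None: ...


class Codec(Protocol):
    """How bytes map to the in-memory object that node functions expect."""

    def decode(self, payload: bytes) -> Any: ...

    def encode(self, obj: Any) -> bytes: ...


@dataclass
class LocalFile:
    path: Path

    def read_bytes(self) -> bytes:
        return self.path.read_bytes()

    def write_bytes(self, payload: bytes) -> None:
        self.path.write_bytes(payload)


@dataclass
class InMemory:
    payload: bytes = b""

    def read_bytes(self) -> bytes:
        return self.payload

    def write_bytes(self, payload: bytes) -> None:
        self.payload = payload


class JsonRecords:
    """Codec for a list of dicts, standing in for a DataFrame-like type."""

    def decode(self, payload: bytes) -> Any:
        return json.loads(payload.decode("utf-8"))

    def encode(self, obj: Any) -> bytes:
        return json.dumps(obj).encode("utf-8")


@dataclass
class Dataset:
    codec: Codec          # fixed by what the node function expects
    transport: Transport  # free to change through configuration

    def load(self) -> Any:
        return self.codec.decode(self.transport.read_bytes())

    def save(self, obj: Any) -> None:
        self.transport.write_bytes(self.codec.encode(obj))
```

Swapping `InMemory()` for `LocalFile(Path("records.json"))` changes where the data lives without touching the type the node receives, which is the kind of split discussed above.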

@astrojuanlu
Member Author

Update: Unity Catalog now has a UI https://github.com/unitycatalog/unitycatalog-ui

image

There are currently 3 documented integrations (Daft, DuckDB, Trino) http://docs.unitycatalog.io/

And it looks like Spark 4.0 + Delta Lake 4.0 might give a better view of how to use Unity Catalog https://books.japila.pl/unity-catalog-internals/spark-integration/#demo

But

@astrojuanlu astrojuanlu changed the title Conduct market research on catalogs Conduct market research on ~catalogs~ metastores Aug 2, 2024
@astrojuanlu astrojuanlu changed the title Conduct market research on ~catalogs~ metastores Conduct market research on ~~catalogs~~ metastores Aug 2, 2024
@astrojuanlu astrojuanlu changed the title Conduct market research on ~~catalogs~~ metastores Conduct market research on c̶a̶t̶a̶l̶o̶g̶s̶ metastores Aug 2, 2024
@astrojuanlu
Member Author

Polaris was open sourced 3 days ago, see apache/polaris#2, blog announcement https://www.snowflake.com/blog/polaris-catalog-open-source/

Seems to be based on Apache Nessie by Dremio?

@astrojuanlu
Member Author

Of note: both Unity Catalog and Polaris implement Apache Iceberg's REST catalog specification. Looks like we have a winner in terms of HTTP APIs at least.
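For reference, a small sketch of what talking to such an API might look like from Python. The paths (`/v1/{prefix}/namespaces` and `/v1/{prefix}/namespaces/{namespace}/tables`) and the unit-separator encoding of multi-level namespaces follow my reading of the Iceberg REST catalog specification. The base URL, the prefix, and the bearer-token handling are assumptions that would differ per server, and the helper names are made up.

```python
import json
from typing import Any, List, Optional
from urllib.parse import quote
from urllib.request import Request, urlopen

# Multi-level namespaces are joined with the unit separator (0x1F) in URLs.
NAMESPACE_SEPARATOR = "\x1f"


def _root(base_url: str, prefix: Optional[str]) -> str:
    parts = [base_url.rstrip("/"), "v1"]
    if prefix:
        # The optional prefix is advertised by the server's config endpoint.
        parts.append(prefix.strip("/"))
    return "/".join(parts)


def namespaces_url(base_url: str, prefix: Optional[str] = None) -> str:
    """URL for listing namespaces."""
    return f"{_root(base_url, prefix)}/namespaces"


def tables_url(base_url: str, namespace: List[str], prefix: Optional[str] = None) -> str:
    """URL for listing the tables of a (possibly multi-level) namespace."""
    encoded = quote(NAMESPACE_SEPARATOR.join(namespace), safe="")
    return f"{_root(base_url, prefix)}/namespaces/{encoded}/tables"


def fetch_json(url: str, token: Optional[str] = None) -> Any:
    """GET a URL and parse the JSON body; bearer auth is an assumption here."""
    headers = {"Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    with urlopen(Request(url, headers=headers)) as response:
        return json.load(response)
```

If both servers really honour the same specification, the same `tables_url(...)` call should work against either one, with only `base_url` and `prefix` changing.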

@astrojuanlu
Member Author

Some questions I posed about UC unitycatalog/unitycatalog#208 (comment)

@astrojuanlu
Member Author

> While other vendor-hosted catalogs deviate from the open source specification, which leads to lock-in, Snowflake’s service for Polaris Catalog is designed to be fully compatible with Polaris Catalog’s open source implementation both now and in the future.

@astrojuanlu
Member Author

  • A good, neutral review https://blog.min.io/catalogs-it-moment/ (a few months old already)
  • And I recall @noklam found a LinkedIn comment that explained how Iceberg requires a separate metastore (be it Hive, Glue, or something else), but I can't locate it anymore

2 participants