A curated list of awesome open source tools and commercial products to catalog, version, and manage data.
- Amundsen: Data discovery and metadata engine for improving productivity when interacting with data.
- Apache Atlas: Provides open metadata management and governance capabilities to build a data catalog.
- CKAN: Open-source DMS (data management system) for powering data hubs and data portals.
- DataHub: LinkedIn's generalized metadata search & discovery tool.
- Datatile: A library for managing, validating, summarizing, and visualizing data.
- Delta Lake: Storage layer that brings scalable, ACID transactions to Apache Spark and other engines.
- Dolt: SQL database that you can fork, clone, branch, merge, push and pull just like a Git repository.
- DVC: Management and versioning of datasets and machine learning models.
- Hub: A dataset format for creating, storing, and collaborating on AI datasets of any size.
- Intake: A lightweight package for finding, investigating, loading and disseminating data.
- Quilt: A self-organizing data hub with S3 support.
- lakeFS: Repeatable, atomic and versioned data lake on top of object storage.
- Magda: A federated, open-source data catalog for all your big data and small data.
- Marquez: Collect, aggregate, and visualize a data ecosystem's metadata.
- Metacat: Unified metadata exploration API service for Hive, RDS, Teradata, Redshift, S3 and Cassandra.
- Milvus: An open-source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy.
- OpenMetadata: A single place to discover, collaborate and get your data right.
- Spark: Unified analytics engine for large-scale data processing.