ASF Proposal
Hudi is a big-data storage library that provides atomic upserts and incremental data consumption.
Hudi manages data stored in Apache Hadoop and other API compatible distributed file systems/cloud stores.
Hudi provides the ability to atomically upsert datasets with new values in near-real time, making data available quickly to existing query engines like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a sequence of changes to a dataset from a given point-in-time to enable incremental data pipelines that yield greater efficiency & lower latency than their typical batch counterparts. By carefully managing the number and size of files, Hudi greatly aids both query engines (e.g: always providing well-sized files) and underlying storage (e.g: HDFS NameNode memory consumption).
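As an illustration of the upsert primitive, a minimal sketch of writing changed records into a Hudi dataset through the Spark DataSource integration might look as follows. The table name, field names, and paths are hypothetical, and the option keys assume a recent Hudi release, so they may differ across versions.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Assumes the hudi-spark bundle is on the classpath. `updates` is a DataFrame
// of changed records keyed by a `uuid` column, partitioned by `ds`, with a
// `ts` column used to pick the latest value when keys collide.
val spark = SparkSession.builder()
  .appName("hudi-upsert-sketch")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val updates = spark.read.json("/tmp/incoming_batch") // hypothetical input

updates.write
  .format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "ds")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)   // append a new commit to the dataset
  .save("/data/trips")     // base path on HDFS or a cloud store
```

Records with matching keys are rewritten into well-sized files while new records are inserted, and the resulting commit becomes visible atomically to existing query engines.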
Hudi is largely implemented as an Apache Spark library that reads/writes data from/to a Hadoop-compatible filesystem. SQL queries on Hudi datasets are supported via specialized Apache Hadoop input formats that understand Hudi’s storage layout. Currently, Hudi manages datasets using a combination of Apache Parquet & Apache Avro file/serialization formats.
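For incremental consumption, a similarly hedged sketch (same assumed session and dataset as above, placeholder paths and instant time, option keys per recent releases) reads only the records that changed after a given commit instant, which is what enables the incremental pipelines described above.

```scala
// `beginTime` would normally come from the dataset's commit timeline; the
// literal below is a placeholder instant in Hudi's yyyyMMddHHmmss format.
val beginTime = "20190101000000"

val changes = spark.read
  .format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", beginTime)
  .load("/data/trips")

// Downstream jobs can transform just the changed rows instead of
// rescanning whole partitions.
changes.createOrReplaceTempView("trips_changes")
spark.sql("SELECT uuid, ds, ts FROM trips_changes").show()
```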
Apache Hadoop distributed filesystem (HDFS) & other compatible cloud storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve as longer-term analytical storage for thousands of organizations. Typical analytical datasets are built by reading data from a source (e.g: upstream databases, messaging buses, or other datasets), transforming the data, writing the results back to storage, & making them available for analytical queries--all of this typically accomplished in batch jobs which operate in a bulk fashion on partitions of datasets. Such a style of processing typically incurs large delays in making data available to queries, as well as a lot of complexity in carefully partitioning datasets to guarantee latency SLAs.
The need for fresher/faster analytics has increased enormously in the past few years, as evidenced by the popularity of stream processing systems like Apache Spark and Apache Flink, and messaging systems like Apache Kafka. By using an updateable state store to incrementally compute & instantly reflect new results to queries, and a “tailable” messaging bus to publish these results to other downstream jobs, such systems employ a different approach to building analytical datasets. Even though this approach yields low latency, the amount of data managed in such real-time data-marts is typically limited in comparison to the aforementioned longer-term storage options. As a result, the overall data architecture has become more complex, with more moving parts and specialized systems, leading to duplication of data and a strain on usability.
Hudi takes a hybrid approach. Instead of moving vast amounts of batch data to streaming systems, we simply add the streaming primitives (upserts & incremental consumption) onto existing batch processing technologies. We believe that by adding some missing blocks to the existing Hadoop stack, we are able to provide similar capabilities right on top of Hadoop, at reduced cost and with increased efficiency, greatly simplifying the overall architecture in the process.
Hudi was originally developed at Uber (original name “Hoodie”) to address broad inefficiencies in ingest, ETL & ML pipelines across Uber’s data ecosystem that required the upsert & incremental consumption primitives supported by Hudi.
We truly believe the capabilities supported by Hudi will be increasingly useful for big-data ecosystems, as data volumes & the need for faster data continue to increase. A detailed description of target use-cases can be found at https://uber.github.io/hudi/use_cases.html.
Given our reliance on so many great Apache projects, we believe that the Apache way of open source community driven development will enable us to evolve Hudi in collaboration with a diverse set of contributors who can bring new ideas into the project.
- Move the existing codebase, website, documentation, and mailing lists to an Apache-hosted infrastructure.
- Integrate with the Apache development process.
- Ensure all dependencies are compliant with Apache License version 2.0.
- Incrementally develop and release per Apache guidelines.
Hudi is a stable project, used in production at Uber since 2016 and open sourced under the Apache License, Version 2.0 in 2017. At Uber, Hudi manages 4000+ tables holding several petabytes, and over the past two years has brought our Hadoop warehouse from several hours of data delay to under 30 minutes. The source code is currently hosted at github.com (https://github.com/uber/hudi), which will seed the Apache git repository.
- Meritocracy:
- Community:
- Core Developers:
- Alignment:
- Orphaned products:
- Inexperience with Open Source:
- Length of Incubation:
- Homogenous Developers:
- Reliance on Salaried Developers:
- Relationships with Other Apache Products:
- An Excessive Fascination with the Apache Brand:
[1] Detailed documentation can be found at https://uber.github.io/hudi/
The codebase is currently hosted on GitHub: https://github.com/uber/hudi. During incubation, the codebase will be migrated to Apache infrastructure. The source code is already Apache 2.0 licensed.
Current code is Apache 2.0 licensed and the copyright is assigned to Uber. If the project enters the incubator, Uber will transfer the source code & trademark ownership to the ASF via a Software Grant Agreement.
Non-Apache dependencies are listed below:
- JCommander (1.48) Apache-2.0
- Kryo (4.0.0) BSD-2-Clause
- Kryo (2.21) BSD-3-Clause
- Jackson-annotations (2.6.4) Apache-2.0
- Jackson-annotations (2.6.5) Apache-2.0
- jackson-databind (2.6.4) Apache-2.0
- jackson-databind (2.6.5) Apache-2.0
- Jackson datatype: Guava (2.9.4) Apache-2.0
- docker-java (3.1.0-rc-3) Apache-2.0
- Guava: Google Core Libraries for Java (20.0) Apache-2.0
- bijection-avro (0.9.2) Apache-2.0
- com.twitter.common:objectsize (0.0.12) Apache-2.0
- Ascii Table (0.2.5) Apache-2.0
- config (3.0.0) Apache-2.0
- utils (3.0.0) Apache-2.0
- kafka-avro-serializer (3.0.0) Apache-2.0
- kafka-schema-registry-client (3.0.0) Apache-2.0
- Metrics Core (3.1.1) Apache-2.0
- Graphite Integration for Metrics (3.1.1) Apache-2.0
- Joda-Time (2.9.6) Apache-2.0
- JUnit CPL-1.0
- Awaitility (3.1.2) Apache-2.0
- jersey-connectors-apache (2.17) GPL-2.0-only CDDL-1.0
- jersey-container-servlet-core (2.17) GPL-2.0-only CDDL-1.0
- jersey-core-server (2.17) GPL-2.0-only CDDL-1.0
- htrace-core (3.0.4) Apache-2.0
- Mockito (1.10.19) MIT
- scalatest (3.0.1) Apache-2.0
- Spring Shell (1.2.0.RELEASE) Apache-2.0
No cryptographic libraries are used.
- private@hudi.incubator.apache.org (with moderated subscriptions)
- dev@hudi.incubator.apache.org
- commits@hudi.incubator.apache.org
- user@hudi.incubator.apache.org
Git is the preferred source control system: git://git.apache.org/incubator-hudi
We prefer to use the Apache GitBox integration to sync GitHub & Apache infrastructure, and to rely on GitHub issues & pull requests for community engagement. If this is not possible, then we prefer JIRA: Hudi (HUDI).
- Vinoth Chandar ( vinoth at uber dot com)
- Nishith Agarwal (nagarwal at uber dot com)
- Balaji Varadarajan (varadarb at uber dot com)
- Prasanna Rajaperumal (prasanna dot raj at gmail dot com)
- Zeeshan Qureshi (zeeshan dot qureshi at shopify dot com)
- Anbu Cheeralan (alunarbeach at gmail dot com)
- Jiale Tan (jiale dot tan at vungle dot com)
- Vinoth Chandar (Uber)
- Nishith Agarwal (Uber)
- Balaji Varadarajan (Uber)
- Prasanna Rajaperumal (Snowflake)
- Zeeshan Qureshi (Shopify)
- Anbu Cheeralan (DoubleVerify)
- Jiale Tan (Vungle)
Zheng Shao (zshao at apache dot org, Apache Hadoop PMC Member)
Julien Le Dem
Kishore Gopalakrishna (kishoreg at apache dot org)
the Incubator PMC