Gobblin is a distributed big data integration framework (ingestion, replication, compliance, retention) for batch and streaming systems. Gobblin features integrations with Apache Hadoop, Apache Kafka, Salesforce, S3, MySQL, Google etc.

arjun4084346/gobblin

This branch is 4 commits ahead of, 68 commits behind apache/gobblin:master.

Latest commit: 546dc74 · Jul 23, 2024

Apache Gobblin

Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

  • Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is designed and optimized for ELT patterns, with lightweight inline transformations on ingest (the "small t").
  • Data Organization within the lake (e.g. compaction, partitioning, deduplication)
  • Lifecycle Management of data within the lake (e.g. data retention)
  • Compliance Management of data across the ecosystem (e.g. fine-grained data deletions)

Highlights

  • Battle-tested at scale: Runs in production at petabyte scale at companies like LinkedIn, PayPal, Verizon, etc.
  • Feature-rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance, etc.
  • Supports stream and batch execution modes
  • Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.

Common Patterns used in production

  • Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
  • Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
  • Data sync across federated data lakes (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
  • Integrating external vendor APIs (e.g. Salesforce, Dynamics, etc.) with data stores (HDFS, Couchbase, etc.)
  • Enforcing data retention policies and GDPR deletion on HDFS / ADLS

Apache Gobblin is NOT

  • A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data-processing tasks to Spark, Hive, etc.
  • A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
  • A general-purpose workflow execution system like Airflow, Azkaban, Dagster, or Luigi.

Requirements

  • Java >= 1.8

If building the distribution with tests turned on:

  • Maven version 3.5.3

Instructions to download gradle wrapper

If you are building Gobblin from the source distribution, run one of the following commands to download gradle-wrapper.jar from the Gobblin git repository into the gradle/wrapper directory (replace GOBBLIN_VERSION in the URL with the version you downloaded).

wget --no-check-certificate -P gradle/wrapper https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar

(or)

curl --insecure -L https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar > gradle/wrapper/gradle-wrapper.jar

Alternatively, you can download it manually from: https://github.com/apache/gobblin/blob/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar

Make sure that you download it to the gradle/wrapper directory.
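
The download step above can be wrapped in a small script. This is only a sketch: the function names wrapper_url and fetch_wrapper are hypothetical helpers, not part of Gobblin, and the script uses curl with -f (fail on HTTP errors) in place of the --insecure flag shown earlier, which is a substitution made here rather than something the project prescribes. It downloads only when GOBBLIN_VERSION is set in the environment.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: build the raw-download URL for gradle-wrapper.jar
# at a given Gobblin version (tag or branch name).
wrapper_url() {
  local version="$1"
  printf 'https://github.com/apache/gobblin/raw/%s/gradle/wrapper/gradle-wrapper.jar\n' "$version"
}

# Hypothetical helper: download the jar into gradle/wrapper,
# creating the directory first if it does not exist.
fetch_wrapper() {
  local version="$1"
  mkdir -p gradle/wrapper
  # -f: fail on HTTP errors, -L: follow GitHub's redirect to the raw file.
  curl -fL "$(wrapper_url "$version")" -o gradle/wrapper/gradle-wrapper.jar
}

# Download only when the caller has provided a version.
if [ -n "${GOBBLIN_VERSION:-}" ]; then
  fetch_wrapper "$GOBBLIN_VERSION"
fi
```

Run it from the root of the extracted source tree with GOBBLIN_VERSION exported, so that the jar lands in gradle/wrapper relative to the gradlew script.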

Instructions to run Apache RAT (Release Audit Tool)

  1. Extract the archive file to your local directory.
  2. Run ./gradlew rat. The report is generated at build/rat/rat-report.html.

Instructions to build the distribution

  1. Extract the archive file to your local directory.
  2. To skip tests and build the distribution, run ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain. The distribution is created in the build/gobblin-distribution/distributions directory.
  3. Alternatively, to run tests and build the distribution (requires Maven), run ./gradlew build. The distribution is created in the same build/gobblin-distribution/distributions directory.
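
The two build variants above differ only in the Gradle arguments, which the following sketch makes explicit. The build_args function and the RUN_BUILD and RUN_TESTS environment variables are hypothetical names introduced for illustration; only the ./gradlew invocations themselves come from the steps above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the Gradle arguments for a build.
# Pass "true" to run tests; any other value (or nothing) skips tests
# along with the findbugs, rat, and checkstyle tasks.
build_args() {
  local run_tests="${1:-false}"
  if [ "$run_tests" = "true" ]; then
    echo "build"
  else
    echo "build -x findbugsMain -x test -x rat -x checkstyleMain"
  fi
}

# Invoke the build only when explicitly requested via RUN_BUILD=1.
if [ "${RUN_BUILD:-}" = "1" ]; then
  # Word splitting is intentional: each argument must reach gradlew separately.
  # shellcheck disable=SC2046
  ./gradlew $(build_args "${RUN_TESTS:-false}")
fi
```

For example, RUN_BUILD=1 RUN_TESTS=true would run the full build with tests, while RUN_BUILD=1 alone would produce the distribution without them.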
