
Kafka HDFS Ingestion


Job constructs

Source and Extractor

Gobblin provides two abstract classes, KafkaSource and KafkaExtractor. KafkaSource creates a workunit for each Kafka topic partition to be pulled, then merges and groups the workunits based on the desired number of workunits, specified by the property mr.job.max.mappers (this property is used in both standalone and MR mode). KafkaExtractor extracts the partitions assigned to a workunit, pulling each partition from its specified low watermark up to its high watermark.

To use them in a Kafka-HDFS ingestion job, one should subclass KafkaExtractor and implement the method decodeRecord(MessageAndOffset), which takes a MessageAndOffset object pulled from the Kafka broker and decodes it into the desired object. One should also subclass KafkaSource and implement getExtractor(WorkUnitState), which should return an instance of the Extractor subclass.
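For illustration, below is a rough sketch of such a pair for a job that treats each message payload as a UTF-8 string. The names KafkaStringExtractor and KafkaStringSource are made up for this example, and the exact constructors, generics, and method signatures of KafkaSource/KafkaExtractor may differ between Gobblin versions, so treat this as a sketch rather than a definitive implementation.

```java
// Illustrative sketch only: a hypothetical extractor/source pair that decodes each
// Kafka message payload into a UTF-8 string. Package names and signatures follow the
// Gobblin APIs described above but may differ slightly between Gobblin versions.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import kafka.message.MessageAndOffset;

import gobblin.configuration.WorkUnitState;
import gobblin.source.extractor.Extractor;
import gobblin.source.extractor.extract.kafka.KafkaExtractor;
import gobblin.source.extractor.extract.kafka.KafkaSource;

public class KafkaStringExtractor extends KafkaExtractor<String, String> {

  public KafkaStringExtractor(WorkUnitState state) {
    super(state);
  }

  @Override
  public String getSchema() {
    // Plain strings carry no real schema; return a constant placeholder.
    return "string";
  }

  @Override
  protected String decodeRecord(MessageAndOffset messageAndOffset) throws IOException {
    // The payload arrives as a ByteBuffer; copy it out and decode it as UTF-8.
    ByteBuffer payload = messageAndOffset.message().payload();
    byte[] bytes = new byte[payload.remaining()];
    payload.get(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }
}

// The matching source simply hands each workunit to the extractor above.
class KafkaStringSource extends KafkaSource<String, String> {

  @Override
  public Extractor<String, String> getExtractor(WorkUnitState state) throws IOException {
    return new KafkaStringExtractor(state);
  }
}
```

The job would then point the source.class property at the new source class.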

Gobblin currently provides two pairs of concrete implementations: KafkaSimpleSource/KafkaSimpleExtractor and KafkaAvroSource/KafkaAvroExtractor.

KafkaSimpleExtractor simply returns the payload of the MessageAndOffset object as a byte array. A job that uses KafkaSimpleExtractor may use a Converter to convert the byte array to whatever format is desired. For example, if the desired output format is JSON, one may implement a ByteArrayToJsonConverter that converts the byte array to JSON. Alternatively, one may implement a KafkaJsonExtractor, which extends KafkaExtractor and converts the MessageAndOffset object into a JSON object in its decodeRecord method. Both approaches should work equally well.
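As a sketch of the converter-based approach, a hypothetical ByteArrayToJsonConverter might look like the following. The converter name and the choice of Gson for JSON parsing are assumptions made for this example, and the exact Converter method signatures (for instance, whether convertRecord returns a single record or an Iterable) depend on the Gobblin version.

```java
// Illustrative sketch of the converter-based approach: the extractor keeps returning
// byte[] records, and this (hypothetical) converter turns each one into a JsonObject.
import java.nio.charset.StandardCharsets;
import java.util.Collections;

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

import gobblin.configuration.WorkUnitState;
import gobblin.converter.Converter;

public class ByteArrayToJsonConverter extends Converter<String, String, byte[], JsonObject> {

  @Override
  public String convertSchema(String inputSchema, WorkUnitState workUnit) {
    // No schema transformation is needed; pass the schema through unchanged.
    return inputSchema;
  }

  @Override
  public Iterable<JsonObject> convertRecord(String outputSchema, byte[] inputRecord,
      WorkUnitState workUnit) {
    // Decode the payload as UTF-8 text and parse it into a JSON object.
    JsonObject json = new JsonParser()
        .parse(new String(inputRecord, StandardCharsets.UTF_8))
        .getAsJsonObject();
    return Collections.singleton(json);
  }
}
```

The converter would then be listed in the job's converter.classes property.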

KafkaAvroExtractor decodes the payload of the MessageAndOffset object into an Avro GenericRecord object. It requires that byte 0 of the payload be 0, that bytes 1-16 be a 16-byte schema ID, and that the remaining bytes be the encoded Avro record. It also requires a schema registry that returns the Avro schema for a given schema ID, which is used to decode the byte array. Thus this class is mainly applicable to LinkedIn's internal Kafka clusters.
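To make that payload layout concrete, decoding a record under this convention might look roughly like the sketch below. The SchemaRegistry interface here is a hypothetical stand-in for whatever registry client is available; it is not a Gobblin API.

```java
// Illustrative decoding of the payload layout described above:
//   byte 0      : magic byte, expected to be 0
//   bytes 1-16  : 16-byte schema ID
//   bytes 17... : Avro-encoded record
// The SchemaRegistry interface below is a hypothetical stand-in for a real registry client.
import java.io.IOException;
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

interface SchemaRegistry {
  Schema getSchemaById(byte[] schemaId) throws IOException;
}

public class AvroPayloadDecoder {

  private static final int SCHEMA_ID_LENGTH = 16;

  private final SchemaRegistry registry;

  public AvroPayloadDecoder(SchemaRegistry registry) {
    this.registry = registry;
  }

  public GenericRecord decode(byte[] payload) throws IOException {
    if (payload[0] != 0) {
      throw new IOException("Unexpected magic byte: " + payload[0]);
    }
    // Bytes 1-16 identify the writer schema in the registry.
    byte[] schemaId = Arrays.copyOfRange(payload, 1, 1 + SCHEMA_ID_LENGTH);
    Schema writerSchema = this.registry.getSchemaById(schemaId);

    // The remaining bytes are the binary-encoded Avro record.
    int offset = 1 + SCHEMA_ID_LENGTH;
    BinaryDecoder decoder =
        DecoderFactory.get().binaryDecoder(payload, offset, payload.length - offset, null);
    return new GenericDatumReader<GenericRecord>(writerSchema).read(null, decoder);
  }
}
```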

Writer and Publisher

For the Writer and Publisher, one may use the AvroHdfsDataWriter and the BaseDataPublisher, as in the Wikipedia example job. They publish the records pulled in each task as Avro files, with each task writing to a different folder. Gobblin also provides an AvroHdfsTimePartitionedWriter and a TimePartitionedDataPublisher. They publish records based on the timestamps of the records, which means records pulled in the same task may be published to different folders, and records pulled in different tasks may be published to the same folder.

Important Job config properties

Job Config

Below is a sample job config file.
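The sketch below is illustrative rather than definitive: the values are placeholders, and the property names and fully qualified class names should be checked against the Gobblin version in use.

```properties
# Illustrative Kafka-to-HDFS pull file; values are placeholders and class names
# should be verified against the Gobblin version in use.
job.name=PullFromKafka
job.group=Kafka
job.description=Pull records from Kafka topics and write them to HDFS as Avro

# Source and extraction: a KafkaSource subclass, the brokers to pull from,
# the topics to include, and the desired number of workunits
source.class=gobblin.source.extractor.extract.kafka.KafkaAvroSource
extract.namespace=gobblin.extract.kafka
kafka.brokers=localhost:9092
kafka.schema.registry.url=http://localhost:12181
topic.whitelist=MyTopic
bootstrap.with.offset=earliest
mr.job.max.mappers=8

# Writer and publisher (an Avro HDFS writer and the base publisher, as described above);
# the writer must match the record type produced by the source and any converters
writer.builder.class=gobblin.writer.AvroDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=gobblin.publisher.BaseDataPublisher

# Working and output directories on HDFS
writer.staging.dir=/gobblin/task-staging
writer.output.dir=/gobblin/task-output
data.publisher.final.dir=/gobblin/job-output
```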

Launch Job

Launching the job in standalone mode involves steps similar to those for the Wikipedia example job. The job can also be launched in MR mode. See the Deployment page for more details.
