- Before running any metadata ingestion job, you should make sure that the DataHub backend services are all running. The easiest way to do that is through the Docker images.
- You also need to build the `mxe-schemas` module as below. This is needed to generate `MetadataChangeEvent.avsc`, which is the schema for the `MetadataChangeEvent` Kafka topic.

      ./gradlew :metadata-events:mxe-schemas:build

- Before launching each ETL ingestion pipeline, you can install/verify the library versions as below.

      pip install --user -r requirements.txt
The `mce_cli.py` script provides a convenient way to produce a list of MCEs from a data file. Every MCE in the data file should be on a single line. It also supports consuming from the `MetadataChangeEvent` topic.
    ➜ python mce_cli.py --help
    usage: mce_cli.py [-h] [-b BOOTSTRAP_SERVERS] [-s SCHEMA_REGISTRY]
                      [-d DATA_FILE] [-l SCHEMA_RECORD]
                      {produce,consume}

    Client for producing/consuming MetadataChangeEvent

    positional arguments:
      {produce,consume}     Execution mode (produce | consume)

    optional arguments:
      -h, --help            show this help message and exit
      -b BOOTSTRAP_SERVERS  Kafka broker(s) (localhost[:port])
      -s SCHEMA_REGISTRY    Schema Registry (http(s)://localhost[:port])
      -l SCHEMA_RECORD      Avro schema record; required if running 'produce' mode
      -d DATA_FILE          MCE data file; required if running 'produce' mode
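For example, to tail the `MetadataChangeEvent` topic against a local deployment, you could run the command below; the broker and schema registry addresses are assumptions for a default local setup, so adjust them to your environment.

```
➜ python mce_cli.py consume -b localhost:9092 -s http://localhost:8081
```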
To quickly ingest lots of sample data and test DataHub in action, run the `mce_cli.py` command below:
    ➜ python mce_cli.py produce -d bootstrap_mce.dat
    Producing MetadataChangeEvent records to topic MetadataChangeEvent. ^c to exit.
      MCE1: {"auditHeader": None, "proposedSnapshot": ("com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot", {"urn": "urn:li:corpuser:foo", "aspects": [{"active": True, "email": "foo@linkedin.com"}]}), "proposedDelta": None}
      MCE2: {"auditHeader": None, "proposedSnapshot": ("com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot", {"urn": "urn:li:corpuser:bar", "aspects": [{"active": False, "email": "bar@linkedin.com"}]}), "proposedDelta": None}
    Flushing records...
This will bootstrap DataHub with sample datasets and sample users.
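If you want to hand-craft additional records, here is a minimal sketch of appending one more MCE to a data file in the same one-record-per-line format. It assumes the file is read as Python literals (the tuples and `None`/`True` in the sample output above suggest this); the `urn:li:corpuser:baz` values and the `extra_mce.dat` filename are made-up examples.

```python
# Hedged sketch: append a CorpUserSnapshot MCE to a data file for mce_cli.py to produce from.
record = {
    "auditHeader": None,
    "proposedSnapshot": (
        "com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot",
        {"urn": "urn:li:corpuser:baz", "aspects": [{"active": True, "email": "baz@linkedin.com"}]},
    ),
    "proposedDelta": None,
}

with open("extra_mce.dat", "a") as f:
    f.write(repr(record) + "\n")  # keep each MCE on a single line
```

You could then ingest it with `python mce_cli.py produce -d extra_mce.dat`.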
The `ldap_etl` script provides an ETL channel to communicate with your LDAP server.

➜ Configure your LDAP server environment variables in the file:

    LDAPSERVER      # Your LDAP server host.
    BASEDN          # Base DN of the container location.
    LDAPUSER        # Your credential.
    LDAPPASSWORD    # Your password.
    PAGESIZE        # Pagination size.
    ATTRLIST        # Attributes to return, matching your model.
    SEARCHFILTER    # Filter used to build the search query.

➜ Configure your Kafka broker environment variables in the file:

    AVROLOADPATH    # Path to your model event schema in Avro format.
    KAFKATOPIC      # Your event topic.
    BOOTSTRAP       # Kafka bootstrap server.
    SCHEMAREGISTRY  # Kafka schema registry host.

➜ Run the pipeline:

    python ldap_etl.py

This will bootstrap DataHub with the metadata in your LDAP server as user entities.
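For a sense of what the directory-reading side of such a pipeline involves, here is a minimal sketch of a paged LDAP search driven by the variables above. It uses the `ldap3` library purely for illustration; the actual `ldap_etl.py` may use a different client, and the example values in the comments are assumptions.

```python
# Minimal sketch (not the actual ldap_etl.py): paged LDAP search using ldap3.
import os

from ldap3 import Connection, Server, SUBTREE

LDAPSERVER = os.environ["LDAPSERVER"]          # e.g. ldaps://ldap.example.com (example value)
BASEDN = os.environ["BASEDN"]                  # e.g. ou=People,dc=example,dc=com (example value)
LDAPUSER = os.environ["LDAPUSER"]
LDAPPASSWORD = os.environ["LDAPPASSWORD"]
PAGESIZE = int(os.environ.get("PAGESIZE", "500"))
ATTRLIST = os.environ["ATTRLIST"].split(",")   # e.g. cn,mail,departmentNumber (example value)
SEARCHFILTER = os.environ["SEARCHFILTER"]      # e.g. (objectClass=person) (example value)

conn = Connection(Server(LDAPSERVER), user=LDAPUSER, password=LDAPPASSWORD, auto_bind=True)

# Fetch the directory in PAGESIZE chunks so large trees don't hit server limits.
entries = conn.extend.standard.paged_search(
    search_base=BASEDN,
    search_filter=SEARCHFILTER,
    search_scope=SUBTREE,
    attributes=ATTRLIST,
    paged_size=PAGESIZE,
    generator=True,
)

for entry in entries:
    attrs = entry.get("attributes", {})
    # Here you would map each entry onto a CorpUserSnapshot-style MCE and hand it to a
    # Kafka Avro producer configured with AVROLOADPATH/KAFKATOPIC/BOOTSTRAP/SCHEMAREGISTRY.
    print(attrs)
```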
The `hive_etl` script provides an ETL channel to communicate with your Hive store.

➜ Configure your Hive store environment variable in the file:

    HIVESTORE       # Your Hive store host.

➜ Configure your Kafka broker environment variables in the file:

    AVROLOADPATH    # Path to your model event schema in Avro format.
    KAFKATOPIC      # Your event topic.
    BOOTSTRAP       # Kafka bootstrap server.
    SCHEMAREGISTRY  # Kafka schema registry host.

➜ Run the pipeline:

    python hive_etl.py

This will bootstrap DataHub with the metadata in your Hive store as dataset entities.
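As an illustration of the Hive-reading side, the sketch below walks databases, tables, and columns through HiveServer2. The `PyHive` client and the default port 10000 are assumptions made for the example; the actual `hive_etl.py` may differ.

```python
# Minimal sketch (not the actual hive_etl.py): crawl Hive metadata over HiveServer2.
import os

from pyhive import hive

HIVESTORE = os.environ["HIVESTORE"]  # your Hive store host, as configured above

cursor = hive.Connection(host=HIVESTORE, port=10000).cursor()

cursor.execute("SHOW DATABASES")
for (db,) in cursor.fetchall():
    cursor.execute("SHOW TABLES IN {}".format(db))
    for (table,) in cursor.fetchall():
        cursor.execute("DESCRIBE {}.{}".format(db, table))
        columns = cursor.fetchall()  # rows of (col_name, data_type, comment)
        # Here you would map the column rows onto a dataset MCE and produce it to Kafka.
        print(db, table, len(columns))
```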
The `kafka_etl` script provides an ETL channel to communicate with your Kafka cluster.

➜ Configure your Kafka environment variable in the file:

    ZOOKEEPER       # Your ZooKeeper host.

➜ Configure your Kafka broker environment variables in the file:

    AVROLOADPATH    # Path to your model event schema in Avro format.
    KAFKATOPIC      # Your event topic.
    BOOTSTRAP       # Kafka bootstrap server.
    SCHEMAREGISTRY  # Kafka schema registry host.

➜ Run the pipeline:

    python kafka_etl.py

This will bootstrap DataHub with the metadata in your Kafka cluster as dataset entities.
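As an illustration of the topic-discovery side, the sketch below lists Kafka topics by reading the `/brokers/topics` znodes from ZooKeeper. The `kazoo` client is assumed here for the example; the actual `kafka_etl.py` may differ.

```python
# Minimal sketch (not the actual kafka_etl.py): list topics registered in ZooKeeper.
import os

from kazoo.client import KazooClient

ZOOKEEPER = os.environ["ZOOKEEPER"]  # e.g. localhost:2181 (example value)

zk = KazooClient(hosts=ZOOKEEPER)
zk.start()

# Kafka registers each topic as a child znode under /brokers/topics.
for topic in zk.get_children("/brokers/topics"):
    # Here you would look up the topic's latest schema in the schema registry,
    # build a dataset MCE, and produce it to KAFKATOPIC.
    print(topic)

zk.stop()
```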
The `mysql_etl` script provides an ETL channel to communicate with your MySQL database.

➜ Configure your MySQL environment variables in the file:

    HOST            # Your server host.
    DATABASE        # Target database.
    USER            # Your user account.
    PASSWORD        # Your password.

➜ Configure your Kafka broker environment variables in the file:

    AVROLOADPATH    # Path to your model event schema in Avro format.
    KAFKATOPIC      # Your event topic.
    BOOTSTRAP       # Kafka bootstrap server.
    SCHEMAREGISTRY  # Kafka schema registry host.

➜ Run the pipeline:

    python mysql_etl.py

This will bootstrap DataHub with the metadata in your MySQL database as dataset entities.
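As an illustration of the MySQL-reading side, the sketch below pulls column-level metadata from `information_schema` using the variables above. The `PyMySQL` driver is assumed for the example; the actual `mysql_etl.py` may differ.

```python
# Minimal sketch (not the actual mysql_etl.py): read table/column metadata from MySQL.
import os

import pymysql

conn = pymysql.connect(
    host=os.environ["HOST"],
    user=os.environ["USER"],
    password=os.environ["PASSWORD"],
    database=os.environ["DATABASE"],
)

with conn.cursor() as cursor:
    cursor.execute(
        """
        SELECT table_name, column_name, data_type, column_comment
        FROM information_schema.columns
        WHERE table_schema = %s
        ORDER BY table_name, ordinal_position
        """,
        (os.environ["DATABASE"],),
    )
    for table_name, column_name, data_type, comment in cursor.fetchall():
        # Here you would group the rows by table into dataset MCEs and produce them to Kafka.
        print(table_name, column_name, data_type, comment)
```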
The `rdbms_etl` script provides an ETL channel to communicate with your RDBMS.

- It currently supports IBM DB2, Firebird, MSSQL Server, MySQL, Oracle, PostgreSQL, SQLite, and ODBC connections.
- Some platform-specific logic is modularized and needs to be implemented for your particular use case.

➜ Configure your RDBMS environment variables in the file:

    HOST            # Your server host.
    DATABASE        # Target database.
    USER            # Your user account.
    PASSWORD        # Your password.
    PORT            # Connection port.

➜ Configure your Kafka broker environment variables in the file:

    AVROLOADPATH    # Path to your model event schema in Avro format.
    KAFKATOPIC      # Your event topic.
    BOOTSTRAP       # Kafka bootstrap server.
    SCHEMAREGISTRY  # Kafka schema registry host.

➜ Run the pipeline:

    python rdbms_etl.py

This will bootstrap DataHub with the metadata in your RDBMS as dataset entities.
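As an illustration of how a generic RDBMS crawl can stay mostly platform-agnostic, the sketch below uses SQLAlchemy's inspector; swapping the dialect prefix in the connection URL is the kind of platform-specific piece mentioned above. The `mysql+pymysql` dialect is only an example, and the actual `rdbms_etl.py` may be structured differently.

```python
# Minimal sketch (not the actual rdbms_etl.py): crawl tables and columns via SQLAlchemy.
import os

from sqlalchemy import create_engine, inspect

# The dialect prefix (mysql+pymysql here) is the per-database/ODBC piece you would swap.
url = "mysql+pymysql://{USER}:{PASSWORD}@{HOST}:{PORT}/{DATABASE}".format(**os.environ)

inspector = inspect(create_engine(url))

for table_name in inspector.get_table_names():
    columns = inspector.get_columns(table_name)  # dicts with name, type, nullable, ...
    # Here you would map the columns onto a dataset MCE and produce it to Kafka.
    print(table_name, [(col["name"], str(col["type"])) for col in columns])
```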