Metadata Ingestion

Prerequisites

  1. Before running any metadata ingestion job, make sure that all DataHub backend services are up and running. The easiest way to do that is through the Docker images.
  2. You also need to build the mxe-schemas module as below.
    ./gradlew :metadata-events:mxe-schemas:build
    
    This is needed to generate MetadataChangeEvent.avsc, which is the Avro schema for the MetadataChangeEvent Kafka topic.
  3. Before launching each ETL ingestion pipeline, install or verify the required library versions as below.
    pip install --user -r requirements.txt
    
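As a quick sanity check for step 1, you can verify that the Kafka broker started by the Docker images is reachable before kicking off any pipeline. The sketch below assumes the broker is exposed at localhost:9092 (adjust to your setup):

    # Minimal broker reachability check (assumes localhost:9092).
    from confluent_kafka.admin import AdminClient

    admin = AdminClient({"bootstrap.servers": "localhost:9092"})
    metadata = admin.list_topics(timeout=5)  # raises KafkaException if the broker is unreachable
    print("Broker reachable; topics:", sorted(metadata.topics))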

MCE Producer/Consumer CLI

The mce_cli.py script provides a convenient way to produce a list of MCEs from a data file; each MCE in the data file must be on a single line. The script also supports consuming from the MetadataChangeEvent topic.

➜  python mce_cli.py --help
usage: mce_cli.py [-h] [-b BOOTSTRAP_SERVERS] [-s SCHEMA_REGISTRY]
                  [-d DATA_FILE] [-l SCHEMA_RECORD]
                  {produce,consume}

Client for producing/consuming MetadataChangeEvent

positional arguments:
  {produce,consume}     Execution mode (produce | consume)

optional arguments:
  -h, --help            show this help message and exit
  -b BOOTSTRAP_SERVERS  Kafka broker(s) (localhost[:port])
  -s SCHEMA_REGISTRY    Schema Registry (http(s)://localhost[:port]
  -l SCHEMA_RECORD      Avro schema record; required if running 'producer' mode
  -d DATA_FILE          MCE data file; required if running 'producer' mode
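
For example, to watch events flowing through the topic, you could run consume mode against a local broker and schema registry (the addresses below are assumptions; substitute your own):

➜  python mce_cli.py consume -b localhost:9092 -s http://localhost:8081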

Bootstrapping DataHub

To quickly ingest lots of sample data and see DataHub in action, run the mce_cli with the provided bootstrap data file:

➜  python mce_cli.py produce -d bootstrap_mce.dat
Producing MetadataChangeEvent records to topic MetadataChangeEvent. ^c to exit.
  MCE1: {"auditHeader": None, "proposedSnapshot": ("com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot", {"urn": "urn:li:corpuser:foo", "aspects": [{"active": True,"email": "foo@linkedin.com"}]}), "proposedDelta": None}
  MCE2: {"auditHeader": None, "proposedSnapshot": ("com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot", {"urn": "urn:li:corpuser:bar", "aspects": [{"active": False,"email": "bar@linkedin.com"}]}), "proposedDelta": None}
Flushing records...

This will bootstrap DataHub with sample datasets and sample users.
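
Each line of bootstrap_mce.dat is one MCE written as a Python literal, in the shape shown above. If you want to extend the sample data, a minimal sketch for appending your own record (the urn and email are made-up values) is:

    # Append one CorpUserSnapshot MCE to a data file, one record per line,
    # mirroring the literal format of bootstrap_mce.dat.
    mce = {
        "auditHeader": None,
        "proposedSnapshot": (
            "com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot",
            {"urn": "urn:li:corpuser:alice", "aspects": [{"active": True, "email": "alice@example.com"}]},
        ),
        "proposedDelta": None,
    }
    with open("my_mce.dat", "a") as f:
        f.write(repr(mce) + "\n")

The file can then be ingested the same way: python mce_cli.py produce -d my_mce.dat.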

Ingest metadata from LDAP to DataHub

The ldap_etl script provides an ETL channel to communicate with your LDAP server.

➜  Configure the LDAP server variables in the file.
    LDAPSERVER    # Your LDAP server host.
    BASEDN        # Base DN used as the container location.
    LDAPUSER      # Your bind credential.
    LDAPPASSWORD  # Your password.
    PAGESIZE      # Pagination size.
    ATTRLIST      # Attributes to return, mapped to your model.
    SEARCHFILTER  # Filter used to build the search query.

➜  Configure the Kafka broker variables in the file.
    AVROLOADPATH   # Path to your model event schema in Avro format.
    KAFKATOPIC     # Your event topic.
    BOOTSTRAP      # Kafka bootstrap server.
    SCHEMAREGISTRY # Kafka schema registry host.

➜  python ldap_etl.py

This will bootstrap DataHub with the metadata in your LDAP server as user entities.
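
As a concrete illustration, the variables at the top of ldap_etl.py might be filled in as follows. Every value here is a placeholder for a hypothetical directory and a local deployment, not a default shipped with the script:

    # Hypothetical example values for the variables listed above.
    LDAPSERVER   = "ldaps://ldap.example.com"
    BASEDN       = "ou=People,dc=example,dc=com"
    LDAPUSER     = "cn=datahub-svc,ou=ServiceAccounts,dc=example,dc=com"
    LDAPPASSWORD = "change-me"
    PAGESIZE     = 1000
    ATTRLIST     = ["cn", "mail", "departmentNumber", "manager"]
    SEARCHFILTER = "(&(objectClass=person)(mail=*))"

    # Kafka side; these four variables repeat in every ETL script in this directory.
    AVROLOADPATH   = "path/to/MetadataChangeEvent.avsc"  # the schema generated by the mxe-schemas build
    KAFKATOPIC     = "MetadataChangeEvent"
    BOOTSTRAP      = "localhost:9092"
    SCHEMAREGISTRY = "http://localhost:8081"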

Ingest metadata from Hive to DataHub

The hive_etl script provides an ETL channel to communicate with your Hive store.

➜  Configure the Hive store variables in the file.
    HIVESTORE      # Your Hive store host.

➜  Configure the Kafka broker variables in the file.
    AVROLOADPATH   # Path to your model event schema in Avro format.
    KAFKATOPIC     # Your event topic.
    BOOTSTRAP      # Kafka bootstrap server.
    SCHEMAREGISTRY # Kafka schema registry host.

➜  python hive_etl.py

This will bootstrap DataHub with the metadata in your Hive store as dataset entities.
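
The four Kafka-side variables are shared by all the ETL scripts in this directory: each one ends by publishing the MCE it built to the configured topic. A minimal sketch of how that producer is typically set up, assuming confluent-kafka's Avro producer (an assumption about the wiring, not a statement of how these scripts implement it):

    # Sketch of the shared publish step used after metadata has been extracted.
    from confluent_kafka import avro
    from confluent_kafka.avro import AvroProducer

    AVROLOADPATH   = "path/to/MetadataChangeEvent.avsc"  # schema generated by the mxe-schemas build
    KAFKATOPIC     = "MetadataChangeEvent"
    BOOTSTRAP      = "localhost:9092"                    # assumed local defaults
    SCHEMAREGISTRY = "http://localhost:8081"

    value_schema = avro.load(AVROLOADPATH)
    producer = AvroProducer(
        {"bootstrap.servers": BOOTSTRAP, "schema.registry.url": SCHEMAREGISTRY},
        default_value_schema=value_schema,
    )
    # Each script then hands over the MCE dict it built from its source system:
    #   producer.produce(topic=KAFKATOPIC, value=mce)
    #   producer.flush()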

Ingest metadata from Kafka to DataHub

The kafka_etl script provides an ETL channel to communicate with your Kafka cluster.

➜  Configure the Kafka cluster variables in the file.
    ZOOKEEPER      # Your ZooKeeper host.

➜  Configure the Kafka broker variables in the file.
    AVROLOADPATH   # Path to your model event schema in Avro format.
    KAFKATOPIC     # Your event topic.
    BOOTSTRAP      # Kafka bootstrap server.
    SCHEMAREGISTRY # Kafka schema registry host.

➜  python kafka_etl.py

This will bootstrap DataHub with the metadata in your Kafka cluster as dataset entities.
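
To make the ZOOKEEPER setting concrete: a ZooKeeper-managed Kafka cluster registers its topics under /brokers/topics, so the topic inventory can be read from there. The sketch below shows that lookup with the kazoo client; it is an illustration rather than a description of kafka_etl.py's internals:

    # List the Kafka topics registered in ZooKeeper (ZooKeeper-managed clusters only).
    from kazoo.client import KazooClient

    ZOOKEEPER = "localhost:2181"  # hypothetical value for the variable above

    zk = KazooClient(hosts=ZOOKEEPER)
    zk.start()
    topics = zk.get_children("/brokers/topics")  # Kafka stores one child node per topic
    zk.stop()
    print(topics)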

Ingest metadata from MySQL to DataHub

The mysql_etl script provides an ETL channel to communicate with your MySQL server.

➜  Configure the MySQL connection variables in the file.
    HOST           # Your server host.
    DATABASE       # Target database.
    USER           # Your user account.
    PASSWORD       # Your password.

➜  Configure the Kafka broker variables in the file.
    AVROLOADPATH   # Path to your model event schema in Avro format.
    KAFKATOPIC     # Your event topic.
    BOOTSTRAP      # Kafka bootstrap server.
    SCHEMAREGISTRY # Kafka schema registry host.

➜  python mysql_etl.py

This will bootstrap DataHub with the metadata in your MySQL server as dataset entities.
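
The connection variables above are enough to read schema metadata straight out of MySQL's information_schema. The sketch below shows one way to do that with PyMySQL; it illustrates the kind of metadata a dataset MCE is built from and is not necessarily how mysql_etl.py is implemented:

    # Pull table and column metadata from MySQL's information_schema.
    import pymysql

    # Hypothetical values for the variables above.
    HOST, DATABASE, USER, PASSWORD = "mysql.example.com", "sales", "datahub", "change-me"

    conn = pymysql.connect(host=HOST, user=USER, password=PASSWORD, database=DATABASE)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT table_name, column_name, data_type "
            "FROM information_schema.columns "
            "WHERE table_schema = %s ORDER BY table_name, ordinal_position",
            (DATABASE,),
        )
        for table, column, data_type in cur.fetchall():
            print(table, column, data_type)
    conn.close()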

Ingest metadata from RDBMS to DataHub

The rdbms_etl script provides an ETL channel to communicate with your RDBMS.

  • Currently supports IBM DB2, Firebird, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, SQLite, and ODBC connections.
  • Some platform-specific logic is modularized and needs to be implemented for your particular setup (one possible approach is sketched after this section).

➜  Configure the RDBMS connection variables in the file.
    HOST           # Your server host.
    DATABASE       # Target database.
    USER           # Your user account.
    PASSWORD       # Your password.
    PORT           # Connection port.

➜  Configure the Kafka broker variables in the file.
    AVROLOADPATH   # Path to your model event schema in Avro format.
    KAFKATOPIC     # Your event topic.
    BOOTSTRAP      # Kafka bootstrap server.
    SCHEMAREGISTRY # Kafka schema registry host.

➜  python rdbms_etl.py

This will bootstrap DataHub with the metadata in your RDBMS as dataset entities.
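
One way to cover the listed platforms behind a single code path is to build a SQLAlchemy engine from HOST, PORT, USER, PASSWORD, and DATABASE and let the dialect-specific driver handle the rest. Whether rdbms_etl.py does this is an assumption, so treat the sketch below purely as an illustration of the platform-specific piece:

    # Enumerate tables and columns through SQLAlchemy's inspector.
    from sqlalchemy import create_engine, inspect

    # Hypothetical values for the variables above; swap the URL prefix per platform,
    # e.g. "postgresql://", "mysql+pymysql://", "mssql+pyodbc://", "oracle+cx_oracle://", "sqlite:///file.db".
    HOST, PORT, DATABASE, USER, PASSWORD = "db.example.com", 5432, "warehouse", "datahub", "change-me"
    engine = create_engine(f"postgresql://{USER}:{PASSWORD}@{HOST}:{PORT}/{DATABASE}")

    inspector = inspect(engine)
    for table in inspector.get_table_names():
        for column in inspector.get_columns(table):
            print(table, column["name"], column["type"])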