Hive Glue Catalog Sync Agent

The Hive Glue Catalog Sync Agent is a software module that can be installed and configured within a Hive Metastore server, and provides outbound synchronisation to the AWS Glue Data Catalog. This lets you create objects in the AWS Glue Data Catalog as they are created within your existing Hadoop/Hive environment, without additional operational tasks.

This project provides a jar that extends Hive's MetaStoreEventListener class to capture all create and drop events for tables and partitions in your Hive Metastore. It then connects to Amazon Athena in your AWS account and runs the same commands against the Glue Data Catalog, keeping the catalog synchronised over time. Your Hive Metastore and YARN cluster can be anywhere: in the cloud or in your own data center.

Within the HiveGlueCatalogSyncAgent, the DDL from metastore events is captured and written to a ConcurrentLinkedQueue.

This queue is then drained by a separate thread that writes DDL events to Amazon Athena via a JDBC connection. This architecture ensures that if your YARN cluster becomes disconnected from the cloud for some reason, catalog events will not be dropped.
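The queue-and-drain pattern described above can be sketched as follows. This is a minimal illustration, not the agent's actual source: the class DdlQueueSketch, its method names, and the retry-on-failure behaviour are assumptions, and a plain Consumer<String> stands in for the Athena JDBC call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Illustrative sketch only; all names here are hypothetical.
final class DdlQueueSketch {
    // Listener callbacks add DDL here; a single background thread drains it.
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
    // Stand-in for "execute this statement on Athena over JDBC".
    private final Consumer<String> sink;
    private volatile boolean running = true;

    DdlQueueSketch(Consumer<String> sink) {
        this.sink = sink;
    }

    // Would be called from the metastore event callbacks.
    void enqueue(String ddl) {
        pending.add(ddl);
    }

    // Sends queued statements in order. On the first failure it stops and
    // leaves that statement at the head of the queue for the next pass
    // (assumed behaviour). Safe only with a single draining thread.
    int drainOnce() {
        int sent = 0;
        String ddl;
        while ((ddl = pending.peek()) != null) {
            try {
                sink.accept(ddl);
            } catch (RuntimeException e) {
                break;
            }
            pending.poll();
            sent++;
        }
        return sent;
    }

    int pendingCount() {
        return pending.size();
    }

    // Starts the separate daemon thread that drains the queue periodically.
    Thread startDrainer(long pauseMillis) {
        Thread t = new Thread(() -> {
            while (running) {
                drainOnce();
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }, "ddl-drainer");
        t.setDaemon(true);
        t.start();
        return t;
    }

    void stop() {
        running = false;
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        boolean[] online = {false};
        DdlQueueSketch q = new DdlQueueSketch(ddl -> {
            if (!online[0]) throw new IllegalStateException("Athena unreachable");
            sent.add(ddl);
        });
        q.enqueue("DROP TABLE IF EXISTS demo.t");
        System.out.println(q.drainOnce() + " sent, " + q.pendingCount() + " pending"); // 0 sent, 1 pending
        online[0] = true;
        System.out.println(q.drainOnce() + " sent, " + q.pendingCount() + " pending"); // 1 sent, 0 pending
    }
}
```

The main method drives drainOnce directly so the result is deterministic; in a long-running process startDrainer would be used instead. Note that the queue lives in memory, so this sketch only retains events while the process keeps running.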

Supported Events

Today the Catalog Sync Agent supports the following MetaStore events:

CreateTable
AddPartition
DropTable
DropPartition
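
To give a feel for what the replayed commands might look like, the sketch below builds DDL strings for three of the four events using statement forms that Athena accepts. The exact statements the agent generates are not shown in this document, so the class EventDdlSketch, its method names, and the IF EXISTS / IF NOT EXISTS choices are assumptions. CreateTable is omitted because it carries the full table definition.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Illustrative sketch only; not the agent's actual DDL generation.
final class EventDdlSketch {
    // Renders partition keys as (k1='v1', k2='v2').
    // Values are assumed not to contain quote characters.
    static String partitionSpec(Map<String, String> keyValues) {
        StringJoiner spec = new StringJoiner(", ", "(", ")");
        keyValues.forEach((k, v) -> spec.add(k + "='" + v + "'"));
        return spec.toString();
    }

    // AddPartition event.
    static String addPartition(String table, Map<String, String> spec, String location) {
        return "ALTER TABLE " + table + " ADD IF NOT EXISTS PARTITION "
                + partitionSpec(spec) + " LOCATION '" + location + "'";
    }

    // DropPartition event.
    static String dropPartition(String table, Map<String, String> spec) {
        return "ALTER TABLE " + table + " DROP IF EXISTS PARTITION " + partitionSpec(spec);
    }

    // DropTable event.
    static String dropTable(String table) {
        return "DROP TABLE IF EXISTS " + table;
    }

    public static void main(String[] args) {
        // Hypothetical table, partition, and bucket names.
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("dt", "2020-01-01");
        System.out.println(addPartition("sales.orders", spec,
                "s3://example-bucket/orders/dt=2020-01-01/"));
        System.out.println(dropPartition("sales.orders", spec));
        System.out.println(dropTable("sales.orders"));
    }
}
```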

Installation

You can build the software yourself by configuring Maven and running mvn package, which builds the binary to aws-glue-catalog-sync-agent/target/HiveGlueCatalogSyncAgent-1.1-SNAPSHOT.jar. Alternatively, you can download the jar from s3://awslabs-code-us-east-1/HiveGlueCatalogSyncAgent/HiveGlueCatalogSyncAgent-1.1-SNAPSHOT.jar (808ace2165025c9da7288d0caa3e6b91).

You can also run mvn assembly:assembly, which generates a single jar that bundles the dependencies, at aws-glue-catalog-sync-agent/target/HiveGlueCatalogSyncAgent-1.1-SNAPSHOT-jar-with-dependencies.jar. A prebuilt copy of this jar is also available for download (ff54e3993d7add705840661c7ab048c2).

Required Dependencies

If you install the HiveGlueCatalogSyncAgent-1.1-SNAPSHOT.jar into your cluster, you must also ensure you have the following dependencies available:

Log4J
SLF4J
AWS Java SDK Core
org.antlr.stringtemplate
com.amazonaws.athena.jdbc.atl-athena-jdbc-driver, version 1.0.2-atlassian-1 or higher
AWS Java SDK CloudWatch Logs

Configuration Instructions

S3

Create or choose a bucket (and a prefix) where results from Athena will be stored. You'll need to update the IAM policy below with this bucket.

IAM

First, create a new IAM policy with the following permissions (replace <my-bucket> with your bucket):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:DeleteDatabase",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:UpdateDatabase",
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:BatchDeleteTable",
                "glue:UpdateTable",
                "glue:GetTable",
                "glue:GetTables",
                "glue:BatchCreatePartition",
                "glue:CreatePartition",
                "glue:DeletePartition",
                "glue:BatchDeletePartition",
                "glue:UpdatePartition",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "athena:*",
                "logs:CreateLogGroup"
            ],
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload",
                "s3:CreateBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::<my-bucket>",
                "arn:aws:s3:::<my-bucket>/*"
            ]
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:log-group:HIVE_METADATA_SYNC:*:*"
        }
    ]
}

Then:

Create an IAM role and attach the policy to it.
If your Hive Metastore runs on EC2, attach the IAM role to this instance. Otherwise, create an IAM user with the same policy and generate an access key and secret key.

Hive Configuration

Add the following keys to hive-site.xml:

hive.metastore.event.listeners - com.amazonaws.services.glue.catalog.HiveGlueCatalogSyncAgent
glue.catalog.athena.jdbc.url - The URL to use to connect to Athena (default: jdbc:awsathena://athena.us-east-1.amazonaws.com:443)
glue.catalog.athena.s3.staging.dir - The bucket and prefix used to store Athena's query results
glue.catalog.user.key - If not using an instance-attached IAM role, the IAM access key
glue.catalog.user.secret - If not using an instance-attached IAM role, the IAM secret key
glue.catalog.dropTableIfExists - Whether an already existing table should be dropped and recreated (default: true)
glue.catalog.createMissingDB - Whether databases should be created if they don't exist (default: true)
glue.catalog.athena.suppressAllDropEvents - Prevents propagation of DropTable and DropPartition events to the remote environment
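
The following sketch shows how these keys and their documented defaults could be resolved. It is not the agent's own configuration code: java.util.Properties stands in for the Hive configuration object, the class SyncAgentConfigSketch and its fields are hypothetical, and the false default for suppressAllDropEvents is an assumption, since no default is stated above.

```java
import java.util.Properties;

// Illustrative sketch only; Properties stands in for Hive's configuration.
final class SyncAgentConfigSketch {
    static final String DEFAULT_JDBC_URL =
            "jdbc:awsathena://athena.us-east-1.amazonaws.com:443";

    final String jdbcUrl;
    final String stagingDir;            // null if not configured
    final boolean dropTableIfExists;
    final boolean createMissingDB;
    final boolean suppressAllDropEvents;
    final boolean useStaticCredentials; // false means: rely on the instance role

    SyncAgentConfigSketch(Properties p) {
        jdbcUrl = p.getProperty("glue.catalog.athena.jdbc.url", DEFAULT_JDBC_URL);
        stagingDir = p.getProperty("glue.catalog.athena.s3.staging.dir");
        dropTableIfExists = flag(p, "glue.catalog.dropTableIfExists", true);
        createMissingDB = flag(p, "glue.catalog.createMissingDB", true);
        // Assumption: the document does not state a default for this key.
        suppressAllDropEvents = flag(p, "glue.catalog.athena.suppressAllDropEvents", false);
        // Static credentials are used only when both key and secret are set.
        useStaticCredentials = p.getProperty("glue.catalog.user.key") != null
                && p.getProperty("glue.catalog.user.secret") != null;
    }

    // Reads a boolean key, falling back to the given default when absent.
    static boolean flag(Properties p, String key, boolean defaultValue) {
        return Boolean.parseBoolean(p.getProperty(key, String.valueOf(defaultValue)));
    }

    // Builds a JDBC URL for another region, following the default's pattern.
    static String jdbcUrlForRegion(String region) {
        return "jdbc:awsathena://athena." + region + ".amazonaws.com:443";
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        // Hypothetical bucket and prefix.
        p.setProperty("glue.catalog.athena.s3.staging.dir", "s3://example-bucket/athena-results/");
        p.setProperty("glue.catalog.athena.jdbc.url", jdbcUrlForRegion("eu-west-1"));
        SyncAgentConfigSketch c = new SyncAgentConfigSketch(p);
        System.out.println(c.jdbcUrl);
        System.out.println("dropTableIfExists=" + c.dropTableIfExists
                + ", createMissingDB=" + c.createMissingDB
                + ", staticCredentials=" + c.useStaticCredentials);
    }
}
```

In hive-site.xml itself, each key would be set as an ordinary property name/value pair.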

Add the Glue Sync Agent's jar to the Hive Metastore (HMS) classpath and restart the metastore.

You should then see newly created external tables and partitions replicated to the Glue Data Catalog, and the agent's logs in CloudWatch Logs.

Apache 2.0 Software License

See LICENSE for details.
