Follow the instructions in Quick Start to start a YugabyteDB cluster.
The dataset emulates a simple IoT workload that consists of two tables:
CREATE TABLE host (
    id int PRIMARY KEY,
    host_name text,
    location jsonb
);
CREATE TABLE host_data (
    date timestamptz NOT NULL,
    host_id int NOT NULL,
    cpu double precision,
    tempc int,
    status text
);
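To try the schema with a few rows, a small generator can help. The sketch below is not part of the repository: `gen_sample_sql` is a hypothetical helper, and the host names, `rack` attribute, and readings are made-up sample values chosen only to match the column types above.

```shell
# Hypothetical helper: prints INSERT statements for the host and host_data tables.
# All values are invented sample data, not taken from the original dataset.
gen_sample_sql() {
  hosts="${1:-2}"   # number of hosts to generate (defaults to 2)
  i=1
  while [ "$i" -le "$hosts" ]; do
    # One host row with a small JSON document for the jsonb column.
    printf "INSERT INTO host (id, host_name, location) VALUES (%d, 'host_%d', '{\"rack\": %d}');\n" "$i" "$i" "$i"
    # One reading for that host, timestamped by the database with now().
    printf "INSERT INTO host_data (date, host_id, cpu, tempc, status) VALUES (now(), %d, %d.5, %d, 'ok');\n" "$i" "$((10 * i))" "$((40 + i))"
    i=$((i + 1))
  done
}
```

Once the tables exist, the output can be written to a file (for example `gen_sample_sql 3 > sample_data.sql`, where `sample_data.sql` is an arbitrary name) and loaded with `ysqlsh -f sample_data.sql`.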
Use ysqlsh to set up the tables:
# Store YugabyteDB Master IP addresses in an env variable
export MASTER_ADDRESSES=<list of addresses>
# Save the IP address of a Yugabyte master in an env variable
export PGHOST=<MASTER IP ADDRESS>
# Drop if required.
ysqlsh -f drop_schema.sql
# Create tables and functions.
ysqlsh -f create_schema.sql
# Set up a CDCSDK stream.
yb-admin create_change_data_stream ysql.yugabyte --master_addresses "$MASTER_ADDRESSES"
# Save the output of the previous command in an env variable.
export CDC_SDK_STREAM_ID=<id from previous command>
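Copying the stream ID by hand is easy to get wrong, so it can be captured from the command output instead. The sketch below assumes `yb-admin` prints a line of the form `CDC Stream ID: <id>` with a hexadecimal ID; that format is an assumption to check against your version. `extract_stream_id` is a hypothetical helper name.

```shell
# Hypothetical helper: reads yb-admin output on stdin and prints only the stream ID.
# Assumes a line shaped like "CDC Stream ID: <hex id>"; adjust the pattern if
# your yb-admin version prints something different.
extract_stream_id() {
  sed -n 's/^CDC Stream ID:[[:space:]]*\([0-9a-fA-F]\{1,\}\)[[:space:]]*$/\1/p'
}
```

With the helper defined, the export could look like `export CDC_SDK_STREAM_ID="$(yb-admin create_change_data_stream ysql.yugabyte --master_addresses "$MASTER_ADDRESSES" | extract_stream_id)"`. An empty value means no matching line was found, so check the variable before continuing.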
Use an existing installation of Apache Kafka or use Confluent Cloud.
OR
Follow instructions in Quick Start to start up the Confluent Platform using Docker and Docker Compose.
In Confluent Control Center, create two Kafka topics:
- iceberg.public.host
- iceberg.public.host_data
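If you prefer the command line to the web UI, the same two topics can be created with the Kafka topics tool. This is a sketch under assumptions: `create_topics`, `KAFKA_TOPICS_BIN`, `PARTITIONS`, and `REPLICATION_FACTOR` are hypothetical names introduced here, the defaults of 1 suit only a single-broker test setup, and `BOOTSTRAP_SERVERS` is the variable exported in the next step.

```shell
# Hypothetical helper: creates the two topics listed above.
# KAFKA_TOPICS_BIN is an assumed override; Confluent Platform ships the tool as
# `kafka-topics`, while Apache Kafka ships it as `kafka-topics.sh`.
create_topics() {
  for topic in iceberg.public.host iceberg.public.host_data; do
    "${KAFKA_TOPICS_BIN:-kafka-topics}" \
      --bootstrap-server "$BOOTSTRAP_SERVERS" \
      --create --if-not-exists \
      --topic "$topic" \
      --partitions "${PARTITIONS:-1}" \
      --replication-factor "${REPLICATION_FACTOR:-1}"
  done
}
```

Run `create_topics` after `BOOTSTRAP_SERVERS` is set. Clusters that require authentication, such as Confluent Cloud, typically need additional client configuration that this sketch does not cover.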
# Store parameters in env variables:
export BOOTSTRAP_SERVERS=<Kafka Bootstrap Server IP Addresses>
export AWS_ACCESS_KEY_ID=<AWS Access Key ID>
export AWS_SECRET_ACCESS_KEY=<AWS Secret Access Key>
export AWS_SESSION_TOKEN=<AWS Session Token>
export AWS_REGION=<AWS Region of the S3 Bucket>
docker compose -f connect.yaml up -d
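A container started with a missing variable tends to fail later with a less obvious error, so it may be worth checking the environment before running `docker compose`. The following is a minimal sketch; `require_env` is a hypothetical helper, not something shipped with this example.

```shell
# Hypothetical helper: returns non-zero and names each variable that is unset or empty.
# Variable names are expanded with eval, so pass only trusted, literal names.
require_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example: `require_env BOOTSTRAP_SERVERS AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN AWS_REGION && docker compose -f connect.yaml up -d`.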
In this example, Apache Iceberg uses AWS Glue as the table catalog. Install and configure the AWS CLI
OR
Use the AWS Console to set up AWS S3, AWS Glue, and AWS Athena.
# Create a database
aws glue create-database --database-input "{\"Name\":\"yb_cdc\"}" --endpoint-url https://glue.${AWS_REGION}.amazonaws.com
export S3_PATH=<Path to an S3 directory, e.g. s3://example/iceberg/>
The script sets up two connectors:
- replicate: Replicates the tables in YugabyteDB by applying inserts, updates, and deletes.
- trace: Appends a row for each insert, update, and delete instead of applying the change.
Replicate tables are named:
- public_host
- public_host_data
Trace tables are named:
- trace_public_host
- trace_public_host_data
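The table names listed above appear to follow from the topic names: drop the `iceberg.` prefix, replace dots with underscores, and add `trace_` for the trace connector. The sketch below only illustrates that pattern as inferred from the names in this document; the actual naming is decided by the connector configuration, and `iceberg_table_names` is a hypothetical helper.

```shell
# Hypothetical helper: prints the replicate and trace table names for a topic.
# The mapping is inferred from the names listed above, not read from the connectors.
iceberg_table_names() {
  topic="$1"
  # Strip the "iceberg." prefix, then turn "public.host" into "public_host".
  base="$(printf '%s' "${topic#iceberg.}" | tr '.' '_')"
  printf '%s\ntrace_%s\n' "$base" "$base"
}
```

For instance, `iceberg_table_names iceberg.public.host` prints `public_host` followed by `trace_public_host`, which are the names to look for in the `yb_cdc` database.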
./iceberg-sink.sh
./yb-connect.sh