This repository contains the dependencies and scripts necessary to run `sqoop`, a data extraction tool for transferring data from relational databases to Hadoop, Hive, or Parquet. In this case, `sqoop` is used to export table dumps from iasWorld, the CCAO's system of record, to an HCatalog. The result is a set of partitioned and bucketed Parquet files which can be uploaded to AWS S3 and queried directly via AWS Athena.
The repository is organized as follows:

- `docker-config/` - Configuration and setup files for the Hadoop/Hive backend. Used during Docker build only
- `drivers/` - Mounted during run to provide connection drivers to `sqoop`. Put OJDBC files here (`ojdbc8.jar` or `ojdbc7.jar`)
- `logs/` - Location of temporary log files. Logs are manually uploaded to AWS CloudWatch after each run is complete
- `scripts/` - Runtime scripts to run `sqoop` jobs within Docker
- `secrets/` - Mounted during run to provide the DB password via a file. Alter `secrets/IPTS_PASSWORD` to contain your password
- `tables/` - Table definitions and metadata used to create Hive tables for `sqoop` to extract to. Stored manually since certain tables include partitioning and bucketing
- `target/` - Mounted during run as the output directory. All Parquet files and job artifacts are saved here temporarily before being uploaded to S3
- `Dockerfile` - Dockerfile to build `sqoop` and all dependencies from scratch if unavailable via the GitLab container registry
- `run.sh` - Main entrypoint script. Idempotent. Run with `sudo ./run.sh` to extract all iasWorld tables
- `docker-compose.yaml` - Defines the containers and environment needed to run `sqoop` jobs in a small, distributed Hadoop/Hive environment
- `.env` - Contains DB connection details. Alter before running to provide your own details
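Before the first run, a few of the mounted directories and files above need manual additions. For example (paths come from the list above; the password value is a placeholder):

```bash
# Provide the Oracle JDBC driver that sqoop will use
cp /path/to/ojdbc8.jar drivers/

# Store the iasWorld DB password where the containers expect it
echo 'example-password' > secrets/IPTS_PASSWORD

# Fill in your own DB connection details before running
nano .env
```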
You will need the following tools installed before using this repo:
- Docker
- Docker Compose
- AWS CLI v2 - Authenticated using `aws configure`
- moreutils - For the `ts` timestamp command
- jq - To parse logs to JSON
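On a Debian/Ubuntu host (an assumption; use your distribution's package manager otherwise), the non-Docker prerequisites can be set up roughly like this:

```bash
# moreutils provides the ts command; jq parses logs to JSON
sudo apt-get update && sudo apt-get install -y moreutils jq

# Install the AWS CLI v2 per AWS's own instructions, then authenticate
aws configure
```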
The rest of the dependencies for `sqoop` are installed using the included `Dockerfile`. To retrieve them, run either of the following commands within the repo directory:

- `docker-compose pull` - Grabs the latest image from the CCAO GitHub registry, if it exists
- `docker-compose build` - Builds the `sqoop` image from the included `Dockerfile`
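If you simply want whichever image is available, one option is to try the registry first and fall back to a local build:

```bash
# Pull the prebuilt image if it exists; otherwise build it from the Dockerfile
docker-compose pull || docker-compose build
```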
If table schemas are altered in iasWorld (column type changes, new columns), then the associated table schema files need to be updated in order to extract the altered tables from iasWorld. To update the schema files:

- (Optional) If new tables have been added, they must be added to `tables/tables-list.csv`
- Change `/tmp/scripts/run-sqoop.sh` to `/tmp/scripts/get-tables.sh` in `docker-compose.yaml` (see the one-liner after this list)
- Run `docker compose up` and wait for the schema files (`tables/$TABLE.sql`) to update
- Run `./update-tables.sh` to add bucketing and partitioning to the table schemas
- Update the cron job in the README with any new tables, as well as the actual cron job using `sudo crontab -e`
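A possible one-liner for the `docker-compose.yaml` change above (assuming the script path appears verbatim in the file; remember to revert it once the schema files are regenerated):

```bash
# Temporarily point the container at the schema-dumping script instead of the extraction script
sed -i 's|/tmp/scripts/run-sqoop.sh|/tmp/scripts/get-tables.sh|' docker-compose.yaml
```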
Nearly all the functionality of this repository is contained in `run.sh`. This script will complete four main tasks:

- Extract the specified tables from iasWorld and save them to the `target/` directory as Parquet
- Remove any existing files on S3 for the extracted tables
- Upload the extracted Parquet files to S3
- Upload a logstream of the extraction and uploading process to CloudWatch
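A highly simplified sketch of that flow is shown below. It is not the actual contents of `run.sh`; the service, bucket, log group, username, and connection string are placeholders, and the real script handles defaults, `TAXYR` conditions, and log formatting (via `ts` and `jq`) more carefully.

```bash
#!/bin/bash
# Illustrative sketch only -- all names below are placeholders
for TABLE in $IPTS_TABLE; do
    # 1. Extract the table from iasWorld into target/ via its Hive/HCatalog table
    #    (TAXYR conditions such as ASMT_HIST>2019 become a --where clause in the real script)
    docker-compose run sqoop sqoop import \
        --connect "jdbc:oracle:thin:@//example-host:1521/iasworld" \
        --username "example_user" \
        --password-file file:///tmp/secrets/IPTS_PASSWORD \
        --table "IASWORLD.${TABLE}" \
        --hcatalog-table "${TABLE,,}"

    # 2. Remove any existing files on S3 for this table
    aws s3 rm "s3://example-bucket/iasworld/${TABLE,,}/" --recursive

    # 3. Upload the freshly extracted Parquet files
    aws s3 sync "target/${TABLE,,}/" "s3://example-bucket/iasworld/${TABLE,,}/"
done

# 4. Ship the run's logs to CloudWatch
aws logs put-log-events \
    --log-group-name "example-log-group" \
    --log-stream-name "sqoop-$(date '+%Y-%m-%d')" \
    --log-events file://logs/events.json
```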
By default, `sudo ./run.sh` will export all tables in iasWorld to `target/` (and then to S3). To extract a specific table or tables, prefix the run command with the environment variable `IPTS_TABLE`. For example, `sudo IPTS_TABLE="ASMT_HIST CNAME" ./run.sh` will grab the `ASMT_HIST` and `CNAME` tables.

You can also specify a `TAXYR` within `IPTS_TABLE` using conditional symbols. For example, `sudo IPTS_TABLE="ASMT_HIST>2019 ADDRINDX=2020" ./run.sh` will get only records with a `TAXYR` greater than 2019 for `ASMT_HIST` and only records with a `TAXYR` equal to 2020 for `ADDRINDX`. Only `=`, `<`, and `>` are allowed as conditional operators.
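Put together, typical invocations look like this:

```bash
# Extract every iasWorld table
sudo ./run.sh

# Extract only the ASMT_HIST and CNAME tables
sudo IPTS_TABLE="ASMT_HIST CNAME" ./run.sh

# Extract ASMT_HIST records with TAXYR > 2019 and ADDRINDX records with TAXYR = 2020
sudo IPTS_TABLE="ASMT_HIST>2019 ADDRINDX=2020" ./run.sh
```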
Table extractions are scheduled via `cron`. To edit the schedule file, use `sudo crontab -e`. The example below schedules daily jobs for frequently updated tables and weekly ones for rarely-updated tables.
```bash
# Extract recent years from frequently used tables on weekdays at 1 AM CST
0 6 * * 1,2,3,4,5 cd /local/path/to/repo && YEAR="$(($(date '+\%Y') - 2))" IPTS_TABLE="ADDN>$YEAR APRVAL>$YEAR ASMT_HIST>$YEAR ASMT_ALL>$YEAR COMDAT>$YEAR CVLEG>$YEAR DWELDAT>$YEAR ENTER HTPAR>$YEAR LEGDAT>$YEAR OBY>$YEAR OWNDAT>$YEAR PARDAT>$YEAR PERMIT SALES SPLCOM>$YEAR" /bin/bash ./run.sh

# Extract all tables except for ASMT_ALL and ASMT_HIST on Saturday at 1 AM CST
0 6 * * 6 cd /local/path/to/repo && IPTS_TABLE="AASYSJUR ADDN ADDRINDX APRVAL CNAME COMDAT COMFEAT COMINTEXT COMNT COMNT3 CVLEG CVOWN CVTRAN DEDIT DWELDAT ENTER EXADMN EXAPP EXCODE EXDET HTAGNT HTDATES HTPAR LAND LEGDAT LPMOD LPNBHD OBY OWNDAT PARDAT RCOBY PERMIT SALES SPLCOM VALCLASS" /bin/bash ./run.sh
```
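Two notes on these entries: the `%` in `date '+\%Y'` is escaped because cron treats a bare `%` as end-of-command, and the `YEAR` variable simply computes "two years before the current year" so that only recent `TAXYR` values are re-extracted. For example:

```bash
# Outside of crontab the backslash is not needed
YEAR="$(($(date '+%Y') - 2))"   # e.g., 2022 when run in 2024
# An entry like ASMT_HIST>$YEAR then expands to ASMT_HIST>2022, i.e. TAXYR > 2022
```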
- Sqoop User Guide v1.4.7
- Hadoop Cluster Setup Guide
- How to install and setup a 3-node Hadoop cluster
- Setup a distributed Hadoop cluster with Docker
- Tips for Docker/Hadoop cluster setup
- Sqoop Docker image
- Hadoop Docker image - Sqoop image uses this as a dependency
- Hadoop cluster Docker image - Useful setup and options for pseudo-distributed Hadoop
- SO post on the `--bindir` sqoop option
- SO post on the `--map-column-java` option
- On generating strings for `--map-column-java` using shell and awk
- Post on java security options that cause `Connection reset` errors
- Data nodes not connected/listed by Hadoop
- Oracle JDBC connection issues
- HCatalog Data Types
- Java to Oracle Type Mapping Matrix
- PL/SQL (Oracle SQL) to JDBC mapping - Not applicable here but still helpful