
Sqoop Extractor for iasWorld

This repository contains the dependencies and scripts necessary to run sqoop, a data extraction tool for transferring data from relational databases into Hadoop (e.g. as Hive tables or Parquet files).

In this case, sqoop is used to export table dumps from iasWorld, the CCAO's system of record, to an HCatalog. The result is a set of partitioned and bucketed Parquet files which can be uploaded to AWS S3 and queried directly via AWS Athena.
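
For example, once a table has been uploaded and registered, it can be queried from the command line with the AWS CLI. This is only a sketch; the database name, table name, results bucket, and partition column type below are assumptions, not values defined by this repo:

```bash
# Hypothetical Athena query against an extracted table; adjust names to your setup.
# Assumes a database named "iasworld", a table named "pardat", and a string TAXYR partition.
aws athena start-query-execution \
  --query-execution-context Database=iasworld \
  --result-configuration OutputLocation=s3://your-athena-results-bucket/ \
  --query-string "SELECT COUNT(*) FROM pardat WHERE taxyr = '2020'"
```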

Structure

Directories

  • docker-config/ - Configuration and setup files for the Hadoop/Hive backend. Used during Docker build only
  • drivers/ - Mounted during run to provide connection drivers to sqoop. Put OJDBC files here (ojdbc8.jar or ojdbc7.jar)
  • logs/ - Location of temporary log files. Logs are manually uploaded to AWS CloudWatch after each run is complete
  • scripts/ - Runtime scripts to run sqoop jobs within Docker
  • secrets/ - Mounted during run to provide the DB password via a file. Alter secrets/IPTS_PASSWORD to contain your password (see the example after this list)
  • tables/ - Table definitions and metadata used to create Hive tables for sqoop to extract to. Manually stored since certain tables include partitioning and bucketing
  • target/ - Mounted during run as output directory. All parquet files and job artifacts are saved temporarily before being uploaded to S3
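
For example, the two mounted inputs above might be prepared like so (the jar path and password value are placeholders):

```bash
# Drop the Oracle JDBC driver into the mounted drivers/ directory
cp /path/to/ojdbc8.jar drivers/

# Write the iasWorld DB password to the mounted secrets file (no trailing newline)
printf '%s' 'your-db-password' > secrets/IPTS_PASSWORD
```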

Important Files

  • Dockerfile - Dockerfile to build sqoop and all dependencies from scratch if unavailable via the GitLab container registry
  • run.sh - Main entrypoint script. Idempotent. Run with sudo ./run.sh to extract all iasWorld tables.
  • docker-compose.yaml - Defines the containers and environment needed to run sqoop jobs in a small, distributed Hadoop/Hive environment
  • .env - Contains DB connection details. Alter before running to provide your own details (a hypothetical sketch follows this list)
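
As a purely illustrative sketch of what .env might contain (the variable names below are hypothetical; use the keys actually present in this repo's .env):

```bash
# Hypothetical .env contents; the real keys are defined in this repo's .env file.
# The password itself lives in secrets/IPTS_PASSWORD, not here.
IPTS_HOSTNAME=iasworld-db.example.com
IPTS_PORT=1521
IPTS_SERVICE_NAME=IASWORLD
IPTS_USERNAME=your_username
```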

Usage

Dependencies

You will need the following tools installed before using this repo:

  • Docker - used to build and run the containers in this repo
  • Docker Compose - used to orchestrate the Hadoop/Hive containers defined in docker-compose.yaml
  • The AWS CLI - configured with credentials that can write to the target S3 bucket and CloudWatch log group

The rest of the dependencies for sqoop are installed using the included Dockerfile. To retrieve them, run either of the following commands within the repo directory:

  • docker-compose pull - Grabs the latest image from the CCAO GitHub registry, if it exists
  • docker-compose build - Builds the sqoop image from the included Dockerfile
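
If you're unsure which to run, one simple approach is to try the pull first and fall back to a local build:

```bash
# Use the prebuilt image when available; otherwise build it from the Dockerfile
docker-compose pull || docker-compose build
```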

Update table schemas

If table schemas are altered in iasWorld (e.g. a column type changes or new columns are added), then the associated schema files in tables/ need to be updated before the altered tables can be extracted. To update the schema files (a command sketch follows these steps):

  1. (Optional) If new tables have been added, they must be added to tables/tables-list.csv
  2. Change /tmp/scripts/run-sqoop.sh to /tmp/scripts/get-tables.sh in docker-compose.yaml
  3. Run docker compose up and wait for the schema files (tables/$TABLE.sql) to update
  4. Run ./update-tables.sh to add bucketing and partitioning to the table schemas
  5. Update the example cron jobs in this README with any new tables, as well as the actual cron jobs on the server using sudo crontab -e
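
Putting the steps above together, a rough sketch of a schema refresh looks like the following; the final command, which restores the original entrypoint, is implied by the workflow rather than listed above:

```bash
# 1. (Optional) add any new tables to tables/tables-list.csv, then:

# 2. Point docker-compose at the schema-dumping script
sed -i 's|/tmp/scripts/run-sqoop.sh|/tmp/scripts/get-tables.sh|' docker-compose.yaml

# 3. Regenerate the tables/$TABLE.sql schema files
docker compose up

# 4. Re-apply bucketing and partitioning to the regenerated schemas
./update-tables.sh

# Restore the original sqoop entrypoint before the next scheduled run
sed -i 's|/tmp/scripts/get-tables.sh|/tmp/scripts/run-sqoop.sh|' docker-compose.yaml
```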

Export Tables

Nearly all the functionality of this repository is contained in run.sh. This script will complete four main tasks:

  1. Extract the specified tables from iasWorld and save them to the target/ directory as Parquet
  2. Remove any existing files on S3 for the extracted tables
  3. Upload the extracted Parquet files to S3
  4. Upload a logstream of the extraction and uploading process to CloudWatch

By default, sudo ./run.sh will export all tables in iasWorld to target/ (and then to S3). To extract a specific table or tables, prefix the run command with the environment variable IPTS_TABLE. For example, sudo IPTS_TABLE="ASMT_HIST CNAME" ./run.sh will extract only the ASMT_HIST and CNAME tables.

You can also specify a TAXYR within IPTS_TABLE using conditional symbols. For example, sudo IPTS_TABLE="ASMT_HIST>2019 ADDRINDX=2020" ./run.sh will get only records with a TAXYR greater than 2019 for ASMT_HIST and only records with a TAXYR equal to 2020 for ADDRINDX. Only =, <, and > are allowed as conditional operators.
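
For reference, the invocations described above as runnable commands:

```bash
# Export every iasWorld table to target/ and then to S3
sudo ./run.sh

# Export only the ASMT_HIST and CNAME tables
sudo IPTS_TABLE="ASMT_HIST CNAME" ./run.sh

# Export ASMT_HIST records with TAXYR > 2019 and ADDRINDX records with TAXYR = 2020
sudo IPTS_TABLE="ASMT_HIST>2019 ADDRINDX=2020" ./run.sh
```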

Scheduling

Table extractions are scheduled via cron. To edit the schedule file, use sudo crontab -e. The example below schedules daily jobs for frequently updated tables and weekly jobs for rarely updated tables.

# Extract recent years from frequently used tables on weekdays at 1 AM CST
0 6 * * 1,2,3,4,5 cd /local/path/to/repo && YEAR="$(($(date '+\%Y') - 2))" IPTS_TABLE="ADDN>$YEAR APRVAL>$YEAR ASMT_HIST>$YEAR ASMT_ALL>$YEAR COMDAT>$YEAR CVLEG>$YEAR DWELDAT>$YEAR ENTER HTPAR>$YEAR LEGDAT>$YEAR OBY>$YEAR OWNDAT>$YEAR PARDAT>$YEAR PERMIT SALES SPLCOM>$YEAR" /bin/bash ./run.sh

# Extract all tables except for ASMT_ALL and ASMT_HIST on Saturday at 1 AM CST
0 6 * * 6 cd /local/path/to/repo && IPTS_TABLE="AASYSJUR ADDN ADDRINDX APRVAL CNAME COMDAT COMFEAT COMINTEXT COMNT COMNT3 CVLEG CVOWN CVTRAN DEDIT DWELDAT ENTER EXADMN EXAPP EXCODE EXDET HTAGNT HTDATES HTPAR LAND LEGDAT LPMOD LPNBHD OBY OWNDAT PARDAT RCOBY PERMIT SALES SPLCOM VALCLASS" /bin/bash ./run.sh
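
Note that % must be written as \% inside a crontab entry, since cron otherwise treats a bare % as a newline. Outside of cron, the YEAR expression from the first job can be tested directly:

```bash
# Current year minus two, e.g. 2023 when run in 2025
YEAR="$(($(date '+%Y') - 2))"
echo "$YEAR"
```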

Useful Resources

Documentation and Guides

Docker Resources

Hive/HCatalog Resources

Debugging

Data Type Mappings (used to populate --map-column-hive)