This repository hosts a semi-automated pipeline that converts a messy SQLite database into a clean, reliable source for analytics. The pipeline includes:
- Data validation through unit tests
- Error logging
- Automatic changelog updates
- Production database refresh

To run the pipeline:

- Run `script.sh` and follow the prompts.
- If necessary, `script.sh` will execute `dev/cleanse_data.py` to validate and clean `dev/cademycode.db`.
- If errors occur during validation, they will be logged, and the process will terminate.
- Otherwise, `cleanse_data.py` will update the clean database and changelog.
- After a successful update, the new record count and update data will be written to `dev/changelog.md`.
- `script.sh` will check the changelog for updates and request permission to update the production database if needed.

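
The changelog step described above might look roughly like the sketch below. The entry format, the function names `count_rows` and `append_changelog_entry`, and the idea of appending one dated entry per run are assumptions for illustration; the actual layout of `dev/changelog.md` may differ.

```python
import sqlite3
from contextlib import closing
from datetime import date
from pathlib import Path


def count_rows(db_path, table: str) -> int:
    """Return the number of rows in `table`. Pass only trusted table names."""
    with closing(sqlite3.connect(str(db_path))) as con:
        return con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


def append_changelog_entry(changelog, new_rows: int, total_rows: int, today=None) -> str:
    """Append a dated entry (hypothetical format) to the changelog file."""
    today = today or date.today()
    entry = (
        f"## {today.isoformat()}\n"
        f"- New rows added: {new_rows}\n"
        f"- Total rows: {total_rows}\n\n"
    )
    with open(Path(changelog), "a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry
```

Comparing `count_rows` before and after a cleansing run gives the `new_rows` value to record.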
To run the script on the updated database, rename `dev/cademycode_updated.db` to `dev/cademycode.db`.
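
Because the rename overwrites the original raw database, it may be worth keeping a copy first. The sketch below shows one way to do that; the function name `swap_in_updated_db` and the backup file name `cademycode_original.db` are hypothetical and not part of the repository.

```python
import shutil
from pathlib import Path


def swap_in_updated_db(dev_dir) -> Path:
    """Back up cademycode.db, then move cademycode_updated.db into its place."""
    dev = Path(dev_dir)
    current = dev / "cademycode.db"
    updated = dev / "cademycode_updated.db"
    backup = dev / "cademycode_original.db"  # hypothetical backup name

    if not updated.exists():
        raise FileNotFoundError(updated)
    if current.exists():
        shutil.copy2(current, backup)  # keep the original raw data
    updated.replace(current)  # overwrites `current` if it exists
    return current
```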

The repository contains:

- `script.sh`: A bash script to run the data cleanser and move files to `/prod`.
- `dev/`:
  - `changelog.md`: Automatically updated with each run, logging new records and tracking missing data.
  - `cleanse_data.py`: Runs unit tests and data cleansing on `cademycode.db`.
  - `cademycode_cleansed.db`: Output from `cleanse_data.py`, containing two tables.
  - `cademycode.db`: The raw data database with three tables.
  - `cademycode_updated.db`: An updated version of `cademycode.db` for testing the update process.
- `prod/`:
  - `changelog.md`: Copied from `/dev` when updates are approved.
  - `cademycode_cleansed.db`: Copied from `/dev` when updates are approved.
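
The approval-gated copy from `dev/` to `prod/` could be sketched as below. In the repository this step is handled by `script.sh`; the Python version here only illustrates the logic, and the function names, along with the choice to detect updates by comparing the two changelog files byte for byte, are assumptions.

```python
import filecmp
import shutil
from pathlib import Path

FILES_TO_PROMOTE = ("changelog.md", "cademycode_cleansed.db")


def changelog_has_updates(dev_dir, prod_dir) -> bool:
    """Return True if dev's changelog differs from prod's, or prod has none."""
    dev_log = Path(dev_dir) / "changelog.md"
    prod_log = Path(prod_dir) / "changelog.md"
    if not prod_log.exists():
        return True
    return not filecmp.cmp(dev_log, prod_log, shallow=False)


def promote(dev_dir, prod_dir, approved: bool) -> list:
    """Copy the changelog and clean database to prod if approved and changed."""
    if not approved or not changelog_has_updates(dev_dir, prod_dir):
        return []
    prod = Path(prod_dir)
    prod.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in FILES_TO_PROMOTE:
        shutil.copy2(Path(dev_dir) / name, prod / name)
        copied.append(name)
    return copied
```

An interactive caller might set `approved` from a prompt, for example `input("Update production? [y/n] ").strip().lower() == "y"`.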