Build a simple data lake based on the Medallion Architecture to incrementally ingest and process GitHub events.
$ git clone https://github.com/pracdata/duckdb-pipeline.git
$ cd duckdb-pipeline
$ python3 -m venv .venv
$ source .venv/bin/activate
# Install required packages
$ pip install -r requirements.txt
- Rename `config.ini.template` to `config.ini`.
- Edit `config.ini` and fill in your actual AWS S3 credential values in the `[aws]` section.
- If you are using an S3-compatible storage service, set the `s3_endpoint_url` parameter as well; otherwise remove that line.
- Edit `config.ini` and fill in the bucket names in the `[datalake]` section for each zone in your data lake. A sample layout is sketched below.
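For reference, a minimal `config.ini` might end up looking like the sketch below. Only the `[aws]` and `[datalake]` section names and the `s3_endpoint_url` parameter come from the steps above; the remaining key names and the bucket names are placeholders, so treat `config.ini.template` as the authoritative list of keys.

```ini
[aws]
; placeholder key names -- check config.ini.template for the exact keys
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
; only needed for S3-compatible storage (e.g. MinIO); remove it otherwise
s3_endpoint_url = https://s3.example.com

[datalake]
; one bucket per zone; names are illustrative
raw_bucket = my-datalake-raw
silver_bucket = my-datalake-silver
gold_bucket = my-datalake-gold
```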
There are Python scripts in the `scripts` directory for each phase (Ingestion, Serialisation, Aggregation), intended to be called by a scheduler such as cron.
The following are sample cron entries that run each pipeline script at an appropriate time.
Update the paths to match your setup, and make sure you allow enough time for each pipeline to complete before the next one starts.
# schedule the ingestion pipeline script to run 15 minutes past each hour
15 * * * * /path/to/your/venv/bin/python3 /path/to/your/duckdb-pipeline/scripts/run_ingest_source_data.py >> /tmp/ingest_source_data.out 2>&1
# schedule the serialisation pipeline script to run 30 minutes past each hour
30 * * * * /path/to/your/venv/bin/python3 /path/to/your/duckdb-pipeline/scripts/run_serialise_raw_data.py >> /tmp/serialise_raw_data.out 2>&1
# schedule the aggregation pipeline script to run daily at 2 AM
0 2 * * * /path/to/your/venv/bin/python3 /path/to/your/duckdb-pipeline/scripts/run_agg_silver_data.py >> /tmp/aggregate_silver_data.out 2>&1
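To sanity-check your S3 credentials and bucket configuration outside the scheduled pipelines, you can open a short DuckDB session driven by the same `config.ini`. This is only a sketch under the assumptions above: the config key names mirror the placeholder names from the earlier sketch (verify them against `config.ini.template`), and the query path in the silver zone is illustrative.

```python
import configparser

import duckdb

# Read the same config.ini the pipeline scripts use.
# NOTE: key names below follow the placeholder sketch above; adjust them
# to match config.ini.template.
config = configparser.ConfigParser()
config.read("config.ini")
aws = config["aws"]

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute(f"SET s3_access_key_id = '{aws['aws_access_key_id']}'")
con.execute(f"SET s3_secret_access_key = '{aws['aws_secret_access_key']}'")
if "s3_endpoint_url" in aws:
    # DuckDB expects the endpoint host only, without the URL scheme
    endpoint = aws["s3_endpoint_url"].replace("https://", "").replace("http://", "")
    con.execute(f"SET s3_endpoint = '{endpoint}'")

# Illustrative check: count rows across Parquet files in the silver zone.
silver_bucket = config["datalake"]["silver_bucket"]
rows = con.execute(
    f"SELECT count(*) FROM read_parquet('s3://{silver_bucket}/**/*.parquet')"
).fetchone()[0]
print(f"Rows in silver zone: {rows}")
```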