A flexible Python ETL toolkit for data warehousing, based on Dask, Prefect and the PyData stack. It follows the original design principles of these libraries, combined with a functional programming approach to data engineering.
Google Cloud Platform (GCP) is used as the core infrastructure, particularly BigQuery (GBQ) and Cloud Storage (GCS) as the main storage engines. We follow Google's recommendations on how to use BigQuery for data warehouse applications with four layers:
- source data, in production environment or file-based
- staging, on GCS
- data vault, on GBQ
- datamarts, on GBQ, using the ARRAY_AGG, STRUCT, UNNEST SQL pattern
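As a rough illustration of the ARRAY_AGG, STRUCT, UNNEST pattern mentioned for the datamart layer, the sketch below builds two BigQuery Standard SQL statements as plain strings: one that folds child rows into an array of structs per key, and one that flattens such an array again. The helper names (`nest_query`, `unnest_query`), the `items` column, and the table and column names are hypothetical and not part of this library; the strings are built by simple interpolation, so this is only suitable for trusted, hard-coded identifiers.

```python
from typing import Sequence


def nest_query(table: str, key: str, columns: Sequence[str]) -> str:
    """Build a query that folds rows into one ARRAY of STRUCTs per key."""
    struct_fields = ", ".join(columns)
    return (
        f"SELECT {key}, ARRAY_AGG(STRUCT({struct_fields})) AS items\n"
        f"FROM `{table}`\n"
        f"GROUP BY {key}"
    )


def unnest_query(table: str, key: str, columns: Sequence[str]) -> str:
    """Build a query that flattens the nested `items` array back into rows."""
    selected = ", ".join(f"item.{column}" for column in columns)
    return (
        f"SELECT t.{key}, {selected}\n"
        f"FROM `{table}` AS t, UNNEST(t.items) AS item"
    )


# Hypothetical table and column names, for illustration only.
print(nest_query("my_project.my_dataset.measures", "region_id", ["year", "value"]))
print(unnest_query("my_project.my_dataset.measures_nested", "region_id", ["year", "value"]))
```

The resulting strings could then be passed to whichever BigQuery client is in use; nothing here talks to GCP.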
To take advantage of open data, it must be possible to combine various datasets. As of now, doing so requires substantial knowledge of programming and data engineering. This library aims to make that task easier.
Using pip:
pip install nl_open_data
Note: installation via pip is not implemented yet.
Using Poetry: The package is managed with Poetry, so it can also be installed that way. Assuming Poetry is already installed:
- Clone the repository
- From your local clone's root folder, run
poetry install
There are two elements that need to be configured prior to using the library.
The GCP project id, bucket, and location should be given by editing nl-open-data/nl_open_data/config.toml, allowing up to 3 choices at runtime: dev, test and prod. Note that you must nest the GCP project details correctly for them to be interpreted, as seen below. You must have the proper IAM permissions on the GCP projects (more details below).
Correct nesting in config file:
[gcp]
[gcp.dev]
project_id = "my_dev_project_id"
bucket = "my_dev_bucket"
location = "EU"
[gcp.test]
project_id = "my_test_project_id"
bucket = "my_test_bucket"
location = "EU"
[gcp.prod]
project_id = "my_prod_project_id"
bucket = "my_prod_bucket"
location = "EU"
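To show how this nesting is read, here is a minimal sketch that parses a config of the same shape and picks one environment. It is not the library's own loading code: the function name `load_gcp_settings` is hypothetical, the config text is inlined rather than read from config.toml, and it assumes Python 3.11+ for `tomllib` (falling back to the third-party `tomli` package on older interpreters).

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:  # older interpreters: pip install tomli
    import tomli as tomllib

# Same structure as the config shown above, inlined for illustration.
EXAMPLE_CONFIG = """
[gcp]
[gcp.dev]
project_id = "my_dev_project_id"
bucket = "my_dev_bucket"
location = "EU"
[gcp.test]
project_id = "my_test_project_id"
bucket = "my_test_bucket"
location = "EU"
[gcp.prod]
project_id = "my_prod_project_id"
bucket = "my_prod_bucket"
location = "EU"
"""


def load_gcp_settings(config_text: str, env: str) -> dict:
    """Return the project_id, bucket and location for one environment."""
    config = tomllib.loads(config_text)
    environments = config.get("gcp", {})
    if env not in environments:
        # Raised when the [gcp.<env>] table is missing or nested incorrectly.
        raise KeyError(f"No [gcp.{env}] table found in config")
    return environments[env]


settings = load_gcp_settings(EXAMPLE_CONFIG, "test")
print(settings["project_id"], settings["bucket"], settings["location"])
```

Because each environment is its own table under `[gcp]`, a table name that appears twice makes the file invalid TOML and parsing fails, which is why every environment needs a distinct name.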
Additionally, the local paths used by the library can be configured here. Under [paths], define the path to the library and other temporary folders.