An example of how we might make feature queries written in SQL easier to read, better documented and testable.
I have created fictitious order + item data related to 40 orders, as per `mock_dbt_data.ipynb`. I then uploaded these as tables into an empty project associated with my gmail. A service account connects this google project with this dbt project, so the queries in this repo are testable on my local.
This project structure was created when dbt was initialised (ignore `analysis/`, `data/`, `snapshots/`, `tests/`, which were not used).

`models/` contains all the .sql files that do transformations:
- Models in `models/staging` are 1:1 with the source tables and do basic cleaning and filtering (i.e. get data for the particular client over particular days) for each source table. A sketch of what one of these staging models could look like follows below.
- Models in `models/marts` select from staging models and calculate business logic. I have created a sub folder `intermediate/` to hold intermediate transformations. `features.sql` represents the model that joins all the features from the intermediate sub queries.

`schema.yml` files accompany models in these directories, and contain descriptions for each model along with the necessary tests that should be called. The naming is consistent with how dbt names their projects, but in reality is flexible; more info here: https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355
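For illustration, the sketch below shows the rough shape a staging model like `stg_orders` could take. The source name and the column names (`order_id`, `supplier_id`, `customer_id`, `client_id`, `order_date`) are assumptions for this example rather than the actual schema, and the var names mirror the kind of variables declared in `dbt_project.yml`.

```sql
-- models/staging/stg_orders.sql (illustrative sketch; source and column names are assumed)
with source as (

    -- the source/table names are placeholders for whatever models/staging/schema.yml declares
    select * from {{ source('clean_eu', 'orders') }}

),

cleaned as (

    select
        order_id,
        supplier_id,
        customer_id,
        client_id,
        cast(order_date as date) as order_date
    from source
    -- basic cleaning: drop rows with no primary key
    where order_id is not null
        -- filter to the particular client over particular days,
        -- using the vars declared in dbt_project.yml (names assumed)
        and client_id = '{{ var("client_id") }}'
        and cast(order_date as date)
            between date('{{ var("start_date") }}') and date('{{ var("end_date") }}')

)

select * from cleaned
```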
In the supplier pipeline we calculate a number of features about a supplier. Each subquery aggregates a set of statistics about a supplier, which we join at the end. We have `repeat_customer` features that describe how often a supplier re-delivers to their customers, and `repeat_item` features that describe how often a supplier re-delivers the same items.
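To make this concrete, here is a sketch of what one of these intermediate subqueries could look like, aggregating a simple `repeat_customer` statistic per supplier. The column names and the definition of "repeat" (a customer with more than one order from the supplier) are assumptions for this example, not the exact logic in the repo.

```sql
-- models/marts/intermediate/repeat_customer.sql (illustrative sketch; column names assumed)
with orders_per_customer as (

    select
        supplier_id,
        customer_id,
        count(*) as n_orders
    from {{ ref('stg_orders') }}
    group by supplier_id, customer_id

)

select
    supplier_id,
    count(*) as n_customers,
    -- "repeat" is assumed here to mean a customer with more than one order from this supplier
    countif(n_orders > 1) as n_repeat_customers,
    safe_divide(countif(n_orders > 1), count(*)) as repeat_customer_rate
from orders_per_customer
group by supplier_id
```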
Here I have shown the queries & structure we require to a) clean the order + item source tables, b) calculate the `repeat_customer` and `repeat_item` intermediate subqueries, and c) join the results into a feature struct.
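And a sketch of how `features.sql` could join those intermediate subqueries into a feature struct, using BigQuery's `STRUCT` syntax. The intermediate model names and columns are the assumed ones from the sketch above, with analogous names assumed for `repeat_item`.

```sql
-- models/marts/features.sql (illustrative sketch)
-- materialised as a table via dbt_project.yml; the upstream models are views (or ephemeral)
select
    rc.supplier_id,
    struct(
        rc.n_repeat_customers as n_repeat_customers,
        rc.repeat_customer_rate as repeat_customer_rate
    ) as repeat_customer,
    struct(
        ri.n_repeat_items as n_repeat_items,
        ri.repeat_item_rate as repeat_item_rate
    ) as repeat_item
from {{ ref('repeat_customer') }} as rc
left join {{ ref('repeat_item') }} as ri
    on rc.supplier_id = ri.supplier_id
```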
- Under the `models` section in `dbt_project.yml`, I have declared that all query results should be expressed as views in bigquery, but the feature struct should be created as a table. In reality we'd probably want all the models apart from `features` to be materialised as `ephemeral`, which means they aren't built into the database (it's unlikely we'd ever need to query them). See figure 1. A per-model alternative to this configuration is sketched after this list.
- Under the `vars` section in `dbt_project.yml`, I have declared the variables we want to format the jinja templates with, but these could also be passed from the command line.
- In `models/staging/schema.yml` we define the source tables (containing mock data) that exist in the project. We also define some tests for the two staging tables: that the primary key of `stg_orders` and `stg_items` is unique and non-null. I've included an example of a custom test called `latest_supplier` built for `stg_orders`, which can be found in `macros/test_latest_supplier.sql`; it returns the number of rows that fail the test, so a result of 0 means the test passes. A sketch of the shape such a macro takes follows this list.
- The `description` fields in all the schema yamls mean that when we generate documentation each model / column can include a human readable string to explain it. See figures 2 and 3.
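As an aside on the materialisation and vars bullets above: the same settings can also be applied inside an individual model file instead of (or as well as) `dbt_project.yml`. A minimal sketch, assuming the var names used earlier:

```sql
-- Per-model alternative to the dbt_project.yml settings (illustrative sketch).
-- Ephemeral models are inlined as CTEs into downstream queries rather than
-- being built as objects in BigQuery.
{{ config(materialized='ephemeral') }}

select *
from {{ ref('stg_orders') }}
-- vars declared under `vars` in dbt_project.yml (names assumed) are read with var();
-- they can also be overridden on the command line via --vars
where order_date >= date('{{ var("start_date") }}')
```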
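For the custom test, the sketch below shows only the shape such a macro takes under the convention described above (a macro that returns a count of failing rows). The condition is a placeholder, not the actual `latest_supplier` check in `macros/test_latest_supplier.sql`, and the `column_name` argument is an assumption about how the test is parameterised.

```sql
-- macros/test_latest_supplier.sql: the shape of a custom schema test under the
-- convention above, i.e. a macro returning a count of failing rows (0 = pass).
{% macro test_latest_supplier(model, column_name) %}

select count(*)
from {{ model }}
-- placeholder condition: the real macro in this repo encodes the actual
-- "latest supplier" check for stg_orders
where {{ column_name }} is null

{% endmacro %}
```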
Figure 1: In the bq project, data was uploaded into the `clean_eu` dataset and models are materialised into the `dbt_milly` dataset.
Figure 2: The html documentation that is produced when `dbt docs generate` is followed by `dbt docs serve`.
Figure 3: The lineage graph that is produced in the documentation. Green blocks indicate source tables and the purple block indicates the final "mart" model. To calculate all features, we'd expect there to be about 8 entity source tables to start with, the same number of staging models, then many (30+) intermediate models which are combined into the final `features` model.
This repo cannot be run locally without the necessary dbt user credentials on disk, but the main two "jobs" for dbt are `run` (materialise all the models in your warehouse) and `test` (check that the materialised models fulfill some criteria). See figures 4 and 5 for what the local outputs look like.
In a real life scenario, these two jobs can be scheduled via dbt cloud, which connects to bigquery and the dbt repository, so updated code and data will be used every time a `run` or `test` job is due to take place. However, for feature generation we might not require scheduled jobs, and could instead use dbt to build our feature queries offline whenever we run a pipeline.