-
Notifications
You must be signed in to change notification settings - Fork 10
Primer Tutorial | How to use dbt with TiDB
This is a primer tutorial about how to use dbt-tidb.
Here we use the official dbt tutorial as an example to introduce the use of TiDB in dbt together.
The software used in this example and its version requirements:
- TiDB 5.3 or upper
- dbt 1.01 or upper
- dbt-tidb 1.0.0
Installing dbt and dbt-tidb requires only one command because dbt is installed as a dependency when we install dbt-tidb.
$ pip install dbt-tidb
dbt can also be installed by yourself. Please refer to the official installation guide.
The jaffle_shop is a project provided by dbt-lab to demonstrate dbt functionality. You can get it directly from GitHub.
$ git clone https://github.com/dbt-labs/jaffle_shop
$ cd jaffle_shop
All files in the jaffle_shop project directory are as follows.
- dbt_project.yml is the dbt project configuration file, which holds the project name, database configuration file path information, etc.
- models directory contains the SQL model and table schema for the project. Note that this section was written by the data analyst himself.
- seed directory stores CSV files. Such CSV files can be dumped from database export tools. For example, TiDB can export the table data into CSV files through Dumpling. In the jaffle shop project, these CSV files are used as raw data to be processed.
I'll go into more detail about them later when I use one of the files or directories above.
ubuntu@ubuntu:~/jaffle_shop$ tree
.
├── dbt_project.yml
├── etc
│ ├── dbdiagram_definition.txt
│ └── jaffle_shop_erd.png
├── LICENSE
├── models
│ ├── customers.sql
│ ├── docs.md
│ ├── orders.sql
│ ├── overview.md
│ ├── schema.yml
│ └── staging
│ ├── schema.yml
│ ├── stg_customers.sql
│ ├── stg_orders.sql
│ └── stg_payments.sql
├── README.md
└── seeds
├── raw_customers.csv
├── raw_orders.csv
└── raw_payments.csv
- Global configuration
dbt has a default global profile: ~/.dbt/profiles.yml
, which we first set up in the user directory and configured the connection information for the TiDB database.
$ vi ~/.dbt/profiles.yml
jaffle_shop_tidb: # project name
target: dev # target
outputs:
dev:
type: tidb # adapter type
server: 127.0.0.1
port: 4000
schema: analytics # database name
username: root
password: ""
- Project configuration
In the jaffle_shop project directory, the project configuration file dbt_project.yml
is available. Change the profile
configuration item to jaffle_shop_tidb
, which is the project name in profiles.yml
. Then, the project will query the database connection configuration in the ~/.dbt/profiles.yml file
.
$ cat dbt_project.yml
name: 'jaffle_shop'
config-version: 2
version: '0.1'
profile: 'jaffle_shop_tidb' # note the modification here
model-paths: ["models"] # model path
seed-paths: ["seeds"] # seed path
test-paths: ["tests"]
analysis-paths: ["analysis"]
macro-paths: ["macros"]
target-path: "target"
clean-targets:
- "target"
- "dbt_modules"
- "logs"
require-dbt-version: [">=1.0.0", "<2.0.0"]
models:
jaffle_shop:
materialized: table # *.sql which in models/ would be materialized to table
staging:
materialized: view # *.sql which in models/staging/ would bt materialized to view
- Verify the configuration
You can run the following command to check whether the database and project configuration is correct.
$ dbt debug
06:59:18 Running with dbt=1.0.1
dbt version: 1.0.1
python version: 3.8.10
python path: /usr/bin/python3
os info: Linux-5.4.0-97-generic-x86_64-with-glibc2.29
Using profiles.yml file at /home/ubuntu/.dbt/profiles.yml
Using dbt_project.yml file at /home/ubuntu/jaffle_shop/dbt_project.yml
Configuration:
profiles.yml file [OK found and valid]
dbt_project.yml file [OK found and valid]
Configuration:
profiles.yml file [OK found and valid]
dbt_project.yml file [OK found and valid]
Required dependencies:
- git [OK found]
Connection:
server: 127.0.0.1
port: 4000
database: None
schema: analytics
user: root
Connection test: [OK connection ok]
All checks passed!
Load the CSV data and materialize the CSV as a table in the target database. Note: In general, dbt projects do not need this step because the data for your pending projects is in the database.
$ dbt seed
07:03:24 Running with dbt=1.0.1
07:03:24 Partial parse save file not found. Starting full parse.
07:03:25 Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
07:03:25
07:03:25 Concurrency: 1 threads (target='dev')
07:03:25
07:03:25 1 of 3 START seed file analytics.raw_customers.................................. [RUN]
07:03:25 1 of 3 OK loaded seed file analytics.raw_customers.............................. [INSERT 100 in 0.19s]
07:03:25 2 of 3 START seed file analytics.raw_orders..................................... [RUN]
07:03:25 2 of 3 OK loaded seed file analytics.raw_orders................................. [INSERT 99 in 0.14s]
07:03:25 3 of 3 START seed file analytics.raw_payments................................... [RUN]
07:03:26 3 of 3 OK loaded seed file analytics.raw_payments............................... [INSERT 113 in 0.24s]
07:03:26
07:03:26 Finished running 3 seeds in 0.71s.
07:03:26
07:03:26 Completed successfully
07:03:26
07:03:26 Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
In the above result, it is clear that three tasks have been performed and three tables ( analytics.raw_customers
, analytics.raw_orders
, analytics.raw_payments
) were loaded.
Next, go to the TiDB database to see what happens.
We found an extra analytics
database, which is the engineering database that dbt created for us.
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| INFORMATION_SCHEMA |
| METRICS_SCHEMA |
| PERFORMANCE_SCHEMA |
| analytics |
| mysql |
| test |
+--------------------+
6 rows in set (0.00 sec)
There are three tables in analytics
database, corresponding to the above three task results respectively.
mysql> show tables;
+---------------------+
| Tables_in_analytics |
+---------------------+
| raw_customers |
| raw_orders |
| raw_payments |
+---------------------+
3 rows in set (0.00 sec)
Before skipping to the next step, it's important to understand what role the model plays in dbt.
In dbt, models are used to describe the structure of a set of tables or views, and there are two main types of files: SQL and YML. Also note that in the jaffle_shop project, table structures are held under the models/
directory and view structures are held under the models/staging/
directory according to the materialization configuration.
For example, models/orders.sql
is an SQL query statement with jinja syntax. We will create a table by this query statement.
$ cat models/orders.sql
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
with orders as (
select * from {{ ref('stg_orders') }}
),
payments as (
select * from {{ ref('stg_payments') }}
),
order_payments as (
select
order_id,
{% for payment_method in payment_methods -%}
sum(case when payment_method = '{{ payment_method }}' then amount else 0 end) as {{ payment_method }}_amount,
{% endfor -%}
sum(amount) as total_amount
from payments
group by order_id
),
final as (
select
orders.order_id,
orders.customer_id,
orders.order_date,
orders.status,
{% for payment_method in payment_methods -%}
order_payments.{{ payment_method }}_amount,
{% endfor -%}
order_payments.total_amount as amount
from orders
left join order_payments
on orders.order_id = order_payments.order_id
)
select * from final
Also, the constraint information that goes with this SQL is in the models/schema.yml
file.
The schema.yml
is a registry of all models in the current directory. All models are organized into a tree structure that describes the description and attributes of each field. The tests
items represent constraints for this field, which can be tested using the dbt test
command. See the official documentation for more information.
cat models/schema.yml
version: 2
...
- name: orders
description: This table has basic information about orders, as well as some derived facts based on payments
columns:
- name: order_id
tests:
- unique
- not_null
description: This is a unique identifier for an order
- name: customer_id
description: Foreign key to the customers table
tests:
- not_null
- relationships:
to: ref('customers')
field: customer_id
- name: order_date
description: Date (UTC) that the order was placed
- name: status
description: '{{ doc("orders_status") }}'
tests:
- accepted_values:
values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
- name: amount
description: Total amount (AUD) of the order
tests:
- not_null
- name: credit_card_amount
description: Amount of the order (AUD) paid for by credit card
tests:
- not_null
- name: coupon_amount
description: Amount of the order (AUD) paid for by coupon
tests:
- not_null
- name: bank_transfer_amount
description: Amount of the order (AUD) paid for by bank transfer
tests:
- not_null
- name: gift_card_amount
description: Amount of the order (AUD) paid for by gift card
tests:
- not_null
The result shows three views (analytics.stg_customers
, analytics.stg_orders
, analytics.stg_payments
) and two tables (analytics.customers
, analytics.orders
).
$ dbt run
07:28:43 Running with dbt=1.0.1
07:28:43 Unable to do partial parsing because profile has changed
07:28:43 Unable to do partial parsing because a project dependency has been added
07:28:44 Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
07:28:44
07:28:44 Concurrency: 1 threads (target='dev')
07:28:44
07:28:44 1 of 5 START view model analytics.stg_customers................................. [RUN]
07:28:44 1 of 5 OK created view model analytics.stg_customers............................ [SUCCESS 0 in 0.12s]
07:28:44 2 of 5 START view model analytics.stg_orders.................................... [RUN]
07:28:44 2 of 5 OK created view model analytics.stg_orders............................... [SUCCESS 0 in 0.08s]
07:28:44 3 of 5 START view model analytics.stg_payments.................................. [RUN]
07:28:44 3 of 5 OK created view model analytics.stg_payments............................. [SUCCESS 0 in 0.07s]
07:28:44 4 of 5 START table model analytics.customers.................................... [RUN]
07:28:44 4 of 5 OK created table model analytics.customers............................... [SUCCESS 0 in 0.16s]
07:28:44 5 of 5 START table model analytics.orders....................................... [RUN]
07:28:45 5 of 5 OK created table model analytics.orders.................................. [SUCCESS 0 in 0.12s]
07:28:45
07:28:45 Finished running 3 view models, 2 table models in 0.64s.
07:28:45
07:28:45 Completed successfully
07:28:45
07:28:45 Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
Go to the TiDB database to verify that the creation is successful.
The result illustrates that five more tables or views, such as customers
, have been added, and the data in the tables or views have been transformed. Only part of customers
data is shown here.
mysql> show tables;
+---------------------+
| Tables_in_analytics |
+---------------------+
| customers |
| orders |
| raw_customers |
| raw_orders |
| raw_payments |
| stg_customers |
| stg_orders |
| stg_payments |
+---------------------+
8 rows in set (0.00 sec)
mysql> select * from customers;
+-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+
| customer_id | first_name | last_name | first_order | most_recent_order | number_of_orders | customer_lifetime_value |
+-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+
| 1 | Michael | P. | 2018-01-01 | 2018-02-10 | 2 | 33.0000 |
| 2 | Shawn | M. | 2018-01-11 | 2018-01-11 | 1 | 23.0000 |
| 3 | Kathleen | P. | 2018-01-02 | 2018-03-11 | 3 | 65.0000 |
| 4 | Jimmy | C. | NULL | NULL | NULL | NULL |
| 5 | Katherine | R. | NULL | NULL | NULL | NULL |
| 6 | Sarah | R. | 2018-02-19 | 2018-02-19 | 1 | 8.0000 |
| 7 | Martin | M. | 2018-01-14 | 2018-01-14 | 1 | 26.0000 |
| 8 | Frank | R. | 2018-01-29 | 2018-03-12 | 2 | 45.0000 |
....
dbt also supports the generation of visual documents, using the following command.
- Generate
$ dbt docs generate
07:33:59 Running with dbt=1.0.1
07:33:59 Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
07:33:59
07:33:59 Concurrency: 1 threads (target='dev')
07:33:59
07:33:59 Done.
07:33:59 Building catalog
07:33:59 Catalog written to /home/ubuntu/jaffle_shop/target/catalog.json
- Start server
$ dbt docs serve
07:43:01 Running with dbt=1.0.1
07:43:01 Serving docs at 0.0.0.0:8080
07:43:01 To access from your browser, navigate to: http://localhost:8080
07:43:01
07:43:01
07:43:01 Press Ctrl+C to exit.
The document, which contains the overall structure of the jaffle_shop project and a description of all the tables and views, can be viewed in a browser.
The use of TiDB in dbt includes the following steps:
- Install dbt and dbt-tidb
- Configuration
- Write SQL and YML files
- Running
Now, TiDB supports dbt in version 4.0 and above. However, according to the dbt-tidb project README.md, the low version of TiDB has some problems when used with dbt. Such as no support for temporary tables and temporary views, no support for the WITH syntax, etc. To get the most out of dbt, TiDB 5.3 or higher is recommended, which supports all the functions of dbt.