Skip to content

Primer Tutorial | How to use dbt with TiDB

Daemonxiao edited this page Apr 15, 2022 · 1 revision

This is a primer tutorial about how to use dbt-tidb.

Here we use the official dbt tutorial as an example to introduce the use of TiDB in dbt together.

The software used in this example and its version requirements:

  • TiDB 5.3 or upper
  • dbt 1.01 or upper
  • dbt-tidb 1.0.0

Installation

Installing dbt and dbt-tidb requires only one command because dbt is installed as a dependency when we install dbt-tidb.

$ pip install dbt-tidb

dbt can also be installed by yourself. Please refer to the official installation guide.

Creating project: jaffle shop

The jaffle_shop is a project provided by dbt-lab to demonstrate dbt functionality. You can get it directly from GitHub.

$ git clone https://github.com/dbt-labs/jaffle_shop
$ cd jaffle_shop

All files in the jaffle_shop project directory are as follows.

  • dbt_project.yml is the dbt project configuration file, which holds the project name, database configuration file path information, etc.
  • models directory contains the SQL model and table schema for the project. Note that this section was written by the data analyst himself.
  • seed directory stores CSV files. Such CSV files can be dumped from database export tools. For example, TiDB can export the table data into CSV files through Dumpling. In the jaffle shop project, these CSV files are used as raw data to be processed.

I'll go into more detail about them later when I use one of the files or directories above.

ubuntu@ubuntu:~/jaffle_shop$ tree
.
├── dbt_project.yml
├── etc
│   ├── dbdiagram_definition.txt
│   └── jaffle_shop_erd.png
├── LICENSE
├── models
│   ├── customers.sql
│   ├── docs.md
│   ├── orders.sql
│   ├── overview.md
│   ├── schema.yml
│   └── staging
│       ├── schema.yml
│       ├── stg_customers.sql
│       ├── stg_orders.sql
│       └── stg_payments.sql
├── README.md
└── seeds
    ├── raw_customers.csv
    ├── raw_orders.csv
    └── raw_payments.csv

Configuring project

  1. Global configuration

dbt has a default global profile: ~/.dbt/profiles.yml, which we first set up in the user directory and configured the connection information for the TiDB database.

 $ vi ~/.dbt/profiles.yml
 jaffle_shop_tidb:                        # project name
  target: dev                             # target
  outputs:
    dev:
      type: tidb                         # adapter type
      server: 127.0.0.1                  
      port: 4000                         
      schema: analytics                  # database name
      username: root                     
      password: ""                       
  1. Project configuration

In the jaffle_shop project directory, the project configuration file dbt_project.yml is available. Change the profile configuration item to jaffle_shop_tidb, which is the project name in profiles.yml. Then, the project will query the database connection configuration in the ~/.dbt/profiles.yml file.

$ cat dbt_project.yml 
name: 'jaffle_shop'

config-version: 2
version: '0.1'

profile: 'jaffle_shop_tidb'                   # note the modification here

model-paths: ["models"]                       # model path
seed-paths: ["seeds"]                         # seed path
test-paths: ["tests"]                         
analysis-paths: ["analysis"]
macro-paths: ["macros"]

target-path: "target"
clean-targets:
    - "target"
    - "dbt_modules"
    - "logs"

require-dbt-version: [">=1.0.0", "<2.0.0"]

models:
  jaffle_shop:
      materialized: table            # *.sql which in models/ would be materialized to table
      staging:           
        materialized: view           # *.sql which in models/staging/ would bt materialized to view
  1. Verify the configuration

You can run the following command to check whether the database and project configuration is correct.

$ dbt debug
06:59:18  Running with dbt=1.0.1
dbt version: 1.0.1
python version: 3.8.10
python path: /usr/bin/python3
os info: Linux-5.4.0-97-generic-x86_64-with-glibc2.29
Using profiles.yml file at /home/ubuntu/.dbt/profiles.yml
Using dbt_project.yml file at /home/ubuntu/jaffle_shop/dbt_project.yml

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  server: 127.0.0.1
  port: 4000
  database: None
  schema: analytics
  user: root
  Connection test: [OK connection ok]

All checks passed!

Loading CSV

Load the CSV data and materialize the CSV as a table in the target database. Note: In general, dbt projects do not need this step because the data for your pending projects is in the database.

$ dbt seed
07:03:24  Running with dbt=1.0.1
07:03:24  Partial parse save file not found. Starting full parse.
07:03:25  Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
07:03:25
07:03:25  Concurrency: 1 threads (target='dev')
07:03:25
07:03:25  1 of 3 START seed file analytics.raw_customers.................................. [RUN]
07:03:25  1 of 3 OK loaded seed file analytics.raw_customers.............................. [INSERT 100 in 0.19s]
07:03:25  2 of 3 START seed file analytics.raw_orders..................................... [RUN]
07:03:25  2 of 3 OK loaded seed file analytics.raw_orders................................. [INSERT 99 in 0.14s]
07:03:25  3 of 3 START seed file analytics.raw_payments................................... [RUN]
07:03:26  3 of 3 OK loaded seed file analytics.raw_payments............................... [INSERT 113 in 0.24s]
07:03:26
07:03:26  Finished running 3 seeds in 0.71s.
07:03:26
07:03:26  Completed successfully
07:03:26
07:03:26  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

In the above result, it is clear that three tasks have been performed and three tables ( analytics.raw_customers, analytics.raw_orders, analytics.raw_payments) were loaded.

Next, go to the TiDB database to see what happens.

We found an extra analytics database, which is the engineering database that dbt created for us.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| INFORMATION_SCHEMA |
| METRICS_SCHEMA     |
| PERFORMANCE_SCHEMA |
| analytics          |
| mysql              |
| test               |
+--------------------+
6 rows in set (0.00 sec)

There are three tables in analytics database, corresponding to the above three task results respectively.

mysql> show tables;
+---------------------+
| Tables_in_analytics |
+---------------------+
| raw_customers       |
| raw_orders          |
| raw_payments        |
+---------------------+
3 rows in set (0.00 sec)

model

Before skipping to the next step, it's important to understand what role the model plays in dbt.

In dbt, models are used to describe the structure of a set of tables or views, and there are two main types of files: SQL and YML. Also note that in the jaffle_shop project, table structures are held under the models/ directory and view structures are held under the models/staging/ directory according to the materialization configuration.

For example, models/orders.sql is an SQL query statement with jinja syntax. We will create a table by this query statement.

$ cat models/orders.sql
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}

with orders as (

    select * from {{ ref('stg_orders') }}

),

payments as (

    select * from {{ ref('stg_payments') }}

),

order_payments as (

    select
        order_id,

        {% for payment_method in payment_methods -%}
        sum(case when payment_method = '{{ payment_method }}' then amount else 0 end) as {{ payment_method }}_amount,
        {% endfor -%}

        sum(amount) as total_amount

    from payments

    group by order_id

),

final as (

    select
        orders.order_id,
        orders.customer_id,
        orders.order_date,
        orders.status,

        {% for payment_method in payment_methods -%}

        order_payments.{{ payment_method }}_amount,

        {% endfor -%}

        order_payments.total_amount as amount

    from orders


    left join order_payments
        on orders.order_id = order_payments.order_id

)

select * from final

Also, the constraint information that goes with this SQL is in the models/schema.yml file.

The schema.yml is a registry of all models in the current directory. All models are organized into a tree structure that describes the description and attributes of each field. The tests items represent constraints for this field, which can be tested using the dbt test command. See the official documentation for more information.

cat models/schema.yml
version: 2
...
  - name: orders
    description: This table has basic information about orders, as well as some derived facts based on payments

    columns:
      - name: order_id
        tests:
          - unique
          - not_null
        description: This is a unique identifier for an order

      - name: customer_id
        description: Foreign key to the customers table
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id

      - name: order_date
        description: Date (UTC) that the order was placed

      - name: status
        description: '{{ doc("orders_status") }}'
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']

      - name: amount
        description: Total amount (AUD) of the order
        tests:
          - not_null

      - name: credit_card_amount
        description: Amount of the order (AUD) paid for by credit card
        tests:
          - not_null

      - name: coupon_amount
        description: Amount of the order (AUD) paid for by coupon
        tests:
          - not_null

      - name: bank_transfer_amount
        description: Amount of the order (AUD) paid for by bank transfer
        tests:
          - not_null

      - name: gift_card_amount
        description: Amount of the order (AUD) paid for by gift card
        tests:
          - not_null

Running

The result shows three views (analytics.stg_customers, analytics.stg_orders, analytics.stg_payments) and two tables (analytics.customers, analytics.orders).

$ dbt run
07:28:43  Running with dbt=1.0.1
07:28:43  Unable to do partial parsing because profile has changed
07:28:43  Unable to do partial parsing because a project dependency has been added
07:28:44  Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
07:28:44
07:28:44  Concurrency: 1 threads (target='dev')
07:28:44
07:28:44  1 of 5 START view model analytics.stg_customers................................. [RUN]
07:28:44  1 of 5 OK created view model analytics.stg_customers............................ [SUCCESS 0 in 0.12s]
07:28:44  2 of 5 START view model analytics.stg_orders.................................... [RUN]
07:28:44  2 of 5 OK created view model analytics.stg_orders............................... [SUCCESS 0 in 0.08s]
07:28:44  3 of 5 START view model analytics.stg_payments.................................. [RUN]
07:28:44  3 of 5 OK created view model analytics.stg_payments............................. [SUCCESS 0 in 0.07s]
07:28:44  4 of 5 START table model analytics.customers.................................... [RUN]
07:28:44  4 of 5 OK created table model analytics.customers............................... [SUCCESS 0 in 0.16s]
07:28:44  5 of 5 START table model analytics.orders....................................... [RUN]
07:28:45  5 of 5 OK created table model analytics.orders.................................. [SUCCESS 0 in 0.12s]
07:28:45
07:28:45  Finished running 3 view models, 2 table models in 0.64s.
07:28:45
07:28:45  Completed successfully
07:28:45
07:28:45  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5

Go to the TiDB database to verify that the creation is successful.

The result illustrates that five more tables or views, such as customers, have been added, and the data in the tables or views have been transformed. Only part of customers data is shown here.

mysql> show tables;
+---------------------+
| Tables_in_analytics |
+---------------------+
| customers           |
| orders              |
| raw_customers       |
| raw_orders          |
| raw_payments        |
| stg_customers       |
| stg_orders          |
| stg_payments        |
+---------------------+
8 rows in set (0.00 sec)

mysql> select * from customers;
+-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+
| customer_id | first_name | last_name | first_order | most_recent_order | number_of_orders | customer_lifetime_value |
+-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+
|           1 | Michael    | P.        | 2018-01-01  | 2018-02-10        |                2 |                 33.0000 |
|           2 | Shawn      | M.        | 2018-01-11  | 2018-01-11        |                1 |                 23.0000 |
|           3 | Kathleen   | P.        | 2018-01-02  | 2018-03-11        |                3 |                 65.0000 |
|           4 | Jimmy      | C.        | NULL        | NULL              |             NULL |                    NULL |
|           5 | Katherine  | R.        | NULL        | NULL              |             NULL |                    NULL |
|           6 | Sarah      | R.        | 2018-02-19  | 2018-02-19        |                1 |                  8.0000 |
|           7 | Martin     | M.        | 2018-01-14  | 2018-01-14        |                1 |                 26.0000 |
|           8 | Frank      | R.        | 2018-01-29  | 2018-03-12        |                2 |                 45.0000 |
....

Generating doc

dbt also supports the generation of visual documents, using the following command.

  1. Generate
$ dbt docs generate
07:33:59  Running with dbt=1.0.1
07:33:59  Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
07:33:59
07:33:59  Concurrency: 1 threads (target='dev')
07:33:59
07:33:59  Done.
07:33:59  Building catalog
07:33:59  Catalog written to /home/ubuntu/jaffle_shop/target/catalog.json
  1. Start server
$ dbt docs serve
07:43:01  Running with dbt=1.0.1
07:43:01  Serving docs at 0.0.0.0:8080
07:43:01  To access from your browser, navigate to:  http://localhost:8080
07:43:01
07:43:01
07:43:01  Press Ctrl+C to exit.

The document, which contains the overall structure of the jaffle_shop project and a description of all the tables and views, can be viewed in a browser.

Conclusion

The use of TiDB in dbt includes the following steps:

  1. Install dbt and dbt-tidb
  2. Configuration
  3. Write SQL and YML files
  4. Running

Now, TiDB supports dbt in version 4.0 and above. However, according to the dbt-tidb project README.md, the low version of TiDB has some problems when used with dbt. Such as no support for temporary tables and temporary views, no support for the WITH syntax, etc. To get the most out of dbt, TiDB 5.3 or higher is recommended, which supports all the functions of dbt.