diff --git a/website/docs/guides/best-practices/how-we-style/0-how-we-style-our-dbt-projects.md b/website/docs/guides/best-practices/how-we-style/0-how-we-style-our-dbt-projects.md new file mode 100644 index 00000000000..dd695af2602 --- /dev/null +++ b/website/docs/guides/best-practices/how-we-style/0-how-we-style-our-dbt-projects.md @@ -0,0 +1,29 @@ +--- +title: How we style our dbt projects +id: 0-how-we-style-our-dbt-projects +--- + +## Why does style matter? + +Style might seem like a trivial, surface-level issue, but it's a deeply material aspect of a well-built project. A consistent, clear style enhances readability and makes your project easier to understand and maintain. Highly readable code helps build clear mental models making it easier to debug and extend your project. It's not just a favor to yourself, though; equally importantly, it makes it less effort for others to understand and contribute to your project, which is essential for peer collaboration, open-source work, and onboarding new team members. [A style guide lets you focus on what matters](https://mtlynch.io/human-code-reviews-1/#settle-style-arguments-with-a-style-guide), the logic and impact of your project, rather than the superficialities of how it's written. This brings harmony and pace to your team's work, and makes reviews more enjoyable and valuable. + +## What's important about style? + +There are two crucial tenets of code style: + +- Clarity +- Consistency + +Style your code in such a way that you can quickly read and understand it. It's also important to consider code review and git diffs. If you're making a change to a model, you want reviewers to see just the material changes you're making clearly. + +Once you've established a clear style, stay consistent. This is the most important thing. Everybody on your team needs to have a unified style, which is why having a style guide is so crucial. 
If you're writing a model, you should be able to look at other models in the project that your teammates have written and read in the same style. If you're writing a macro or a test, you should see the same style as your models. Consistency is key. + +## How should I style? + +You should style the project in a way you and your teammates or collaborators agree on. The most important thing is that you have a style guide and stick to it. This guide is just a suggestion to get you started and to give you a sense of what a style guide might look like. It covers various areas you may want to consider, with suggested rules. It emphasizes lots of whitespace, clarity, clear naming, and comments. + +We believe one of the strengths of SQL is that it reads like English, so we lean into that declarative nature throughout our projects. Even within dbt Labs, though, there are differing opinions on how to style, even a small but passionate contingent of leading comma enthusiasts! Again, the important thing is not to follow this style guide; it's to make _your_ style guide and follow it. Lastly, be sure to include rules, tools, _and_ examples in your style guide to make it as easy as possible for your team to follow. + +## Automation + +Use formatters and linters as much as possible. We're all human, we make mistakes. Not only that, but we all have different preferences and opinions while writing code. Automation is a great way to ensure that your project is styled consistently and correctly and that people can write in a way that's quick and comfortable for them, while still getting perfectly consistent output. 
diff --git a/website/docs/guides/best-practices/how-we-style/1-how-we-style-our-dbt-models.md b/website/docs/guides/best-practices/how-we-style/1-how-we-style-our-dbt-models.md new file mode 100644 index 00000000000..0157af63cfb --- /dev/null +++ b/website/docs/guides/best-practices/how-we-style/1-how-we-style-our-dbt-models.md @@ -0,0 +1,66 @@ +--- +title: How we style our dbt models +id: 1-how-we-style-our-dbt-models +--- + +## Fields and model names + +- 👥 Models should be pluralized, for example, `customers`, `orders`, `products`. +- 🔑 Each model should have a primary key. +- 🔑 The primary key of a model should be named `_id`, for example, `account_id`. This makes it easier to know what `id` is being referenced in downstream joined models. +- 🔑 Keys should be string data types. +- 🔑 Consistency is key! Use the same field names across models where possible. For example, a key to the `customers` table should be named `customer_id` rather than `user_id` or `id`. +- ❌ Do not use abbreviations or aliases. Emphasize readability over brevity. For example, do not use `cust` for `customer` or `o` for `orders`. +- ❌ Avoid reserved words as column names. +- ➕ Booleans should be prefixed with `is_` or `has_`. +- 🕰️ Timestamp columns should be named `_at` (for example, `created_at`) and should be in UTC. If a different timezone is used, this should be indicated with a suffix (`created_at_pt`). +- 📆 Dates should be named `_date`. For example, `created_date`. +- 🔙 Event dates and times should be past tense: `created`, `updated`, or `deleted`. +- 💱 Price/revenue fields should be in decimal currency (`19.99` for $19.99; many app databases store prices as integers in cents). If a non-decimal currency is used, indicate this with a suffix (`price_in_cents`). +- 🐍 Schema, table and column names should be in `snake_case`. +- 🏦 Use names based on the _business_ terminology, rather than the source terminology. 
For example, if the source database uses `user_id` but the business calls them `customer_id`, use `customer_id` in the model. +- 🔢 Versions of models should use the suffix `_v1`, `_v2`, etc for consistency (`customers_v1` and `customers_v2`). +- 🗄️ Use a consistent ordering of data types and consider grouping and labeling columns by type, as in the example below. This will minimize join errors and make it easier to read the model, as well as help downstream consumers of the data understand the data types and scan models for the columns they need. We prefer to use the following order: ids, strings, numerics, booleans, dates, and timestamps. + +## Example model + +```sql +with + +source as ( + + select * from {{ source('ecom', 'raw_orders') }} + +), + +renamed as ( + + select + + ---------- ids + id as order_id, + store_id as location_id, + customer as customer_id, + + ---------- strings + status as order_status, + + ---------- numerics + (order_total / 100.0)::float as order_total, + (tax_paid / 100.0)::float as tax_paid, + + ---------- booleans + is_fulfilled, + + ---------- dates + date(order_date) as ordered_date, + + ---------- timestamps + ordered_at + + from source + +) + +select * from renamed +``` diff --git a/website/docs/guides/best-practices/how-we-style/2-how-we-style-our-sql.md b/website/docs/guides/best-practices/how-we-style/2-how-we-style-our-sql.md new file mode 100644 index 00000000000..1ea9c064d74 --- /dev/null +++ b/website/docs/guides/best-practices/how-we-style/2-how-we-style-our-sql.md @@ -0,0 +1,183 @@ +--- +title: How we style our SQL +id: 2-how-we-style-our-sql +--- + +## Basics + +- ☁️ Use [SQLFluff](https://sqlfluff.com/) to maintain these style rules automatically. + - Reference this [SQLFluff config file](https://github.com/dbt-labs/jaffle-shop-template/blob/main/.sqlfluff) for the rules we use. +- 👻 Use Jinja comments (`{# #}`) for comments that should not be included in the compiled SQL. +- ⏭️ Use trailing commas. 
+- 4️⃣ Indents should be four spaces. +- 📏 Lines of SQL should be no longer than 80 characters. +- ⬇️ Field names, keywords, and function names should all be lowercase. +- 🫧 The `as` keyword should be used explicitly when aliasing a field or table. + +:::info +☁️ dbt Cloud users can use the built-in [SQLFluff Cloud IDE integration](https://docs.getdbt.com/docs/cloud/dbt-cloud-ide/lint-format) to automatically lint and format their SQL. The default style sheet is based on dbt Labs style as outlined in this guide, but you can customize this to fit your needs. No need to set up any external tools, just hit `Lint`! The more opinionated [sqlfmt](http://sqlfmt.com/) formatter is also available if you prefer that style. +::: + +## Fields, aggregations, and grouping + +- 🔙 Fields should be stated before aggregates and window functions. +- 🤏🏻 Aggregations should be executed as early as possible (on the smallest data set possible) before joining to another table to improve performance. +- 🔢 Ordering and grouping by a number (e.g. `group by 1, 2`) is preferred over listing the column names (see [this classic rant](https://blog.getdbt.com/write-better-sql-a-defense-of-group-by-1/) for why). Note that if you are grouping by more than a few columns, it may be worth revisiting your model design. + +## Joins + +- 👭🏻 Prefer `union all` to `union` unless you explicitly want to remove duplicates. +- 👭🏻 If joining two or more tables, _always_ prefix your column names with the table name. If only selecting from one table, prefixes are not needed. +- 👭🏻 Be explicit about your join type (i.e. write `inner join` instead of `join`). +- 🥸 Avoid table aliases in join conditions (especially initialisms): it's harder to understand what the table called "c" is than what "customers" is. +- ➡️ Always move left to right to make joins easy to reason about. A `right join` often indicates that you should change which table you select `from` and which one you `join` to. 
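+
+The field, aggregation, and join rules above can be sketched roughly as follows. This is only an illustration: the `customers` and `orders` models and all of their columns are hypothetical names assumed for the example, not part of any real project.
+
+```sql
+with
+
+customers as (
+
+    select
+        customer_id,
+        customer_name
+
+    from {{ ref('customers') }}
+
+),
+
+orders as (
+
+    select
+        order_id,
+        customer_id,
+        order_total
+
+    from {{ ref('orders') }}
+
+),
+
+-- aggregate early, on the smallest data set, before joining
+customer_orders as (
+
+    select
+        customer_id,
+        count(order_id) as count_orders,
+        sum(order_total) as total_order_value
+
+    from orders
+
+    group by 1
+
+),
+
+-- explicit join type, full table names as prefixes, left to right
+joined as (
+
+    select
+        customers.customer_id,
+        customers.customer_name,
+        customer_orders.count_orders,
+        customer_orders.total_order_value
+
+    from customers
+
+    left join customer_orders
+        on customers.customer_id = customer_orders.customer_id
+
+)
+
+select * from joined
+```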
+ +## 'Import' CTEs + +- 🔝 All `{{ ref('...') }}` statements should be placed in CTEs at the top of the file. +- 📦 'Import' CTEs should be named after the table they are referencing. +- 🤏🏻 Limit the data scanned by CTEs as much as possible. Where possible, only select the columns you're actually using and use `where` clauses to filter out unneeded data. +- For example: + +```sql +with + +orders as ( + + select + order_id, + customer_id, + order_total, + order_date + + from {{ ref('orders') }} + + where order_date >= '2020-01-01' + +) +``` + +## 'Functional' CTEs + +- ☝🏻 Where performance permits, CTEs should perform a single, logical unit of work. +- 📖 CTE names should be as verbose as needed to convey what they do e.g. `events_joined_to_users` instead of `user_events` (this could be a good model name, but does not describe a specific function or transformation). +- 🌉 CTEs that are duplicated across models should be pulled out into their own intermediate models. Look out for chunks of repeated logic that should be refactored into their own model. +- 🔚 The last line of a model should be a `select *` from your final output CTE. This makes it easy to materialize and audit the output from different steps in the model as you're developing it. You just change the CTE referenced in the `select` statement to see the output from that step. + +## Model configuration + +- 📝 Model-specific attributes (like sort/dist keys) should be specified in the model. +- 📂 If a particular configuration applies to all models in a directory, it should be specified in the `dbt_project.yml` file. +- 👓 In-model configurations should be specified like this for maximum readability: + +```sql +{{ + config( + materialized = 'table', + sort = 'id', + dist = 'id' + ) +}} +``` + +## Example SQL + +```sql +with + +events as ( + + ... + +), + +{# CTE comments go here #} +filtered_events as ( + + ... 
+ +) + +select * from filtered_events +``` + +### Example SQL + +```sql +with + +my_data as ( + + select + id, + field_1, + field_2, + field_3, + cancellation_date, + expiration_date, + start_date + + from {{ ref('my_data') }} + +), + +some_cte as ( + + select + id, + field_4, + field_5 + + from {{ ref('some_cte') }} + +), + +some_cte_agg as ( + + select + id, + sum(field_4) as total_field_4, + max(field_5) as max_field_5 + + from some_cte + + group by 1 + +), + +joined as ( + + select + my_data.field_1, + my_data.field_2, + my_data.field_3, + + -- use line breaks to visually separate calculations into blocks + case + when my_data.cancellation_date is null + and my_data.expiration_date is not null + then my_data.expiration_date + when my_data.cancellation_date is null + then my_data.start_date + 7 + else my_data.cancellation_date + end as cancellation_date, + + some_cte_agg.total_field_4, + some_cte_agg.max_field_5 + + from my_data + + left join some_cte_agg + on my_data.id = some_cte_agg.id + + where my_data.field_1 = 'abc' and + ( + my_data.field_2 = 'def' or + my_data.field_2 = 'ghi' + ) + +) + +select * from joined +``` diff --git a/website/docs/guides/best-practices/how-we-style/3-how-we-style-our-python.md b/website/docs/guides/best-practices/how-we-style/3-how-we-style-our-python.md new file mode 100644 index 00000000000..5443abf302d --- /dev/null +++ b/website/docs/guides/best-practices/how-we-style/3-how-we-style-our-python.md @@ -0,0 +1,44 @@ +--- +title: How we style our Python +id: 3-how-we-style-our-python +--- + +## Python tooling + +- 🐍 Python has a more mature and robust ecosystem for formatting and linting (helped by the fact that it doesn't have a million distinct dialects). We recommend using those tools to format and lint your code in the style you prefer. 
+ +- 🛠️ Our current recommendations are: + + - [black](https://pypi.org/project/black/) formatter + - [ruff](https://pypi.org/project/ruff/) linter + + :::info + ☁️ dbt Cloud comes with the [black formatter built-in](https://docs.getdbt.com/docs/cloud/dbt-cloud-ide/lint-format) to automatically format your Python code. You don't need to download or configure anything; just click `Format` in a Python model and you're good to go! + ::: + +## Example Python + +```python +import pandas as pd + + +def model(dbt, session): + # set length of time considered a churn + pd.Timedelta(days=2) + + dbt.config(enabled=False, materialized="table", packages=["pandas==1.5.2"]) + + orders_relation = dbt.ref("stg_orders") + + # converting a DuckDB Python Relation into a pandas DataFrame + orders_df = orders_relation.df() + + orders_df.sort_values(by="ordered_at", inplace=True) + orders_df["previous_order_at"] = orders_df.groupby("customer_id")[ + "ordered_at" + ].shift(1) + orders_df["next_order_at"] = orders_df.groupby("customer_id")["ordered_at"].shift( + -1 + ) + return orders_df +``` diff --git a/website/docs/guides/best-practices/how-we-style/4-how-we-style-our-jinja.md b/website/docs/guides/best-practices/how-we-style/4-how-we-style-our-jinja.md new file mode 100644 index 00000000000..3a969d2bdd3 --- /dev/null +++ b/website/docs/guides/best-practices/how-we-style/4-how-we-style-our-jinja.md @@ -0,0 +1,37 @@ +--- +title: How we style our Jinja +id: 4-how-we-style-our-jinja +--- + +## Jinja style guide + +- 🫧 When using Jinja delimiters, use spaces on the inside of your delimiter, like `{{ this }}` instead of `{{this}}`. +- 🆕 Use newlines to visually indicate logical blocks of Jinja. +- 4️⃣ Indent 4 spaces into a Jinja block to indicate visually that the code inside is wrapped by that block. +- ❌ Don't worry (too much) about Jinja whitespace control; focus on your project code being readable. 
The time you save by not worrying about whitespace control will far outweigh the time you spend in your compiled code where it might not be perfect. + +## Examples of Jinja style + +```jinja +{% macro make_cool(uncool_id) %} + + do_cool_thing({{ uncool_id }}) + +{% endmacro %} +``` + +```sql +select + entity_id, + entity_type, + {% if this %} + + {{ that }}, + + {% else %} + + {{ the_other_thing }}, + + {% endif %} + {{ make_cool('uncool_id') }} as cool_id +``` diff --git a/website/docs/guides/best-practices/how-we-style/5-how-we-style-our-yaml.md b/website/docs/guides/best-practices/how-we-style/5-how-we-style-our-yaml.md new file mode 100644 index 00000000000..323ed3ac11d --- /dev/null +++ b/website/docs/guides/best-practices/how-we-style/5-how-we-style-our-yaml.md @@ -0,0 +1,44 @@ +--- +title: How we style our YAML +id: 5-how-we-style-our-yaml +--- + +## YAML Style Guide + +- 2️⃣ Indents should be two spaces. +- ➡️ List items should be indented. +- 🆕 Use a new line to separate list items that are dictionaries where appropriate. +- 📏 Lines of YAML should be no longer than 80 characters. +- 🛠️ Use the [dbt JSON schema](https://github.com/dbt-labs/dbt-jsonschema) with any compatible IDE and a YAML formatter (we recommend [Prettier](https://prettier.io/)) to validate your YAML files and format them automatically. + +:::info +☁️ As with Python and SQL, the dbt Cloud IDE comes with built-in formatting for YAML files (Markdown and JSON too!), via Prettier. Just click the `Format` button and you're in perfect style. As with the other tools, you can [also customize the formatting rules](https://docs.getdbt.com/docs/cloud/dbt-cloud-ide/lint-format#format-yaml-markdown-json) to your liking to fit your company's style guide. 
+::: + +### Example YAML + +```yaml +version: 2 + +models: + - name: events + columns: + - name: event_id + description: This is a unique identifier for the event + tests: + - unique + - not_null + + - name: event_time + description: "When the event occurred in UTC (eg. 2018-01-01 12:00:00)" + tests: + - not_null + + - name: user_id + description: The ID of the user who recorded the event + tests: + - not_null + - relationships: + to: ref('users') + field: id +``` diff --git a/website/docs/guides/best-practices/how-we-style/6-how-we-style-conclusion.md b/website/docs/guides/best-practices/how-we-style/6-how-we-style-conclusion.md new file mode 100644 index 00000000000..22f8e36190a --- /dev/null +++ b/website/docs/guides/best-practices/how-we-style/6-how-we-style-conclusion.md @@ -0,0 +1,12 @@ +--- +title: Now it's your turn +id: 6-how-we-style-conclusion +--- + +## BYO Styles + +Now that you've seen how we style our dbt projects, it's time to build your own. Feel free to copy this guide and use it as a template for your own project. If you do, we'd love to hear about it! Reach out to us on [the Community Forum](https://discourse.getdbt.com/c/show-and-tell/22) or [Slack](https://www.getdbt.com/community) to share your style guide. We recommend co-locating your style guide with your code to make sure contributors can easily follow it. If you're using GitHub, you can add your style guide to your repository's wiki, or include it in your README. + +## Pre-commit hooks + +Lastly, to ensure your style guide's automated rules are being followed without additional mental overhead to your team, you can use [pre-commit hooks](https://pre-commit.com/) to automatically check your code for style violations (and often fix them automagically) before it's committed. This is a great way to make sure your style guide is followed by all contributors. We recommend implementing this once you've settled on and published your style guide, and your codebase is conforming to it. 
This will ensure that all future commits follow the style guide. You can find an excellent set of open source pre-commit hooks for dbt from the community [here in the dbt-checkpoint project](https://github.com/dbt-checkpoint/dbt-checkpoint). diff --git a/website/sidebars.js b/website/sidebars.js index b37f3562397..3198d95e0f3 100644 --- a/website/sidebars.js +++ b/website/sidebars.js @@ -849,6 +849,22 @@ const sidebarSettings = { "guides/best-practices/how-we-structure/5-the-rest-of-the-project", ], }, + { + type: "category", + label: "How we style our dbt projects", + link: { + type: "doc", + id: "guides/best-practices/how-we-style/0-how-we-style-our-dbt-projects", + }, + items: [ + "guides/best-practices/how-we-style/1-how-we-style-our-dbt-models", + "guides/best-practices/how-we-style/2-how-we-style-our-sql", + "guides/best-practices/how-we-style/3-how-we-style-our-python", + "guides/best-practices/how-we-style/4-how-we-style-our-jinja", + "guides/best-practices/how-we-style/5-how-we-style-our-yaml", + "guides/best-practices/how-we-style/6-how-we-style-conclusion", + ], + }, { type: "category", label: "Materializations best practices",