
Support PK and FK declarations for ER diagram and modelling purposes #3295

Closed

JC-Lightfold opened this issue Apr 26, 2021 · 11 comments

Labels: enhancement (New feature or request), stale (Issues that have gone stale)


JC-Lightfold commented Apr 26, 2021

DBT needs to be able to play nice with data model diagrams!

If you point any diagramming tool at a schema that DBT produces, it won't be able to draw connections between tables, because those tools all rely on the CONSTRAINT DDL syntax, regardless of whether or not it's enforced (this is why Snowflake has the syntax but doesn't enforce it). Wouldn't it be great to simply be able to declare PK and FK constraints in the model YML, which seems the EXACT place to do it?!

You can currently attempt to declare the necessary DDL with post-hooks and macros, but it's a lot of work for each and every column and table.
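
For illustration, a minimal sketch of that workaround as an inline post-hook (the dim_customer model and the pk_dim_customer constraint name are hypothetical):

-- models/dim_customer.sql
{{ config(
    materialized = 'table',
    post_hook = "alter table {{ this }} add constraint pk_dim_customer primary key (customer_id)"
) }}

select 1 as customer_id

Repeating this for every column of every table is exactly the toil described above.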

Being able to diagram the structures created and maintained by DBT with more than just the lineage of the DAG generator is critical to meeting many enterprise governance requirements, since those teams are used to communicating logical models to stakeholders with ERDs.

Likewise, given the HUGE potential for DBT to basically become the "Universal Data Transformer" for MPP systems like Snowflake, it seems overly limited to expect that it will only ever define transient/ephemeral tables that are slices of a DAG on their way to a final analytics transformation. Don't we want to use DBT to build data vaults? Star schemas? 3NF models of any kind?

If the answer is yes, then people will expect to be able to understand the relationship model of the schemas DBT is persisting. That means diagrams using protocols like IDEF1X or crow's feet.

Recommended implementation patterns:

  1. One option is to simply annotate PK/FK status and linkages in the COMMENT field that all databases support (especially Snowflake). This will still require custom parsing solutions to convert into reliable diagrams, but it's a start

  2. I think the preferred implementation is to let the column declarations in the model YAML specify PK and FK CONSTRAINTs and pass those through to the compiled DDL (a hypothetical sketch follows this list)
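
For illustration only, option 2 might look something like this in the model YAML (the constraint property and its pass-through to DDL are invented here, not an existing dbt feature):

version: 2

models:
  - name: orders
    columns:
      - name: order_id
        constraint: primary key   # hypothetical property, compiled into the DDL
      - name: customer_id
        constraint: foreign key references customers (customer_id)   # hypothetical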

This issue is raised specifically with regard to how it can support standard model diagrams using ER schema notation, but it connects well to the following issues:

#2936
#2191
#1570

JC-Lightfold added the enhancement and triage labels Apr 26, 2021
jtcohen6 removed the triage label May 6, 2021

jtcohen6 commented May 6, 2021

@JC-Lightfold Thanks so much for opening this issue, and for the really thoughtful comments in the original Slack thread.

You raise a really good question: even though table constraints don't do anything in most modern data warehouses (or worse, they're sometimes applied but not validated!), should dbt support table constraints as a way to serve older schema-diagramming software?

I do think the answer here could take one of two forms:

  1. "Strict" schema validation that includes a column list and constraints as part of the CTAS (create table as select) statement in databases that support it (Snowflake, BigQuery). This is the idea proposed in the issue you linked; models would fail if there's a mismatch between the columns specified in .yml and the columns produced by the query.
  2. Post hoc addition of constraints. This is totally doable today, with a little bit of custom work.
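
To make option 1 concrete, the compiled DDL could look roughly like this (a sketch only; exact CTAS-with-constraints syntax varies by warehouse):

create table my_model (
    column_a integer,
    primary key (column_a)
) as (
    select 1 as column_a
);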

> Wouldn't it be great to simply be able to declare PK and FK constraints in the model YML, which seems the EXACT place to do it?!

> You can currently attempt to declare the necessary DDL with post-hooks and macros, but it's a lot of work for each and every column and table.

I think you can have it both ways!

Consider an approach where I define a macro, macros/add_constraints.sql (using PostgreSQL):

{% macro add_constraints() %}
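
    {# Loop over the columns declared in schema.yml; for each column whose meta
       declares a constraint, build one "alter table ... add constraint" statement,
       then return the statements joined by semicolons #}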

    {% set constraint_sql_list = [] %}
    {% set column_dict = model.columns %}
    
    {% for column_name in column_dict %}
    
        {% set quoted_name = adapter.quote(column_name) if column_dict[column_name]['quote'] else column_name %}
        {% set constraint = column_dict[column_name]['meta']['constraint'] %}
        
        {% if constraint %}
        
            {% set constraint_name = 
                this.identifier + '_' + constraint|replace(' ','_') + '_' + column_name %}
        
            {% set constraint_sql %}
            alter table {{ this }} add constraint {{ constraint_name }}
                {{ constraint }} ({{ quoted_name }})
            {% endset %}
            
            {% do constraint_sql_list.append(constraint_sql) %}
            
        {% endif %}
        
    {% endfor %}
    
    {% do return(constraint_sql_list | join(';\n')) %}
        
{% endmacro %}

Then I have a model, models/my_model.sql:

{{ config(
    materialized = 'table',
    post_hook = "{{ add_constraints() }}"
) }}

select 1 as column_a

I define properties for it in models/schema.yml:

version: 2

models:
  - name: my_model
    columns:
      - name: column_a
        meta:
          constraint: primary key

And voilà, dbt runs:

create  table "jerco"."dbt_jcohen"."my_model__dbt_tmp" as (select 1 as column_a);
alter table "jerco"."dbt_jcohen"."my_model" rename to "my_model__dbt_backup"
alter table "jerco"."dbt_jcohen"."my_model__dbt_tmp" rename to "my_model"
alter table "jerco"."dbt_jcohen"."my_model" add constraint my_model_primary_key_column_a
                primary key (column_a);
commit;
drop table if exists "jerco"."dbt_jcohen"."my_model__dbt_backup" cascade;

The meta property lets you define whatever values you want, wherever you want, entirely for your own use. If it makes more sense to define those constraints as meta properties of the model, rather than of the individual column, that can also work; a hypothetical sketch follows.
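
For example, a model-level variant could look like this (the constraints shape under meta is invented for illustration, and the add_constraints macro above would need adjusting to read model-level meta instead of column meta):

version: 2

models:
  - name: my_model
    meta:
      constraints:                # hypothetical structure
        - type: primary key
          columns: [column_a]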

The downside of this approach is that you still need to specify post_hook = "{{ add_constraints() }}" in each individual model. You could also look into:

  • Reimplementing the 'table' + 'incremental' materializations to add the macro calls there
  • Reimplementing the persist_docs macro, which already loops over model columns to add descriptions as comments, to also include a step that adds constraints. As long as you have the persist_docs config turned on for columns, it would "just work." (This is how dbt-bigquery supports column-level policy tags.)
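
As a lighter-weight alternative to reimplementing either of those, the hook can be set once at the project level in dbt_project.yml, so individual models don't each need the config (a sketch assuming the add_constraints macro above and a project named my_project):

# dbt_project.yml
models:
  my_project:
    +post-hook: "{{ add_constraints() }}"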

I'm curious to hear what you think! If any of these approaches feels especially promising, we could think about adding it into dbt proper.


jtcohen6 commented May 6, 2021

Of course, there's another question at play here: Should dbt be able to create its own ERDs?

It's totally possible from a conceptual perspective, and there's been a lot of discussion on it (dbt-labs/dbt-docs#84, discourse). These could be generated from unique + relationships tests (or from user-supplied meta properties, or new properties entirely). Personally, I'd like to see a version that:

  • Leverages dbt artifacts: manifest.json includes test definitions + properties, catalog.json includes database representations and column names
  • Creates one ERD per exposure, including the relationships between sources, models, snapshots, and seeds "exposed" there. I'm skeptical of the value of an ERD that includes hundreds of models, many in staging/intermediate layers; I feel like the exposure has a lot of potential for defining which dbt resources are user-facing, queryable, and best represented in the diagram.
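
For reference, the relationships test that could power such an ERD is already standard dbt syntax; a test like this encodes exactly the FK-style edge between two models (orders and customers are hypothetical names):

version: 2

models:
  - name: orders
    columns:
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id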


jmriego commented Jul 19, 2021

@jtcohen6 Totally agree that one ERD with hundreds of models wouldn't do much. In our case we keep the most important models in one folder, but it might be difficult to know what most people would do.
In my opinion, a good solution might be to make ERD creation configurable: something like an ERD generation command that accepts selection flags. If it's for the docs, it probably makes sense to make it configurable there in some way as well.

@blake-enyart

@JC-Lightfold, just curious if there has been any more traction on this? I'm super interested in this functionality.

Thank you @jtcohen6 for the solution above! Will be using that shortly on some customer projects.

@github-actions

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

github-actions bot added, then removed, the stale label Feb 25, 2022

github-actions bot commented Sep 4, 2022

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

github-actions bot added the stale label Sep 4, 2022
@blake-enyart

I definitely still think this issue is relevant for documenting data architecture.

lostmygithubaccount removed the stale label Sep 8, 2022

MazrimT commented Jan 18, 2023

Not being able to set primary and foreign keys is basically what's keeping us from using dbt. We have requirements to produce this kind of documentation and keep it up to date, and no one wants to maintain it by hand, off to the side, where it won't be updated automatically.
In our case it's only for documentation for the users of the data, not to actually constrain the tables.

@jtcohen6

@MazrimT There's in-progress work to make this possible! (#6271)

@github-actions

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions bot added the stale label Jul 18, 2023
@github-actions

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
