
strict field validation for schema.yml #1570

Closed
jwerderits opened this issue Jun 25, 2019 · 26 comments
Labels
stale Issues that have gone stale

Comments

@jwerderits
Contributor

Issue: field validation for schema.yml

Issue description

Invalid/nonexistent field names can have descriptions in schema.yml that are populated in the documentation.
If a field is deleted, but was previously documented (correctly), the docs will indicate that that field still exists.

Fix

Apply column validation for fields that don't have tests applied to them

Steps to reproduce

You can create any column name and add a description in schema.yml and it will be populated in the documentation

    columns:
        - name: deleted_column_name
          description: this field has been deleted
@drewbanin
Contributor

Hey @jwerderits - thanks for making this issue! How do you think dbt should handle this in practice? Do you think a dbt run should fail? Or dbt docs generate? Or something else?

@jwerderits
Contributor Author

@drewbanin I think it makes the most sense for a dbt test to fail on this condition, since dbt test is already involved with schema validation. Encapsulating the logic there surfaces a warning that there are errors within the project without causing dbt run or dbt docs generate to fail. It might also be useful during dbt docs generate to help find where the erroneous field(s) exist.

@drewbanin drewbanin changed the title from "field validation for schema.yml" to "strict field validation for schema.yml" on Jul 1, 2019
@jack-arthurton-cko

This would be really useful! Would it be possible to extend the validation to check for the opposite scenario: columns which exist in the model but don't have an entry in the schema.yml file?

@drewbanin
Contributor

@JackArthurton yeah! I think that's a great idea. I'm imagining that this would be opt-in, so you could annotate a schema.yml specification with strict: true (or similar), which would throw an error if the schema specification and the model are mismatched.
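Purely as an illustration of that proposal (the strict key below is hypothetical, not an existing dbt config), the annotation might look like:

models:
  - name: my_model
    strict: true  # hypothetical: error if the model's columns and the schema.yml columns mismatch
    columns:
      - name: column_1
        description: the only column this model should have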

@mattm

mattm commented Jul 30, 2019

Drew pointed me to this issue when I asked about this on Slack.

My question was:

Is there a way to configure tests to identify columns that are in the results but not covered by a schema test?

Love the strict: true idea - that seems like it would do the trick.

@stumelius

Any progress regarding field validation?

I'd like to be able to define the schema like this:

models:
  - name: my_model
    columns:
      - name: column_1
        tests:
          - exists

@drewbanin
Contributor

hey @smomni - we haven't prioritized this one yet! If you're in a pinch, you can actually define your own custom schema test in your dbt project: https://docs.getdbt.com/docs/custom-schema-tests

If you create a macro in your macros/ directory like this:

{% macro test_exists(model, column_name) %}

    select count({{ column_name }})
    from {{ model }}
    where 1=0
    limit 1

{% endmacro %}

Then you'll be able to assert that columns exist with:

models:
  - name: my_model
    columns:
      - name: column_1
        tests:
          - exists

I think the version of this that we add to dbt natively will be a little bit smarter than this. We can just run a single query to find all of the columns in a table, then check them against the columns in the schema file. The schema test I shared above will run one query per column, which probably isn't optimal.
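As a rough sketch of that single-query idea in today's test-block syntax (adapter.get_columns_in_relation is a standard dbt adapter method; the test name and everything else here is illustrative, not an existing dbt feature):

{% test documented_columns_exist(model) %}
    {% if execute %}
        {# One introspection call to fetch the relation's actual column names #}
        {% set actual_columns = adapter.get_columns_in_relation(model) | map(attribute="name") | map("lower") | list %}

        {# Collect the documented columns for this model from the manifest graph #}
        {% set missing = [] %}
        {% for node in graph.nodes.values() | selectattr("resource_type", "equalto", "model") | selectattr("name", "equalto", model.identifier) %}
            {% for column_name in node.columns %}
                {% if column_name | lower not in actual_columns %}
                    {% do missing.append(column_name) %}
                {% endif %}
            {% endfor %}
        {% endfor %}

        {# One row per documented-but-missing column; any returned row fails the test #}
        {% if missing | count > 0 %}
            {% for column_name in missing %}
            select '{{ column_name }}' as missing_column
            {% if not loop.last %}union all{% endif %}
            {% endfor %}
        {% else %}
            select 'all documented columns exist' as missing_column limit 0
        {% endif %}
    {% endif %}
{% endtest %}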

@mjhoshea

Hey guys, this looks like it could be a really useful idea. For some context: we want to create a self-service ELT pipeline whereby people can create their own datasets from upstream sources. To mitigate the risk of undocumented datasets entering the warehouse, it would be nice to require people to document the newly created tables and views with descriptions.

The ideal usage would be a project-level flag that makes descriptions mandatory in schema files. This would then be used as part of CI/CD to stop undocumented schemas getting to production.
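As far as I know there is no such project-level flag built into dbt, but as a rough sketch of the CI/CD angle (the details below are illustrative), a singular test saved under the project's tests/ directory could walk the graph and fail whenever a model lacks a description:

{% set undocumented = [] %}
{% if execute %}
    {% for node in graph.nodes.values() | selectattr("resource_type", "equalto", "model") %}
        {% if not node.description %}
            {% do undocumented.append(node.name) %}
        {% endif %}
    {% endfor %}
{% endif %}

{# One row per undocumented model; any returned row makes dbt test fail #}
{% if undocumented | count > 0 %}
{% for model_name in undocumented %}
select '{{ model_name }}' as undocumented_model
{% if not loop.last %}union all{% endif %}
{% endfor %}
{% else %}
select 'all models documented' as undocumented_model limit 0
{% endif %}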

@ArafathC

ArafathC commented May 6, 2020

(quotes @drewbanin's earlier reply with the custom test_exists macro and example schema.yml in full)

This approach doesn't fail if the column is not present in the table.

@drewbanin
Contributor

@ArafathC are you sure about that? The query should fail because the specified column does not exist in the table. I'm curious: could you elaborate on what you mean?

@ArafathC

ArafathC commented May 8, 2020

@drewbanin Sorry about that. I was able to double-check and confirm it works.

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale label (Issues that have gone stale) on Dec 21, 2021
@alexrosenfeld10
Contributor

alexrosenfeld10 commented Feb 20, 2022

Hey @drewbanin any activity here? I would love to make use of such a feature.

Internally, I wrote an app that does all this kind of checking on our existing reporting warehouse, which doesn't use dbt. I'm migrating it over to use dbt, and this is definitely a gap we're seeing. Thanks in advance for any updates / help

@alexrosenfeld10
Contributor

Also, as a side note, I do think there's an error in the macro posted above. It needs to be limit 0, because otherwise it will return one row with a value of 0, and the fact that it returns a row means the test will fail.
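For reference, the corrected macro would then read (identical to the version above, with only the limit changed):

{% macro test_exists(model, column_name) %}

    select count({{ column_name }})
    from {{ model }}
    where 1=0
    limit 0

{% endmacro %}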

@alexrosenfeld10
Contributor

Just found this: https://github.com/calogica/dbt-expectations/tree/0.5.1/#table-shape it's pretty awesome. Gets to most of what I need here. 👍
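For anyone else landing here, the table-shape tests in that package can be attached in schema.yml roughly like this (exact test names and arguments should be double-checked against the dbt-expectations docs for your version; model and column names are placeholders):

models:
  - name: my_model
    tests:
      - dbt_expectations.expect_table_columns_to_contain_set:
          column_list: ["column_1", "column_2"]
    columns:
      - name: column_1
        tests:
          - dbt_expectations.expect_column_to_exist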

@benhinchley

This is what I've been using to check if all the columns defined exist.

{% test columns_exist(model) %}
    {% if execute %}
        {% set models = [] %}
        {% for node in graph.nodes.values() | selectattr("resource_type", "equalto", "model") | selectattr("name", "equalto", model.identifier) %}
            {% do models.append(node) %}
        {% endfor %}

        {% if models|count != 1 %}
            {{ exceptions.raise_compiler_error(model.identifier ~ " not found in graph") }}
        {% endif %}

        {% set test_model = models|first %}

        with column_check as (
            select
            {% for column_name in test_model.columns %}
                {{ column_name }},
            {% endfor %}
            from {{ model }}
            where 1=0
            limit 1
        )

        select * from column_check
    {% endif %}
{% endtest %}
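A quick usage sketch (model and column names are placeholders): register it as a model-level test so it checks every column documented for that model:

models:
  - name: my_model
    tests:
      - columns_exist
    columns:
      - name: column_1
        description: documented column that must exist in the relation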

@yummydum

yummydum commented Apr 9, 2022

I think this feature is very useful as well. It would be very nice if I could enforce the following:

  • Every column in the yaml file exists in the database
  • All columns in the database are in the yaml file
  • There is a description on the column

Are there plans to prioritize this one @drewbanin? It seems like dbt-expectations does not cover this use case.

@schylarbrock

schylarbrock commented May 16, 2022

@benhinchley

Thank you for the help! One quick thing: I think you will have an error from the trailing comma on the last column. I believe it should be updated to (edit on line 17):

{% test columns_exist(model) %}
    {% if execute %}
        {% set models = [] %}
        {% for node in graph.nodes.values() | selectattr("resource_type", "equalto", "model") | selectattr("name", "equalto", model.identifier) %}
            {% do models.append(node) %}
        {% endfor %}

        {% if models|count != 1 %}
            {{ exceptions.raise_compiler_error(model.identifier ~ " not found in graph") }}
        {% endif %}

        {% set test_model = models|first %}

        with column_check as (
            select
            {% for column_name in test_model.columns %}
                {{ column_name }} {% if not loop.last %},{% endif %}
            {% endfor %}
            from {{ model }}
            where 1=0
            limit 1
        )

        select * from column_check
    {% endif %}
{% endtest %}

@benhinchley

(quotes @schylarbrock's trailing-comma correction above in full)

Ahh yep, I'm predominantly using BigQuery these days, where trailing commas in the select statement are allowed.

@elyobo

elyobo commented Jun 22, 2022

Agree with all of @yummydum's suggestions, with the aim that the schema be complete (all columns present and described) and accurate (at least in that it doesn't document columns that do not exist). It looks like @benhinchley's macro ensures the latter part of this (documented columns must exist) but there doesn't seem to be a solution for the other requirements in here.

Is it worth reopening the issue (seems to have been closed by a bot, not closed as fixed or will not fix)?

@Klimmy

Klimmy commented Apr 7, 2023

@benhinchley, @schylarbrock, thank you for the solution with the test code. Very handy!

I've adjusted it to be a singular one (not a generic one). So with this version, we don't need to specify a test for every model in schema.yml.

{% set models = [] %}

{% for node in graph.nodes.values() | selectattr("resource_type", "equalto", "model") %}
    {% if node.columns|count > 0 %}
        {% do models.append(node) %}
    {% endif %}
{% endfor %}

{% for model in models %}
    SELECT
    NULL AS placeholder
    FROM {{ model.database ~ "." ~ model.schema ~ "." ~ model.name }}
    WHERE 1=0
    {% for column_name in model.columns %}
        AND {{ column_name }} IS NULL
    {% endfor %}
    {% if not loop.last %}UNION ALL{% endif %}
{% endfor %}
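(As a usage note for that singular variant: it would live as a single .sql file under the project's tests/ directory, e.g. tests/assert_documented_columns_exist.sql as a placeholder name, and it runs with a plain dbt test.)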

@aadarshsingh191198

So, to summarise there can be four possible cases:

  1. Column present in schema, not present in the model
  2. Column present in the model, not present in the schema
  3. Model present but not listed in the schema
  4. Model listed in the schema but not present

Using the test mentioned by @Klimmy,
a. we can successfully test for 1. The test failed when the schema had columns that the model didn't have.
b. we can't test for 2. The test passed when the schema didn't have columns that the model had.
c. we can't test for 3. The test passed when the schema didn't list a model that the pipeline has.
d. we can't test for 4. The test passed when the schema listed a model which the pipeline didn't have.

Now, the interesting part:

d. dbt throws a warning during the first compilation/run for case 4.
c. {% if node.columns|count > 0 %} is true for case 3, so we can throw an exception there, and if the model is not listed in the schema, the test will fail.

This way, I am able to cover 3 out of 4 cases. Please correct me if there is any concern with my approach.

One doubt: in the snippet by @Klimmy, why do we need

{% if node.columns|count > 0 %}
    {% do models.append(node) %}
{% endif %}

Why can't we just do {% do models.append(node) %} without checking the non-zero columns condition?

@Infiniverse

Just coming to dbt and hitting this issue. It doesn't sound too hard to resolve. Why has it not been added to dbt Core yet?

@ryanb8

ryanb8 commented Jun 27, 2024

You can also use the contract enforced property to have the model fail at runtime!

https://docs.getdbt.com/reference/resource-configs/contract

schema.yml snippet:

    config:
      contract:
        enforced: true
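A slightly fuller sketch of how that sits in schema.yml (model and column names are placeholders): with the contract enforced, every column must be declared with a data_type, and dbt run fails on any mismatch between the contract and what the model actually builds:

models:
  - name: my_model
    config:
      contract:
        enforced: true
    columns:
      - name: id
        data_type: int
      - name: created_at
        data_type: timestamp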

@elycenok-wowcorp

elycenok-wowcorp commented Jul 19, 2024

I've tried contract enforced:

      - name: random_name
        data_type: string     

The test doesn't fail.

@elycenok-wowcorp

elycenok-wowcorp commented Jul 20, 2024

OK, problem solved:

  1. Realised the config was overridden in the .sql file, so adding the contract there did the trick:
{{
    config(
        materialized="table",
        alias="....",
        contract={ "enforced": True }
    )
}}
  2. After that, dbt run started to fail due to the missing columns in the YAML file. Generated the missing columns in YAML using the dbt-codegen package (see the command sketch below). Also found and raised a documentation issue: On hub.getdbt.com you've got an error in arguments of generate_model_yaml (model_names not model_name) dbt-codegen#176
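For reference, the dbt-codegen step mentioned above is roughly this invocation (with a placeholder model name; note the plural model_names argument, which is what the linked issue is about):

dbt run-operation generate_model_yaml --args '{"model_names": ["my_model"]}'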
