lf/issue-49 compare all columns macro for testing #50

Merged
106 commits
02e725d
create a macro, test__compare_all_columns, which can be used in a cus…
Jul 14, 2022
4c60d83
cant get my local project to recognize the new macro
Jul 14, 2022
bb9862e
Merge branch 'lf/compare-all-columns' into lf/issue-49--compare_all_c…
Jul 14, 2022
93c447b
rename macro to compare_all_columns
Jul 15, 2022
9e77cbb
fix jinja syntax, namely curly braces
Jul 15, 2022
ab84ac3
add to readme
Jul 15, 2022
2b02847
small update to readme
Jul 15, 2022
8fe9a54
integration test compare_all_columns
Jul 18, 2022
3f0a8d4
add conflict seed, remove logging stuff from macro
Jul 18, 2022
92d3bd8
add additional seeds and rename seeds
Jul 18, 2022
d74f587
update readme to reflect that compare_all_columns doesn't write resul…
Jul 18, 2022
ee91af8
add exclude_columns optional arg to compare_all_columns
Jul 18, 2022
c9cbd4f
create separate pop_columns macro and use it in compare_all_columns
Jul 18, 2022
9bc5578
exclude argument not being recognized as i would expect, wip
Jul 18, 2022
394cff9
tidy up refactoring of pop_columns
Jul 19, 2022
bfd4062
add placeholder for new macros in integration_tests/models/schema.yml
Jul 19, 2022
8d071cb
add args for direct_conflict_only and exclude_recent_hours to remove…
Jul 19, 2022
0f03b05
update readme with additional args
Jul 20, 2022
965df11
readme formatting
Jul 20, 2022
e524a3f
Merge branch 'main' into lf/issue-49--compare_all_columns_macro_for_t…
Jul 26, 2022
80bc6ce
remove pop_columns and use get_filtered_columns_in_relation instead
Jul 26, 2022
40db5ff
adjust implementation of get_filtered_columns_in_relation to work for…
Jul 26, 2022
5c4c288
update readme, create compare_column_values_count to support compare_…
Jul 27, 2022
4560b5c
switch count approach to verbose and let the user decide how they wan…
Jul 27, 2022
361fc5f
compare_column_values_verbose creates a tall table with one row per p…
Jul 27, 2022
34802ba
spruce up readme
Jul 27, 2022
478dfce
fix whitespace in dbt_project.yml
Jul 27, 2022
82a5f52
remove pop_columns from schema
Jul 27, 2022
d8ba98a
use Relation instead of {{prod_schema}}.{{model_name}}, which joel wa…
Aug 3, 2022
ae490d4
update readme
Aug 3, 2022
f719ee7
give get_filtered_columns_in_relation a string argument
Aug 9, 2022
e281ee8
switch back to ref
Aug 9, 2022
1205476
remove prod schema from seed for now
Aug 9, 2022
6ea4a3c
remove more tests
Aug 9, 2022
ca20593
fix bigquery test with ticks instead of quotes
Aug 9, 2022
4963587
remove package files?
Aug 9, 2022
cf9271c
remove more package files
Aug 9, 2022
f27f13c
add a newline
Aug 9, 2022
a676835
remove test stuff
Aug 9, 2022
15d9864
remove newline
Aug 9, 2022
52355f3
add a newline
Aug 9, 2022
7e255fe
trying to make chnages to schema.yml go away
Aug 9, 2022
87169ef
got integration_tests/models/schema.yml from main
Aug 9, 2022
3029cec
hacky workaround to make bigquery work
Aug 9, 2022
8d42358
end with elif
Aug 9, 2022
e9425b0
add comment explaining hacky workaround
Aug 9, 2022
164ae66
Merge branch 'main' into lf/issue-49--compare_all_columns_macro_for_t…
Aug 12, 2022
3ccad81
replace hard-coded relation with freeform user-input for a_relation a…
Aug 16, 2022
c9bc02d
replace bq workaround with adapter.quote
Aug 16, 2022
62c39f0
Update macros/compare_column_values_verbose.sql
Aug 17, 2022
449a39c
update readme
Aug 17, 2022
fd8440a
Merge branch 'lf/issue-49--compare_all_columns_macro_for_testing' of …
Aug 17, 2022
7ee1e0f
add summarize option to compare_all_columns
Aug 17, 2022
d57e988
add tess
Aug 17, 2022
7d6b09a
make col_a the primary key in tests
Aug 17, 2022
7034c65
might not have saved a file?
Aug 17, 2022
ab13892
make exlude columns optional
Aug 17, 2022
35b4d07
make exclude optional in default macro
Aug 17, 2022
d151d7b
fix non-default argument comes before default argument in compare_all…
Aug 17, 2022
6b6b9c5
fix unfortunate comma error in test
Aug 17, 2022
a8c9a07
make exclude_columns non-optional as a test
Aug 17, 2022
43e60da
change argument order in return(adapter.dispatch of compare_all_columns
Aug 17, 2022
beab6d0
try once more with exclude_columns brackets
Aug 17, 2022
771664c
create dedicated seeds for compare_all_columns; correct with_summary_…
leoebfolsom Aug 25, 2022
cb288e1
convert text columns to text to address postgres context
leoebfolsom Aug 25, 2022
b54e14d
remove rows where both columns are null from compare_column_values_ve…
leoebfolsom Aug 25, 2022
ce2ffa3
remove text casting
leoebfolsom Aug 25, 2022
5c4d914
try double quotes for exclude_columns
leoebfolsom Aug 25, 2022
d3e01ad
try using adapter.quote in compare_column_values_verbose column_to_co…
leoebfolsom Aug 25, 2022
a21fe35
change column_name to column in compare_column_values_verbose arg
leoebfolsom Aug 25, 2022
905f207
add curly braces
leoebfolsom Aug 25, 2022
fd7cbdc
remove adapter.quote
leoebfolsom Aug 25, 2022
40e5679
log column namesg
leoebfolsom Aug 25, 2022
f68a13c
add adapter.quote to column_to_compare
leoebfolsom Aug 26, 2022
f1c29b7
add adapter.quote to primary key in compare_column_values_verbose
leoebfolsom Aug 26, 2022
8d63ce7
try making summary_and_exclude test pass by removing the exclude
leoebfolsom Aug 26, 2022
4081a88
fix expected results seed to reflect that there are three rows now
leoebfolsom Aug 26, 2022
d6ea660
reinstate exclude now that postgres test is passing
leoebfolsom Aug 26, 2022
8daaa71
trivial change to try out pipeline
leoebfolsom Aug 29, 2022
c642d5f
update postgres version for testing
leoebfolsom Aug 29, 2022
be7747a
postgres 14.4
leoebfolsom Aug 29, 2022
48810d3
roll back postgres version
leoebfolsom Aug 29, 2022
9ba804f
change postgres version but use circleci instead of cimg
leoebfolsom Aug 29, 2022
6a9def4
switch back to cimg, try postgres 10.20
leoebfolsom Aug 29, 2022
4876f13
try circleci/postgres:10.20
leoebfolsom Aug 29, 2022
ae85d86
try specifying postgres db auth and environment
leoebfolsom Aug 29, 2022
0be0b2e
try adding adapter.quote to column_name in final subquery of compare_…
leoebfolsom Aug 29, 2022
8eba10c
undo that
leoebfolsom Aug 29, 2022
59fdea4
cast column_to_compare to text to satisfy redshift; drop analyses to …
leoebfolsom Aug 30, 2022
f42fc59
restore analyses
leoebfolsom Aug 30, 2022
422f48b
save smoke test
leoebfolsom Aug 30, 2022
d3ae413
test out removing adapter.quote from final select in compare_column_v…
leoebfolsom Aug 30, 2022
32c5195
remove all adapter quotes from compare_column_values_verbose
leoebfolsom Aug 30, 2022
b141af3
change caps of seed data to get snowflake to work
leoebfolsom Aug 31, 2022
fb600af
update cast text to cast string for bigquery
leoebfolsom Aug 31, 2022
fa1c5a2
try using adapter.quote
leoebfolsom Aug 31, 2022
1337c95
add if else to work around postgres casting issue
leoebfolsom Aug 31, 2022
5fdb982
add redshift to if statement regarding casting to text
leoebfolsom Aug 31, 2022
d26a841
fix null_in_a and null_in_b to exclude rows where the pk is missing
leoebfolsom Aug 31, 2022
8c1e7c8
solve bugs discovered while adding test data, mainly related to coale…
leoebfolsom Aug 31, 2022
e3deac3
remove aws binary pkg file
leoebfolsom Aug 31, 2022
2ecb37c
update readme
leoebfolsom Aug 31, 2022
0e8340d
add a test demoing the use of a where clause in a compare_all_columns…
leoebfolsom Aug 31, 2022
155cd69
add logfile to gitignore
leoebfolsom Sep 7, 2022
791f99c
include note about hard-coded relations
leoebfolsom Sep 7, 2022
7677cac
remove stale logfile
leoebfolsom Sep 7, 2022
8 changes: 7 additions & 1 deletion .circleci/config.yml
@@ -5,7 +5,13 @@ jobs:
   build:
     docker:
       - image: cimg/python:3.9.9
-      - image: circleci/postgres:9.6.5-alpine-ram
+      - image: cimg/postgres:14.0
+        auth:
+          username: dbt-labs
+          password: ''
+        environment:
+          POSTGRES_USER: root
+          POSTGRES_DB: circle_test

     steps:
       - checkout
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,4 +1,4 @@

target/
dbt_packages/
logs/
logfile
183 changes: 122 additions & 61 deletions README.md
@@ -7,6 +7,8 @@ Useful macros when performing data audits
* [compare_queries](#compare_queries-source)
* [compare_column_values](#compare_column_values-source)
* [compare_relation_columns](#compare_relation_columns-source)
* [compare_all_columns](#compare_all_columns-source)
* [compare_column_values_verbose](#compare_column_values_verbose-source)

# Installation instructions
New to dbt packages? Read more about them [here](https://docs.getdbt.com/docs/building-a-dbt-project/package-management/).
@@ -160,67 +162,6 @@ number of your records don't match.
work as expected.


### Advanced usage:
Got a wide table, and want to iterate through all the columns? Try something
like this:
```
{%- set columns_to_compare=adapter.get_columns_in_relation(ref('dim_product')) -%}

{% set old_etl_relation_query %}
select * from public.dim_product
where is_latest
{% endset %}

{% set new_etl_relation_query %}
select * from {{ ref('dim_product') }}
{% endset %}

{% if execute %}
{% for column in columns_to_compare %}
{{ log('Comparing column "' ~ column.name ~'"', info=True) }}

{% set audit_query = audit_helper.compare_column_values(
a_query=old_etl_relation_query,
b_query=new_etl_relation_query,
primary_key="product_id",
column_to_compare=column.name
) %}

{% set audit_results = run_query(audit_query) %}
{% do audit_results.print_table() %}
{{ log("", info=True) }}

{% endfor %}
{% endif %}
```

This will give you an output like:
```
Comparing column "name"
| match_status | count_records | percent_of_total |
| -------------------- | ------------- | ---------------- |
| ✅: perfect match | 41,573 | 99.43 |
| 🤷: missing from b | 26 | 0.06 |
| 🙅: ‍values do not... | 212 | 0.51 |

Comparing column "msrp"
| match_status | count_records | percent_of_total |
| -------------------- | ------------- | ---------------- |
| ✅: perfect match | 31,145 | 74.49 |
| ✅: both are null | 10,557 | 25.25 |
| 🤷: missing from b | 22 | 0.05 |
| 🤷: value is null ... | 31 | 0.07 |
| 🤷: value is null ... | 4 | 0.01 |
| 🙅: ‍values do not... | 52 | 0.12 |

Comparing column "status"
| match_status | count_records | percent_of_total |
| -------------------- | ------------- | ---------------- |
| ✅: perfect match | 37,715 | 90.20 |
| 🤷: missing from b | 26 | 0.06 |
| 🙅: ‍values do not... | 4,070 | 9.73 |
```

### Advanced usage - dbt Cloud:
The ``.print_table()`` function is not compatible with dbt Cloud so an adjustment needs to be made in order to print the results. Replace the following section of code:
```
@@ -280,5 +221,125 @@ it is a date in our "b" relation.

```

## compare_all_columns ([source](macros/compare_all_columns.sql))
This macro is designed to be added to a dbt test suite as a custom test. A
`compare_all_columns` test monitors changes to data values when code is changed
as part of a PR or during development. It sets up a test that will fail
if any column values do not match.

Users can configure what exactly constitutes a value match or failure. If
there is a test failure, results can be inspected in the warehouse. The primary key
and the column name can be included in the test output that gets written to the warehouse,
which lets you join test results to relevant tables in your dev or prod schema to investigate the error.

### Usage:

_Note: this test should only be used on (and will only work on) models that have a primary key that is reliably `unique` and `not_null`. [Generic dbt tests](https://docs.getdbt.com/docs/building-a-dbt-project/tests#generic-tests) should be used to ensure the model being tested meets the requirements of `unique` and `not_null`._

To create a test for the `stg_customers` model, create a custom test
in the `tests` subdirectory of your dbt project that looks like this:

```
{# a_relation: in a test, this ref will compile as your dev or PR schema. #}
{# b_relation: you can explicitly write a relation to select your production schema, or any other db/schema/table you'd like to use for comparison testing. #}
{{
  audit_helper.compare_all_columns(
    a_relation=ref('stg_customers'),
    b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'),
    exclude_columns=['updated_at'],
    primary_key='id'
  )
}}
where not perfect_match
```

Review comment (Contributor): I'd probably add a note here saying you can also hard code a table name - for interactive queries where someone is writing one-off code, it's not unreasonable to hardcode a table string for expediency's sake. If you're building it into a CI cycle, then yes please make a proper source/relation/etc

The `where not perfect_match` statement is an example of a filter you can apply to define what
constitutes a test failure. The test will fail if any rows don't meet the
requirement of a perfect match. Failures would include:

* If the primary key exists in both relations, but one model has a null value in a column.
* If a primary key is missing from one relation.
* If the primary key exists in both relations, but the value conflicts.
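
The failure categories above line up with the boolean columns in the macro's unsummarized output (`perfect_match`, `null_in_a`, `null_in_b`, `missing_from_a`, `missing_from_b`, `conflicting_values`). The Python sketch below is not the macro's implementation. It is an illustration, inferred from the expected-results seeds in this PR, of how one primary key and column might be classified. The `classify` helper and the use of `None` for SQL `NULL` are assumptions made for the example. When both values are `NULL`, the sketch sets no flag; a commit in this PR suggests the verbose macro drops such rows.

```python
from typing import Dict, Optional


def classify(in_a: bool, in_b: bool,
             a_value: Optional[str], b_value: Optional[str]) -> Dict[str, bool]:
    """Classify one primary key / column pair (hypothetical helper).

    in_a, in_b: whether the primary key exists in relation a / relation b.
    a_value, b_value: the column's value in each relation; None stands in for NULL.
    """
    in_both = in_a and in_b
    a_null = a_value is None
    b_null = b_value is None
    return {
        "perfect_match": in_both and not a_null and not b_null and a_value == b_value,
        "null_in_a": in_both and a_null and not b_null,
        "null_in_b": in_both and b_null and not a_null,
        "missing_from_a": not in_a,
        "missing_from_b": not in_b,
        "conflicting_values": in_both and not a_null and not b_null and a_value != b_value,
    }


# Primary key 2, column "ripeness": 'green' in a, 'brown' in b.
print(classify(True, True, "green", "brown")["conflicting_values"])  # True
# Primary key 8 exists only in a, so its columns are flagged missing_from_b.
print(classify(True, False, "apple", None)["missing_from_b"])  # True
```

Under this reading, `where not perfect_match` fails on any row with a flag other than `perfect_match` set, while `where conflicting_values` fails only on the first kind of row shown in the usage lines.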

If you'd like the test to only fail when there are conflicting values, you could configure it like this:

```
{{
  audit_helper.compare_all_columns(
    a_relation=ref('stg_customers'),
    b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'),
    primary_key='id'
  )
}}
where conflicting_values
```

#### Arguments:

* `a_relation` and `b_relation`: The [relations](https://docs.getdbt.com/reference#relation)
you want to compare. Any two relations that have the same columns can be used. In the
example above, two different approaches to writing relations, using `ref` and
using `api.Relation.create`, are demonstrated. (When writing one-off code, it might make sense to
hard-code a relation, like this: `analytics_prod.stg_customers`. A hard-coded relation
is not recommended when building this macro into a CI cycle.)
* `exclude_columns` (optional): Any columns you wish to exclude from the
validation.
* `primary_key`: The primary key of the model. Used to sort unmatched
results for row-by-row validation.
* `summarize` (optional): Whether to return per-column counts or one row per primary
  key and column. In this PR's integration tests, omitting the argument yields the
  summarized counts, and `summarize=false` yields the row-level output that filters
  such as `where not perfect_match` are applied to.

If you want to create test results that include columns from the model itself
for easier inspection, that can be written into the test:

```
{{
  audit_helper.compare_all_columns(
    a_relation=ref('stg_customers'),
    b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'),
    exclude_columns=['updated_at'],
    primary_key='id'
  )
}}
left join {{ ref('stg_customers') }} using(id)
```

This structure also allows for the test to group or filter by any attribute in the model or in
the macro's output as part of the test, for example:

```
with base_test_cte as (
  {{
    audit_helper.compare_all_columns(
      a_relation=ref('stg_customers'),
      b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'),
      exclude_columns=['updated_at'],
      primary_key='id'
    )
  }}
  left join {{ ref('stg_customers') }} using(id)
  where conflicting_values
)
select
  status, -- assume there's a "status" column in stg_customers
  count(distinct case when conflicting_values then id end) as conflicting_values
from base_test_cte
group by 1
```

You can write a `compare_all_columns` test for an individual table; the test will then run
as part of a full test suite run, or you can select the model directly:

```
dbt test --select stg_customers
```

If you want to [store results in the warehouse for further analysis](https://docs.getdbt.com/docs/building-a-dbt-project/tests#storing-test-failures), add the `--store-failures`
flag.

```
dbt test --select stg_customers --store-failures
```

## compare_column_values_verbose ([source](macros/compare_column_values_verbose.sql))
This macro will return a query that, when executed, returns the same information as
`compare_column_values`, but not summarized. `compare_column_values_verbose` enables `compare_all_columns` to give the user more flexibility around what will result in a test failure.
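
To illustrate the difference between the unsummarized and summarized shapes, the Python sketch below rolls row-level flags up into per-column counts. It is an illustration under stated assumptions rather than the macro's SQL: the `make_row` and `summarize_flags` helpers and the dict-based row format are hypothetical, and the sample rows mirror the `RIPENESS` rows in this PR's expected-results seeds.

```python
from typing import Dict, Iterable, List

FLAGS = (
    "perfect_match", "null_in_a", "null_in_b",
    "missing_from_a", "missing_from_b", "conflicting_values",
)


def make_row(primary_key: int, column_name: str, true_flag: str) -> Dict[str, object]:
    """Build one unsummarized row with exactly one flag set (hypothetical helper)."""
    row: Dict[str, object] = {"primary_key": primary_key, "column_name": column_name}
    row.update({flag: flag == true_flag for flag in FLAGS})
    return row


def summarize_flags(rows: Iterable[Dict[str, object]]) -> Dict[str, Dict[str, int]]:
    """Count, per column_name, how many rows have each flag set."""
    summary: Dict[str, Dict[str, int]] = {}
    for row in rows:
        counts = summary.setdefault(str(row["column_name"]), {flag: 0 for flag in FLAGS})
        for flag in FLAGS:
            counts[flag] += int(bool(row[flag]))
    return summary


# Row-level flags for the RIPENESS column, one row per primary key.
rows: List[Dict[str, object]] = [
    make_row(1, "RIPENESS", "perfect_match"),
    make_row(2, "RIPENESS", "conflicting_values"),
    make_row(3, "RIPENESS", "perfect_match"),
    make_row(4, "RIPENESS", "perfect_match"),
    make_row(5, "RIPENESS", "perfect_match"),
    make_row(6, "RIPENESS", "perfect_match"),
    make_row(7, "RIPENESS", "null_in_a"),
    make_row(8, "RIPENESS", "missing_from_b"),
    make_row(9, "RIPENESS", "missing_from_a"),
]

summary = summarize_flags(rows)
print(summary["RIPENESS"])
# {'perfect_match': 5, 'null_in_a': 1, 'null_in_b': 0,
#  'missing_from_a': 1, 'missing_from_b': 1, 'conflicting_values': 1}
```

The counts agree with the `RIPENESS,5,1,0,1,1,1` line of the summarized expected results, which is why a row-level filter is only meaningful against the unsummarized shape.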


# To-do:
* Macro to check if two schemas contain the same relations
11 changes: 11 additions & 0 deletions integration_tests/models/compare_all_columns_where_clause.sql
@@ -0,0 +1,11 @@
{% set a_relation=ref('data_compare_all_columns__market_of_choice_produce')%}

{% set b_relation=ref('data_compare_all_columns__albertsons_produce') %}

{{ audit_helper.compare_all_columns(
a_relation=a_relation,
b_relation=b_relation,
primary_key="id",
summarize=false
) }}
where not perfect_match
9 changes: 9 additions & 0 deletions integration_tests/models/compare_all_columns_with_summary.sql
@@ -0,0 +1,9 @@
{% set a_relation=ref('data_compare_all_columns__market_of_choice_produce')%}

{% set b_relation=ref('data_compare_all_columns__albertsons_produce') %}

{{ audit_helper.compare_all_columns(
a_relation=a_relation,
b_relation=b_relation,
primary_key="id"
) }}
@@ -0,0 +1,10 @@
{% set a_relation=ref('data_compare_all_columns__market_of_choice_produce')%}

{% set b_relation=ref('data_compare_all_columns__albertsons_produce') %}

{{ audit_helper.compare_all_columns(
a_relation=a_relation,
b_relation=b_relation,
primary_key="id",
exclude_columns=['ripeness']
) }}
10 changes: 10 additions & 0 deletions integration_tests/models/compare_all_columns_without_summary.sql
@@ -0,0 +1,10 @@
{% set a_relation=ref('data_compare_all_columns__market_of_choice_produce')%}

{% set b_relation=ref('data_compare_all_columns__albertsons_produce') %}

{{ audit_helper.compare_all_columns(
a_relation=a_relation,
b_relation=b_relation,
primary_key="id",
summarize=false
) }}
21 changes: 21 additions & 0 deletions integration_tests/models/schema.yml
@@ -35,3 +35,24 @@ models:
    tests:
      - dbt_utils.equality:
          compare_model: ref('expected_results__compare_relations_without_exclude')

  - name: compare_all_columns_with_summary
    tests:
      - dbt_utils.equality:
          compare_model: ref('expected_results__compare_all_columns_with_summary')

  - name: compare_all_columns_without_summary
    tests:
      - dbt_utils.equality:
          compare_model: ref('expected_results__compare_all_columns_without_summary')


  - name: compare_all_columns_with_summary_and_exclude
    tests:
      - dbt_utils.equality:
          compare_model: ref('expected_results__compare_all_columns_with_summary_and_exclude')

  - name: compare_all_columns_where_clause
    tests:
      - dbt_utils.equality:
          compare_model: ref('expected_results__compare_all_columns_where_clause')
@@ -0,0 +1,9 @@
id,fruit,ripeness
1,banana,yellow
2,banana,brown
3,banana,brown
4,orange,green
5,orange,orange
6,,brown
7,orange,orange
9,apple,mushy
@@ -0,0 +1,9 @@
id,fruit,ripeness
1,banana,yellow
2,banana,green
3,banana,brown
4,orange,green
5,orange,orange
6,orange,brown
7,orange,
8,apple,mushy
@@ -0,0 +1,10 @@
primary_key,column_name,perfect_match,null_in_a,null_in_b,missing_from_a,missing_from_b,conflicting_values
8,ID,false,false,false,false,true,false
9,ID,false,false,false,true,false,false
6,FRUIT,false,false,true,false,false,false
8,FRUIT,false,false,false,false,true,false
9,FRUIT,false,false,false,true,false,false
2,RIPENESS,false,false,false,false,false,true
7,RIPENESS,false,true,false,false,false,false
8,RIPENESS,false,false,false,false,true,false
9,RIPENESS,false,false,false,true,false,false
@@ -0,0 +1,4 @@
column_name,perfect_match,null_in_a,null_in_b,missing_from_a,missing_from_b,conflicting_values
ID,7,0,0,1,1,0
FRUIT,6,0,1,1,1,0
RIPENESS,5,1,0,1,1,1
@@ -0,0 +1,3 @@
column_name,perfect_match,null_in_a,null_in_b,missing_from_a,missing_from_b,conflicting_values
ID,7,0,0,1,1,0
FRUIT,6,0,1,1,1,0
@@ -0,0 +1,28 @@
primary_key,column_name,perfect_match,null_in_a,null_in_b,missing_from_a,missing_from_b,conflicting_values
1,ID,true,false,false,false,false,false
2,ID,true,false,false,false,false,false
3,ID,true,false,false,false,false,false
4,ID,true,false,false,false,false,false
5,ID,true,false,false,false,false,false
6,ID,true,false,false,false,false,false
7,ID,true,false,false,false,false,false
8,ID,false,false,false,false,true,false
9,ID,false,false,false,true,false,false
1,FRUIT,true,false,false,false,false,false
2,FRUIT,true,false,false,false,false,false
3,FRUIT,true,false,false,false,false,false
4,FRUIT,true,false,false,false,false,false
5,FRUIT,true,false,false,false,false,false
6,FRUIT,false,false,true,false,false,false
7,FRUIT,true,false,false,false,false,false
8,FRUIT,false,false,false,false,true,false
9,FRUIT,false,false,false,true,false,false
1,RIPENESS,true,false,false,false,false,false
2,RIPENESS,false,false,false,false,false,true
3,RIPENESS,true,false,false,false,false,false
4,RIPENESS,true,false,false,false,false,false
5,RIPENESS,true,false,false,false,false,false
6,RIPENESS,true,false,false,false,false,false
7,RIPENESS,false,true,false,false,false,false
8,RIPENESS,false,false,false,false,true,false
9,RIPENESS,false,false,false,true,false,false