-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lf/issue-49 compare all columns macro for testing #50
lf/issue-49 compare all columns macro for testing #50
Conversation
…olumns_macro_for_testing Merging the proposed update from PR 47 into this one since it is an upstream dependency.
I've run into this error when trying to build a test for
I also asked this question in Slack. |
…ts to logs automatically, to align it with the other existing macros
…testing error noise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really good start 🤩 I've picked a lot of holes in the specifics, but the bones of it are great and I like where it's going!
I'll come back to the docs and integration tests once the foundations are in place, don't want to get you to make a bunch of changes when other feedback might change the expectations there too.
Also I jumped all over the place during this review so if the references to other comments are unclear then lmk and I can fix them up!
macros/compare_all_columns.sql
Outdated
{% set columns_to_compare=audit_helper.pop_columns(model_name, exclude_columns) %} | ||
|
||
{% set old_etl_relation_query %} | ||
select * from {{prod_schema}}.{{ model_name }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This worries me a bit - I understand why it's necessary, but I am hoping there's a safer way.
In general, hardcoding table references is a no-no. In particular, I'm worried that people who have overridden generate_alias_name
to be environment aware will run into problems.
Might ask around and see if there's anything to be done with api.Relation.create
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, it could be solved by the other comments I'm about to write which relate to whether this is sufficiently generic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading api.Relation.create docs and found this:
This object should always be used instead of interpolating values with {{ schema }}.{{ table }} directly.
so ... "I'm in these docs and I don't like it"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'm not totally sure if the way I've updated the code this fully addresses the issue you raise, but please have a look, I think it may do the trick.
macros/compare_all_columns.sql
Outdated
@@ -0,0 +1,59 @@ | |||
{% macro compare_all_columns(model_name, primary_key, prod_schema, exclude_columns, updated_at_column, exclude_recent_hours, direct_conflicts_only) -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As hinted in the below comment, I can see why all of these params are useful, but it feels like a very specific implementation to match what you're trying to do.
I think that a better macro signature would look much more like the existing compare_column_values macro:
{% macro compare_all_columns(model_name, primary_key, prod_schema, exclude_columns, updated_at_column, exclude_recent_hours, direct_conflicts_only) -%} | |
{% macro compare_every_columns_values(a_query, b_query, primary_key) -%} |
(Note: I considered adding direct_conflicts_only
as well, then decided against it - there'll be a comment about that too!)
For you to use this internally at scale, you would then make a macro which generated your a_query
and b_query
and passed those into this macro:
{% set a_query = generate_comparison_query(..., is_prod=False) %}
{% set b_query = generate_comparison_query(..., is_prod=True) %}
By doing this, you don't have to worry about accounting for how other projects generate their schema and model names - if your project always has the same model names everywhere then you can hardcode {{ prod_schema }}.{{ model_name }}
and I promise not to complain 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heard. You gave me some really good ideas, and I agree that boolean columns unlocks a lot of flexibility. I'm getting there, but still some work to do, and possibly some follow-up questions, on this one.
As you'll see in the readme and elsewhere, I'm trying out an approach where the test generates a row for every primary key x column combination. It's a lot of rows, but if you write a reasonable filter on your test, it gives you really good information, and you can even join the results to other tables!
The other option is that for someone who truly just wants a summary, they can turn the macro into a CTE in their test file and then summarize away.
No comment yet on the hardcoded model/schema names; I need to dig in more to figure out a viable alternative.
macros/compare_all_columns.sql
Outdated
*/ | ||
( {{ audit_query }} ) | ||
{% if not loop.last %} | ||
union |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A tiny thing, but union all
means that the DB engine doesn't have to run a distinct
evaluation over your result set, which in your case is guaranteed because you're bringing through a different column name every time. https://stackoverflow.com/questions/49925/what-is-the-difference-between-union-and-union-all
union | |
union all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving this open so I can dig in more / learn rather than just making the change! Thanks!
… compare_all_columns and compare_relations
…all_columns with count columns
…t to analyze the results
…rimary key per column and boolean columns
…get redshift to pass locally
…alues_verbose.sql
…scing null values to false, and errors in seed data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just the gitignore and one comment, then we're ready to rock. congrats 🤩
integration_tests/logfile
Outdated
@@ -0,0 +1,218 @@ | |||
2022-08-31 12:57:58.197 PDT [2839] LOG: starting PostgreSQL 14.5 (Homebrew) on x86_64-apple-darwin21.5.0, compiled by Apple clang version 13.1.6 (clang-1316.0.21.2.5), 64-bit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be in .gitignore
🙏
{{ | ||
audit_helper.compare_all_columns( | ||
a_relation=ref('stg_customers'), -- in a test, this ref will compile as your dev or PR schema. | ||
b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'), -- you can explicitly write a relation to select your production schema, or any other db/schema/table you'd like to use for comparison testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably add a note here saying you can also hard code a table name - for interactive queries where someone is writing one-off code, it's not unreasonable to hardcode a table string for expediency's sake. If you're building it into a CI cycle, then yes please make a proper source/relation/etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for coming on this adventure with me!
Description & motivation
This PR creates a macro
compare_all_columns
which can be included in a dbt test suite to test whether any values in any columns have changed.The output of
compare_all_columns
allows for various testing configurations, including but not limited to:Issue this PR is meant to address: #49
To resolve some Circle CI testing challenges, upgraded Postgres version defined in
.circleci/config.yml
from 9.6.5 to 14.0.Checklist