
Implement group_by_columns argument for relevant tests #633

Merged (15 commits, Aug 26, 2022)

Conversation

@emilyriederer (Contributor) commented Aug 8, 2022

This is a:

  • documentation update
  • bug fix with no breaking changes
  • new functionality
  • a breaking change

All pull requests from community contributors should target the main branch (default).

Description & motivation

Description

This PR closes #450 and #447 by implementing an optional group_by_columns argument across many of the core tests in dbt-utils. Specifically, I extended this argument to all of the relevant tests:

  • equal_rowcount()
  • fewer_rows_than()
  • recency()
  • at_least_one()
  • not_constant()
  • sequential_values()
  • non_null_proportion()

For example, to test for at least one valid value by group, the group_by_columns argument could be used as follows:

  - name: data_test_at_least_one
    columns:
      - name: field
        tests:
          - dbt_utils.at_least_one:
              group_by_columns: ['grouping_column']

Motivation

The motivation for this PR is outlined at greater length in this blog post. In short:

  • Some data checks can only be expressed within a group (e.g. ID values should be unique within a group but can be repeated between groups)
  • Some data checks are more precise when done by group (e.g. not only should table rowcounts be equal but the counts within each group should be equal)

Implementation

In implementing this PR, I considered a few core principles:

  • Make this feature as unobtrusive and isolated as possible with respect to the macros' broader implementation
  • Follow standard DRY principles (specifically, render needed text as few times as possible)
  • Implement consistently across macros

With these principles in mind, the majority of the implementations are like that of the recency macro (L7-11), where all relevant SQL strings are precomputed:

{% set threshold = dbt_utils.dateadd(datepart, interval * -1, dbt_utils.current_timestamp()) %}
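{# when grouping columns are provided, precompute the select-list fragment and the group by clause so each string is rendered only once #}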
{% if group_by_columns|length() > 0 %}
  {% set select_gb_cols = group_by_columns|join(' ,') + ', ' %}
  {% set groupby_gb_cols = 'group by ' + group_by_columns|join(',') %}
{% endif %}

The main deviations from this pattern were the sequential_values() macro (which requires a window function) and the equal_rowcount()/fewer_rows_than() macros (which require joins).

Notes

  • A "test" PR for initial design feedback is discussed in Implement schema tests by group/partition (WIP - not ready for review) #451. Due to significant changes to dbt-utils, I'm sending this as a fresh PR, but the prior issue may add additional context.
  • Per the checklist below, I do not believe that it is relevant to change the README.md for this PR since it only demonstrates simpler examples of macro usage. Please let me know if you would like me to do so.

Checklist

  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I followed guidelines to ensure that my changes will work on "non-core" adapters by:
    • dispatching any new macro(s) so non-core adapters can also use them (e.g. the star() source)
    • using the limit_zero() macro in place of the literal string: limit 0
    • using dbt_utils.type_* macros instead of explicit datatypes (e.g. dbt_utils.type_timestamp() instead of TIMESTAMP)
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md


@joellabes self-requested a review August 9, 2022 00:05
@joellabes (Contributor):

YAY. Thanks @emilyriederer - I will dig into this probably next week?

@joellabes added the "1.0 Changes to include in version 1.0 (especially breaking changes)" label Aug 17, 2022
@joellabes (Contributor) left a comment:

This is really elegant. I kept starting to write comments along the lines of "if you did XXX instead it would be a bit tidier", and then as I wrote it I realised that it was actually covering like 3 different edge cases that my proposal wouldn't have.

I've got one nitpick around comma structure and one pattern where I think we could end up in a column name conflict, but otherwise this is magnifique 🤩

macros/generic_tests/equal_rowcount.sql (review thread outdated, resolved)
macros/generic_tests/fewer_rows_than.sql (review thread outdated, resolved)
macros/generic_tests/sequential_values.sql (review thread outdated, resolved)
@emilyriederer (Contributor, Author):

Thanks for reviewing, @joellabes !

I fixed the simpler comma issue and outlined a few options for the other.

Since it feels like we are close, I'll ask one bigger picture question: Is there any better way I should document these changes?

This felt like too small of a feature to note on the README that discusses macro usage. However, as it stands, this will be quite a "hidden feature" that users will only learn about if they read the underlying source code for the macros. (And, even there, there's no comment defining what that argument does, so they'd literally have to parse it out for themselves.)

@joellabes (Contributor):

@emilyriederer thanks for doing the research! I've replied on that comment above.

This felt like too small of a feature to note on the README that discusses macro usage

Nah I think this is a big deal! To save a ton of repeated documentation, I would recommend documenting it all once at the list of the generic tests, and then saying something like "This test also supports the group_by parameter; see group by for details". (that anchor obviously doesn't work in this issue)

@emilyriederer (Contributor, Author):

Thanks @joellabes ! I've added that section to the README along with other changes. Let me know if you want either more/less detail.

@joellabes (Contributor) left a comment:

sooooo close 😍 thanks for sticking with it! These two changes are the only things I can see holding it back; I would just commit them myself to save you a job but want to check that I've understood them properly!

README.md (review thread outdated, resolved)
macros/generic_tests/fewer_rows_than.sql (review thread outdated, resolved)
emilyriederer and others added 2 commits August 26, 2022 04:46
Co-authored-by: Joel Labes <joel.labes@dbtlabs.com>
Co-authored-by: Joel Labes <joel.labes@dbtlabs.com>
@joellabes (Contributor) left a comment:

DONE. INCREDIBLE.

What a way to earn the badge! Thank you so much 🌟🌟🌟🌟🌟

@joellabes merged commit ed47585 into dbt-labs:main Aug 26, 2022
@emilyriederer (Contributor, Author):

Thanks for the review, @joellabes! Excited to have this merged and to start using it 🤓

joellabes added a commit that referenced this pull request Aug 26, 2022
@joellabes (Contributor):

I just reverted it as I'll put it onto the 1.0 branch, not main - I think it'll get very confusing if it's in the readme now but doesn't come out for a while longer! Sorry for the confusion 😩 will sort it out properly when I'm back online next week 🙇‍♂️

@joellabes (Contributor):

(as with all git installs, you'll be able to install that branch directly if you want to get ahead of the curve though!)

@emilyriederer (Contributor, Author):

Totally makes sense - thanks for the heads-up!

@mgcdanny:

@emilyriederer

  • Can this feature be used to test that the count of groups are monotonically increasing over time? SELECT COUNT(*) FROM table GROUP BY (month-year, segment)
  • Can this feature be used to test that the count of "groups" and count of rows in the groups are the same across two tables? SELECT COUNT(*) FROM table_a GROUP BY (month-year, segment) , SELECT COUNT(*) FROM table_b GROUP BY (month-year, segment)

Appreciate it!

Thanks!

@emilyriederer (Contributor, Author):

Hi @mgcdanny - this feature extends the following checks:

  • equal_rowcount()
  • fewer_rows_than()
  • recency()
  • at_least_one()
  • not_constant()
  • sequential_values()
  • non_null_proportion()

Those all work the same as in dbt-utils, but they are now assessed separately for each group provided.

None of these compare between groups, so I don't believe any of them could accomplish your first question. However, I believe the second can be implemented with equal_rowcount(); see the sketch below.
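
For example (a minimal sketch with hypothetical names, assuming table_a and table_b are both models in your project and that month_year and segment are the grouping columns from your question), the grouped comparison could be configured as:

  - name: table_a
    tests:
      - dbt_utils.equal_rowcount:
          compare_model: ref('table_b')
          group_by_columns: ['month_year', 'segment']

This checks that, for each (month_year, segment) combination, table_a and table_b contain the same number of rows.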

Labels
1.0 Changes to include in version 1.0 (especially breaking changes)
Development

Successfully merging this pull request may close these issues.

Add grouping/partitioning to relevant schema tests
3 participants