Implement `group_by_columns` argument for relevant tests #633

emilyriederer · 2022-08-08T22:57:26Z

This is a:

documentation update
bug fix with no breaking changes
new functionality
a breaking change

All pull requests from community contributors should target the main branch (default).

Description & motivation

Description

This PR closes #450 and #447 by implementing an optional group_by_columns argument across many of the core tests in dbt-utils. Specifically, I added the argument to all of the tests where grouping is relevant:

equal_rowcount()
fewer_rows_than()
recency()
at_least_one()
not_constant()
sequential_values()
non_null_proportion()

For example, to test for at least one valid value by group, the group_by_columns argument could be used as follows:

  - name: data_test_at_least_one
    columns:
      - name: field
        tests:
          - dbt_utils.at_least_one:
              group_by_columns: ['grouping_column']
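
To make the effect of grouping concrete, here is a minimal sketch in Python with the standard-library sqlite3 module. It is not the macro's compiled SQL. The table and column names are borrowed from the YAML above, the rows are invented for illustration, and the `having count(field) = 0` form is an assumption about what "at least one value per group" means.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table data_test_at_least_one (grouping_column text, field text);
    insert into data_test_at_least_one values
        ('a', 'x'),
        ('a', null),
        ('b', null),
        ('b', null);
""")

# Ungrouped check: the column has at least one non-null value overall.
total_non_null = conn.execute(
    "select count(field) from data_test_at_least_one"
).fetchone()[0]

# Grouped check (illustrative): return every group that has no non-null value.
failing_groups = conn.execute("""
    select grouping_column
    from data_test_at_least_one
    group by grouping_column
    having count(field) = 0
    order by grouping_column
""").fetchall()

print(total_non_null)  # 1, so an ungrouped test would pass
print(failing_groups)  # [('b',)], so a grouped test would flag group 'b'
```

The table as a whole passes, but group 'b' has no valid value, which is the kind of failure only a grouped test can surface.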

Motivation

The motivation for this PR is outlined at greater length in this blog post. In short:

Some data checks can only be expressed within a group (e.g. ID values should be unique within a group but can be repeated between groups)
Some data checks are more precise when done by group (e.g. not only should table rowcounts be equal but the counts within each group should be equal)

Implementation

In implementing this PR, I considered a few core principles:

Make this feature as unobtrusive and isolated as possible with respect to the macros' broader implementation
Follow standard DRY principles (e.g. specifically, render needed text as few times as possible)
Implement consistently across macros

With these principles in mind, the majority of implementations are like that of the recency macro (L7-11) where all relevant SQL strings are precomputed:

{% set threshold = dbt_utils.dateadd(datepart, interval * -1, dbt_utils.current_timestamp()) %}
{% if group_by_columns|length() > 0 %}
  {% set select_gb_cols = group_by_columns|join(' ,') + ', ' %}
  {% set groupby_gb_cols = 'group by ' + group_by_columns|join(',') %}
{% endif %}
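
The precompute pattern above can be mirrored in plain Python to see what the two strings look like once rendered. This is a sketch only: `grouping_fragments` and `render_latest_per_group` are hypothetical helper names, the `events` table is invented, and returning empty strings when no columns are given is an assumption standing in for Jinja's undefined variables rendering as nothing.

```python
import sqlite3


def grouping_fragments(group_by_columns):
    """Mirror the Jinja snippet: return (select_gb_cols, groupby_gb_cols)."""
    if len(group_by_columns) > 0:
        select_gb_cols = " ,".join(group_by_columns) + ", "
        groupby_gb_cols = "group by " + ",".join(group_by_columns)
        return select_gb_cols, groupby_gb_cols
    # Assumption: with no grouping columns both fragments render as empty text.
    return "", ""


def render_latest_per_group(table, field, group_by_columns=()):
    """Hypothetical query builder that splices both fragments in exactly once."""
    select_gb_cols, groupby_gb_cols = grouping_fragments(list(group_by_columns))
    sql = f"select {select_gb_cols}max({field}) as most_recent from {table} {groupby_gb_cols}"
    return sql.strip()


ungrouped_sql = render_latest_per_group("events", "updated_at")
grouped_sql = render_latest_per_group("events", "updated_at", ["region"])

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table events (region text, updated_at text);
    insert into events values
        ('eu', '2022-08-01'), ('eu', '2022-08-07'), ('us', '2022-07-15');
""")
rows = sorted(conn.execute(grouped_sql).fetchall())

print(ungrouped_sql)  # select max(updated_at) as most_recent from events
print(grouped_sql)    # select region, max(updated_at) as most_recent from events group by region
print(rows)           # [('eu', '2022-08-07'), ('us', '2022-07-15')]
```

Because both fragments collapse to nothing when no columns are passed, the same query template serves the grouped and ungrouped cases without branching in the SQL body.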

The main deviations from this pattern were the sequential_values() macro (which requires a window function) and the equal_rowcount()/fewer_rows_than() macros (which require joins).
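
For the window-function case, the sketch below shows why a `partition by` is needed when sequences restart in each group. It runs on Python's sqlite3 (window functions need SQLite 3.25 or newer). The `seq_data` table, its rows, and a fixed step of 1 are assumptions for illustration and are not taken from the macro.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table seq_data (grp text, n integer);
    insert into seq_data values
        ('a', 1), ('a', 2), ('a', 3),
        ('b', 1), ('b', 2), ('b', 4);
""")

# Compare each value with the previous value in the same group only.
gap_rows = conn.execute("""
    with windowed as (
        select
            grp,
            n,
            lag(n) over (partition by grp order by n) as previous_n
        from seq_data
    )
    select grp, previous_n, n
    from windowed
    where previous_n is not null
      and n != previous_n + 1
    order by grp, n
""").fetchall()

print(gap_rows)  # [('b', 2, 4)]
```

Only group 'b' is reported, because its sequence jumps from 2 to 4, while group 'a' restarting at 1 is not treated as a gap.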

Notes

A "test" PR for initial design feedback is discussed in Implement schema tests by group/partition (WIP - not ready for review) #451. Due to significant changes to dbt-utils, I'm sending this as a fresh PR, but the prior issue may add additional context.
Per the checklist below, I do not believe that it is relevant to change the README.md for this PR since it only demonstrates simpler examples of macro usage. Please let me know if you would like me to do so.

Checklist

joellabes · 2022-08-09T00:06:41Z

YAY. Thanks @emilyriederer - I will dig into this probably next week?

joellabes

This is really elegant. I kept starting to write comments along the lines of "if you did XXX instead it would be a bit tidier", and then as I wrote it I realised that it was actually covering like 3 different edge cases that my proposal wouldn't have.

I've got one nitpick around comma structure and one pattern where I think we could end up in a column name conflict, but otherwise this is magnifique 🤩

macros/generic_tests/equal_rowcount.sql

macros/generic_tests/fewer_rows_than.sql

macros/generic_tests/sequential_values.sql

emilyriederer · 2022-08-19T10:33:31Z

Thanks for reviewing, @joellabes !

I fixed the simpler comma issue and outlined a few options for the other.

Since it feels like we are close, I'll ask one bigger picture question: Is there any better way I should document these changes?

This felt like too small of a feature to note on the README that discusses macro usage. However, as it stands, this will be quite a "hidden feature" that users only learn if they read the underlying source code for the macros. (And, even there, there's no comment to define what that argument does so they'd literally have to parse it out for themselves.)

joellabes · 2022-08-22T05:13:10Z

@emilyriederer thanks for doing the research! have replied on that comment above.

This felt like too small of a feature to note on the README that discusses macro usage

Nah I think this is a big deal! To save a ton of repeated documentation, I would recommend documenting it all once at the list of the generic tests, and then saying something like "This test also supports the group_by parameter; see group by for details". (that anchor obviously doesn't work in this issue)

emilyriederer · 2022-08-22T10:41:00Z

Thanks @joellabes ! I've added that section to the README along with other changes. Let me know if you want either more/less detail.

joellabes

sooooo close 😍 thanks for sticking with it! These two changes are the only things I can see holding it back; I would just commit them myself to save you a job but want to check that I've understood them properly!

README.md

macros/generic_tests/fewer_rows_than.sql

Co-authored-by: Joel Labes <joel.labes@dbtlabs.com>

joellabes

DONE. INCREDIBLE.

What a way to earn the badge! Thank you so much 🌟🌟🌟🌟🌟

emilyriederer · 2022-08-26T10:06:36Z

Thanks for the review, @joellabes ! Excited to have this merged and to start using 🤓

This reverts commit ed47585.

joellabes · 2022-08-26T10:09:47Z

I just reverted it as I'll put it onto the 1.0 branch, not main - I think it'll get very confusing if it's in the readme now but doesn't come out for a while longer! Sorry for the confusion 😩 will sort it out properly when I'm back online next week 🙇‍♂️

joellabes · 2022-08-26T10:11:40Z

(as with all git installs, you'll be able to install that branch directly if you want to get ahead of the curve though!)

emilyriederer · 2022-08-26T10:22:37Z

Totally makes sense - thanks for the headsup!

mgcdanny · 2022-09-20T18:03:36Z

@emilyriederer

Can this feature be used to test that the count of groups are monotonically increasing over time? SELECT COUNT(*) FROM table GROUP BY (month-year, segment)
Can this feature be used to test that the count of "groups" and count of rows in the groups are the same across two tables? SELECT COUNT(*) FROM table_a GROUP BY (month-year, segment) , SELECT COUNT(*) FROM table_b GROUP BY (month-year, segment)

Appreciate it!

Thanks!

emilyriederer · 2022-09-20T22:29:28Z

Hi @mgcdanny - this feature extends the following checks:

equal_rowcount()
fewer_rows_than()
recency()
at_least_one()
not_constant()
sequential_values()
non_null_proportion()

Those all work the same as in dbt-utils, but they are now assessed separately for each group provided.

None of these compare between groups, so I don't believe any could accomplish your first question. However, I believe the second can be implemented with equal_rowcount().
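
As a rough illustration of the second case, the sketch below compares per-group row counts across two tables using Python's sqlite3. It is not the equal_rowcount() macro: the PR describes that macro as join-based, whereas this sketch swaps in a `union all` plus re-aggregation so it runs on SQLite versions without `full join` support. The table contents and the `rows_a`/`rows_b` aliases are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table_a (month_year text, segment text);
    create table table_b (month_year text, segment text);
    insert into table_a values
        ('2022-08', 'retail'), ('2022-08', 'retail'), ('2022-09', 'retail');
    insert into table_b values
        ('2022-08', 'retail'), ('2022-09', 'retail'), ('2022-09', 'retail');
""")

# Overall row counts match, so an ungrouped comparison would pass.
total_a = conn.execute("select count(*) from table_a").fetchone()[0]
total_b = conn.execute("select count(*) from table_b").fetchone()[0]

# Per-group comparison: stack the counts from both tables, then re-aggregate
# and keep only the groups whose counts differ.
mismatched = conn.execute("""
    select month_year, segment, sum(count_a) as rows_a, sum(count_b) as rows_b
    from (
        select month_year, segment, count(*) as count_a, 0 as count_b
        from table_a
        group by month_year, segment
        union all
        select month_year, segment, 0 as count_a, count(*) as count_b
        from table_b
        group by month_year, segment
    ) as stacked
    group by month_year, segment
    having sum(count_a) != sum(count_b)
    order by month_year, segment
""").fetchall()

print(total_a, total_b)  # 3 3
print(mismatched)        # [('2022-08', 'retail', 2, 1), ('2022-09', 'retail', 1, 2)]
```

A group that exists in only one table also shows up here, with a count of 0 on the missing side.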

emilyriederer added 7 commits August 7, 2022 19:30

Extend testing macros with group_by_columns arg

9fe1d2d

Add integration tests for group_by_columns macro args

006b3cf

Seed tests for group_by_columns arg in test macros

82751f6

Describe group_by_columns enhancement in CHANGELOG

3c64a91

Add integrations tests for fewer_rows_than macro

4f82440

change fake data in test_recency to numeric

772e3c4

fix changelog typo

241090d

joellabes self-requested a review August 9, 2022 00:05

joellabes added the 1.0 label Aug 17, 2022

joellabes requested changes Aug 19, 2022

macros/generic_tests/equal_rowcount.sql

macros/generic_tests/fewer_rows_than.sql

macros/generic_tests/sequential_values.sql

emilyriederer added 3 commits August 19, 2022 04:51

whitespace after commas

b30f76e

remove id column added just for join

e65e44b

remove outer keyword from full join

4b0a512

emilyriederer added 3 commits August 22, 2022 05:25

use explicit fake join keys for equal_rowcount and fewer_rows_than

28c4e43

document grouping feature in README

c1b6100

fix join key name for consistency with macro name (cosmetic change)

33c6383

joellabes requested changes Aug 26, 2022

README.md

macros/generic_tests/fewer_rows_than.sql

emilyriederer and others added 2 commits August 26, 2022 04:46

Use more descriptive group_by_columns README example

028499c

Co-authored-by: Joel Labes <joel.labes@dbtlabs.com>

Fix code comment typo in fewer_rows_than

e5f421b

Co-authored-by: Joel Labes <joel.labes@dbtlabs.com>

joellabes approved these changes Aug 26, 2022

joellabes merged commit ed47585 into dbt-labs:main Aug 26, 2022

joellabes added a commit that referenced this pull request Aug 26, 2022

Revert "Implement group_by_columns argument for relevant tests (#633)"

7351d1e

This reverts commit ed47585.

dataders mentioned this pull request Dec 16, 2022

equal_rowcount, fewer_rows_than macros don't work on Trino due to lateral column aliasing #744

Closed

dbeatty10 mentioned this pull request Jul 2, 2024

Fix at_least_one test when group_by_columns is configured #922

Merged
