[Feature] Allow multiple `unique` tests for the same column #4102

ilmari-aalto · 2021-10-20T17:17:56Z

Is there an existing feature request for this?

I have searched the existing issues

Describe the Feature

Enable more than one unique test for the same column.

I think there are usecases where you'd like to have several unique tests. For example, here's an example related to MRR. In this example, there can only be one conversion record per account, and separately, just one current record per account:

      - name: account_id
        tests:
          # There should be just one is_new_biz record per account
          - unique:
              where: "is_new_biz"
          # There should be just one is_current record per account
          - unique:
              where: "is_current"

Describe alternatives you've considered

In some cases several tests could be combined using multiple conditions in where
Writing a singular test would always work, but it's disconnected from the rest of the model configuration, and an extra effort
Also dbt_utils.unique_combination_of_columns could work but the intention of that test would be less clear

Who will this benefit?

Enabling more than one unique test per column would allow users combine existing tests in novel ways - users would have added flexibility with the same building blocks. A beginner dbt user will not even need to know of the possibility so it wouldn't need to confuse them.

Are you interested in contributing this feature?

No response

Anything else?

From the technical point of view, multiple unique tests are currently not possible because config is not part of the test name, and all tests get the same name. Perhaps a test order number (1, 2, ..), hash, or test config could be added to the test name to work past this. (Perhaps extended test naming would only kick in when more than one test of the same type exists for the same column.)

More discussion and examples in this Slack thread.

The text was updated successfully, but these errors were encountered:

jtcohen6 · 2021-10-21T12:36:07Z

edit: accidentally closed this issue when I submitted my comment too early!

Thanks for opening @ilmari-aalto!

In the general sense, there's some overlap with #3348, which proposes a better approach to node name construction across the board. But node naming is particularly difficult for specific instances of generic tests, and particularly so when dealing with test configs.

To boil down the big question, why is this ok?

      - name: my_column
        tests:
          - accepted_values:
              values: ['abc', '123']
              config:
                where: is_alphanumeric
          - accepted_values:
              values: ['123']
              config:
                where: is_numeric

When this isn't?

      - name: my_column
        tests:
          - accepted_values:
              values: ['abc', '123']
              config:
                where: is_alphanumeric
          - accepted_values:
              values: ['abc', '123']
              config:
                where: is_numeric

To make a long story short:

where is a test config, universally applicable to all tests, able to be defined and inherited in several places
whereas values is a test argument, unique to accepted_values, and can only be defined in this .yml file

Every node in dbt must have a unique identifier (unique_id), which is generally constructed as ``. Today, when generic tests construct their names, they include test arguments, but they do not include test configs. The underlying idea is, the same test can be configured in many ways, and still be the same test. As soon as its arguments change, it's actually a different test, just as if it were to be defined on a different model or column—in fact, this lines up with how model and `column_name` are test arguments, too.

resolution?

To resolve this issue, we could:

Include test configs in the test's name¹ construction here (i.e. by not popping configs out of the set of args, or by putting them somewhere else where they could be included). What would this lose us? Every time a test's configuration changed, its name and unique_id would also change; it would show up in state:new; it would be impossible to tell, from a metadata perspective, if it's the same test or a different one failing over time.²
Create a carve-out for where in particular, since it seems to be the config most often at issue, in part because it used to be an argument for several custom generic tests in dbt-utils.

¹ We could also include configs when constructing the unique_id, but not include them in the name, via the disambiguating hash. This is just a cosmetic choice with the same functional outcome.
² Maybe that's not critical, if we care more about asking "how often do model_a's tests fail" vs. "how often does test_xyz fail."

workarounds

In the meantime, to work around this limitation, users can:

Define their own custom generic test, which accepts as a test argument custom_where. It wouldn't be possible to configure custom_where in all the ways and places that where can be configured, or to use it out-of-the-box with any generic test. That's the trade-off, in a nutshell, between configs and arguments.
Reconsider what's being testing. As you mention above, defining multiple unique tests on different conditional segments of a table sounds to me more like a unique test with multiple input columns, i.e. dbt_utils.unique_combination_of_columns.

There's a lot of resonance between this issue and #4103, since both want the ability to treat where as a test argument rather than a test config. The resolution there would be different (and substantially trickier), but the proposed workaround is exactly the same.

conclusion (for now)

For my part, I'm not opposed to ever making this change; I feel the jury is still out. There's a real trade-off here, in terms of losing the ability to map tests across invocations/configurations; it feels like a conceptually sound guardrail; and there exist some very reasonable workarounds to achieve the same desired behavior. At the same time, we may look back at this issue several months from now and feel like it's the most obvious and uncontroversial change in the world. If that's the case, we should make it.

judahrand · 2022-03-14T09:26:00Z

This is a rather annoying behaviour for us. We've just upgraded to 1.0 from 0.19 and had assumed that with the addition of the Test Config we'd be able to get rid of our custom recency test. However, due to this behaviour we cannot. One of our use cases looks like this:

- recency:
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'tiktok'"
- recency:
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'twitter'"
- recency:
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'awin'"
- recency:
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'reddit'"
- recency:
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'pinterest'"

The reason for defining the tests separately is to make it clear where the failure is coming from. For this case it seems pretty clear (to me at least) that these are different tests but they all collide due to not including the the config (or just the where clause) in the name. It would also be unhelpful to include the where in the hash but not in the name.

We are currently using the workaround of a a custom test with condition as an argument instead of where (this actually used to be where but that obviously broke when the test configs came in and everything started colliding!) but this seems like a silly required workaround.

@jtcohen6 do you see this case as any different to the original issue? Is this worth rethinking? It seems to me that the where config should definitely be included in the name as my examples avoid seem like fundamentally separate tests.

jtcohen6 · 2022-03-18T08:06:52Z

@judahrand Appreciate the comment!

Rather than hoping to perfect the logic for auto-generating test names (a losing proposition), we think the longer-term fix might look like what's proposed in #3348 (comment) and attempted in #4898. Namely:

- recency:
    name: tiktok_recent_within_2_days
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'tiktok'"
- recency:
    name: twitter_recent_within_2_days
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'twitter'"
- recency:
    name: awin_recent_within_2_days
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'awin'"
- recency:
    name: reddit_recent_within_2_days
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'reddit'"
- recency:
    name: pinterest_recent_within_2_days
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'pinterest'"

What do you think of that?

judahrand · 2022-03-18T09:47:41Z

@jtcohen6 Yeah, I think that makes sense and would solve our issue. Though my one concern is how obvious it would be from the current warning that the use can fix it by manually naming the tests. I suppose this could be solved either in the warning or in the docs?

ilmari-aalto · 2022-06-16T13:27:25Z

@jtcohen6 - based on a dbt Slack conversation I just realized that this was resolved by custom test names in dbt-labs/docs.getdbt.com#1269. A very neat solution to get past this problem!

Unless you have something else pending, feel free to close this issue, or I can do it as well if you confirm it's OK?

jtcohen6 · 2022-06-16T13:39:20Z

Yessir! Closing now

ilmari-aalto added enhancement New feature or request triage labels Oct 20, 2021

jtcohen6 closed this as completed Oct 21, 2021

jtcohen6 reopened this Oct 21, 2021

jtcohen6 added dbt tests Issues related to built-in dbt testing functionality and removed triage labels Oct 21, 2021

jtcohen6 added the Team: Language label Nov 18, 2021

jtcohen6 mentioned this issue Jan 6, 2022

[Bug] Test alias not working with dbt_utils.accepted_range #4549

Closed

1 task

emmyoop mentioned this issue Mar 11, 2022

[CT-323] [Bug] Generic test names collide when same args but different config is used #4830

Closed

1 task

jtcohen6 mentioned this issue Mar 18, 2022

Custom names for generic tests #4898

Merged

7 tasks

jtcohen6 mentioned this issue Mar 25, 2022

Generic tests: custom names, alt format dbt-labs/docs.getdbt.com#1269

Closed

1 task

jtcohen6 closed this as completed Jun 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Allow multiple `unique` tests for the same column #4102

[Feature] Allow multiple `unique` tests for the same column #4102

ilmari-aalto commented Oct 20, 2021

jtcohen6 commented Oct 21, 2021 •

edited

Loading

judahrand commented Mar 14, 2022 •

edited

Loading

jtcohen6 commented Mar 18, 2022

judahrand commented Mar 18, 2022

ilmari-aalto commented Jun 16, 2022

jtcohen6 commented Jun 16, 2022

[Feature] Allow multiple unique tests for the same column #4102

[Feature] Allow multiple unique tests for the same column #4102

Comments

ilmari-aalto commented Oct 20, 2021

Is there an existing feature request for this?

Describe the Feature

Describe alternatives you've considered

Who will this benefit?

Are you interested in contributing this feature?

Anything else?

jtcohen6 commented Oct 21, 2021 • edited Loading

resolution?

workarounds

conclusion (for now)

judahrand commented Mar 14, 2022 • edited Loading

jtcohen6 commented Mar 18, 2022

judahrand commented Mar 18, 2022

ilmari-aalto commented Jun 16, 2022

jtcohen6 commented Jun 16, 2022

[Feature] Allow multiple `unique` tests for the same column #4102

[Feature] Allow multiple `unique` tests for the same column #4102

jtcohen6 commented Oct 21, 2021 •

edited

Loading

judahrand commented Mar 14, 2022 •

edited

Loading