Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Allow multiple unique tests for the same column #4102

Closed
1 task done
ilmari-aalto opened this issue Oct 20, 2021 · 6 comments
Closed
1 task done

[Feature] Allow multiple unique tests for the same column #4102

ilmari-aalto opened this issue Oct 20, 2021 · 6 comments
Labels
dbt tests Issues related to built-in dbt testing functionality enhancement New feature or request

Comments

@ilmari-aalto
Copy link

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

Enable more than one unique test for the same column.

I think there are usecases where you'd like to have several unique tests. For example, here's an example related to MRR. In this example, there can only be one conversion record per account, and separately, just one current record per account:

      - name: account_id
        tests:
          # There should be just one is_new_biz record per account
          - unique:
              where: "is_new_biz"
          # There should be just one is_current record per account
          - unique:
              where: "is_current"

Describe alternatives you've considered

  • In some cases several tests could be combined using multiple conditions in where
  • Writing a singular test would always work, but it's disconnected from the rest of the model configuration, and an extra effort
  • Also dbt_utils.unique_combination_of_columns could work but the intention of that test would be less clear

Who will this benefit?

Enabling more than one unique test per column would allow users combine existing tests in novel ways - users would have added flexibility with the same building blocks. A beginner dbt user will not even need to know of the possibility so it wouldn't need to confuse them.

Are you interested in contributing this feature?

No response

Anything else?

From the technical point of view, multiple unique tests are currently not possible because config is not part of the test name, and all tests get the same name. Perhaps a test order number (1, 2, ..), hash, or test config could be added to the test name to work past this. (Perhaps extended test naming would only kick in when more than one test of the same type exists for the same column.)

More discussion and examples in this Slack thread.

@ilmari-aalto ilmari-aalto added enhancement New feature or request triage labels Oct 20, 2021
@jtcohen6
Copy link
Contributor

jtcohen6 commented Oct 21, 2021

edit: accidentally closed this issue when I submitted my comment too early!

Thanks for opening @ilmari-aalto!

In the general sense, there's some overlap with #3348, which proposes a better approach to node name construction across the board. But node naming is particularly difficult for specific instances of generic tests, and particularly so when dealing with test configs.

To boil down the big question, why is this ok?

      - name: my_column
        tests:
          - accepted_values:
              values: ['abc', '123']
              config:
                where: is_alphanumeric
          - accepted_values:
              values: ['123']
              config:
                where: is_numeric

When this isn't?

      - name: my_column
        tests:
          - accepted_values:
              values: ['abc', '123']
              config:
                where: is_alphanumeric
          - accepted_values:
              values: ['abc', '123']
              config:
                where: is_numeric

To make a long story short:

  • where is a test config, universally applicable to all tests, able to be defined and inherited in several places
  • whereas values is a test argument, unique to accepted_values, and can only be defined in this .yml file

Every node in dbt must have a unique identifier (unique_id), which is generally constructed as ``. Today, when generic tests construct their names, they include test arguments, but they do not include test configs. The underlying idea is, the same test can be configured in many ways, and still be the same test. As soon as its arguments change, it's actually a different test, just as if it were to be defined on a different model or column—in fact, this lines up with how model and `column_name` are test arguments, too.

resolution?

To resolve this issue, we could:

  • Include test configs in the test's name1 construction here (i.e. by not popping configs out of the set of args, or by putting them somewhere else where they could be included). What would this lose us? Every time a test's configuration changed, its name and unique_id would also change; it would show up in state:new; it would be impossible to tell, from a metadata perspective, if it's the same test or a different one failing over time.2
  • Create a carve-out for where in particular, since it seems to be the config most often at issue, in part because it used to be an argument for several custom generic tests in dbt-utils.

1 We could also include configs when constructing the unique_id, but not include them in the name, via the disambiguating hash. This is just a cosmetic choice with the same functional outcome.
2 Maybe that's not critical, if we care more about asking "how often do model_a's tests fail" vs. "how often does test_xyz fail."

workarounds

In the meantime, to work around this limitation, users can:

  • Define their own custom generic test, which accepts as a test argument custom_where. It wouldn't be possible to configure custom_where in all the ways and places that where can be configured, or to use it out-of-the-box with any generic test. That's the trade-off, in a nutshell, between configs and arguments.
  • Reconsider what's being testing. As you mention above, defining multiple unique tests on different conditional segments of a table sounds to me more like a unique test with multiple input columns, i.e. dbt_utils.unique_combination_of_columns.

There's a lot of resonance between this issue and #4103, since both want the ability to treat where as a test argument rather than a test config. The resolution there would be different (and substantially trickier), but the proposed workaround is exactly the same.

conclusion (for now)

For my part, I'm not opposed to ever making this change; I feel the jury is still out. There's a real trade-off here, in terms of losing the ability to map tests across invocations/configurations; it feels like a conceptually sound guardrail; and there exist some very reasonable workarounds to achieve the same desired behavior. At the same time, we may look back at this issue several months from now and feel like it's the most obvious and uncontroversial change in the world. If that's the case, we should make it.

@judahrand
Copy link

judahrand commented Mar 14, 2022

This is a rather annoying behaviour for us. We've just upgraded to 1.0 from 0.19 and had assumed that with the addition of the Test Config we'd be able to get rid of our custom recency test. However, due to this behaviour we cannot. One of our use cases looks like this:

- recency:
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'tiktok'"
- recency:
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'twitter'"
- recency:
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'awin'"
- recency:
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'reddit'"
- recency:
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'pinterest'"

The reason for defining the tests separately is to make it clear where the failure is coming from. For this case it seems pretty clear (to me at least) that these are different tests but they all collide due to not including the the config (or just the where clause) in the name. It would also be unhelpful to include the where in the hash but not in the name.

We are currently using the workaround of a a custom test with condition as an argument instead of where (this actually used to be where but that obviously broke when the test configs came in and everything started colliding!) but this seems like a silly required workaround.

@jtcohen6 do you see this case as any different to the original issue? Is this worth rethinking? It seems to me that the where config should definitely be included in the name as my examples avoid seem like fundamentally separate tests.

@jtcohen6
Copy link
Contributor

@judahrand Appreciate the comment!

Rather than hoping to perfect the logic for auto-generating test names (a losing proposition), we think the longer-term fix might look like what's proposed in #3348 (comment) and attempted in #4898. Namely:

- recency:
    name: tiktok_recent_within_2_days
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'tiktok'"
- recency:
    name: twitter_recent_within_2_days
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'twitter'"
- recency:
    name: awin_recent_within_2_days
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'awin'"
- recency:
    name: reddit_recent_within_2_days
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'reddit'"
- recency:
    name: pinterest_recent_within_2_days
    datepart: day
    field: date
    interval: 2
    where: "connection_type_code = 'pinterest'"

What do you think of that?

@judahrand
Copy link

@jtcohen6 Yeah, I think that makes sense and would solve our issue. Though my one concern is how obvious it would be from the current warning that the use can fix it by manually naming the tests. I suppose this could be solved either in the warning or in the docs?

@ilmari-aalto
Copy link
Author

@jtcohen6 - based on a dbt Slack conversation I just realized that this was resolved by custom test names in dbt-labs/docs.getdbt.com#1269. A very neat solution to get past this problem!

Unless you have something else pending, feel free to close this issue, or I can do it as well if you confirm it's OK?

@jtcohen6
Copy link
Contributor

Yessir! Closing now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dbt tests Issues related to built-in dbt testing functionality enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants