Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2137] [Feature] Cross-database implementation of is_distinct_from #6997

Open
3 tasks done
joellabes opened this issue Feb 17, 2023 · 12 comments
Open
3 tasks done
Labels
enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors Team:Adapters Issues designated for the adapter area of the code utils Cross-database building blocks

Comments

@joellabes
Copy link
Contributor

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

If I had a dollar for every time I had done

select a, b, 
  a = b or (a is null and b is null) as are_the_same
from table

or

select a, b, 
  coalesce(a, -1) = coalesce(b, -1) as are_the_same --very dangerous! do not do!
from table

,
I would have a lot of money.

Turns out that there is a solution for this in the sql standard!

image

We should add this as a cross-db macro - a speedy Google looks like most of our key adapters support it natively except for Redshift

Describe alternatives you've considered

Doing nothing.

Who will this benefit?

This seems useful for two reasons:

  • By creating and documenting it, more people might stumble over it
  • I would have given my left leg for this when building the first version of the metrics package, when I was trying to do joins that were potentially null

Are you interested in contributing this feature?

Yep! But it'll be my first cross-adapter contribution so I'll need some handholding

Anything else?

No response

@joellabes joellabes added enhancement New feature or request triage labels Feb 17, 2023
@github-actions github-actions bot changed the title [Feature] Cross-database implementation of is_distinct_from [CT-2137] [Feature] Cross-database implementation of is_distinct_from Feb 17, 2023
@jtcohen6 jtcohen6 added Team:Adapters Issues designated for the adapter area of the code utils Cross-database building blocks labels Feb 17, 2023
@dbeatty10 dbeatty10 self-assigned this Feb 17, 2023
@dbeatty10
Copy link
Contributor

Thanks for opening @joellabes !

What kind of syntax are you imagining, something like this?

{{ dbt.is_distinct_from(a, b) }}

Would you imagine it being being paired with is_not_distinct_from too? Or just rely on the user to do the negation?

For the Redshift implementation (or any other non-conforming database), we'd want to be diligent with the implementation.
As you noted as dangerous!, we wouldn't want to do this implementation, but one that works correctly for all possible input values. (I suspect the SQL in that link would be incorrect when one of width & height is 0 and the other is NULL.)

The truth table here would be handy for integration tests.

Side note: both of the SQL examples you gave would be awesome for a SQL linter to alert about!

@joellabes
Copy link
Contributor Author

What kind of syntax are you imagining, something like this?

{{ dbt.is_distinct_from(a, b) }}

Yeah I think that sounds right to me!

Would you imagine it being being paired with is_not_distinct_from too? Or just rely on the user to do the negation?

I imagined that the user would BYON (bring your own negation)

Side note: both of the SQL examples you gave would be awesome for a SQL linter to alert about!

As in it would say "hey did you know there is a better way to do this"? That would be very cool.

@dbeatty10
Copy link
Contributor

Two follow-up questions:

  1. Are there two or three instances where you've seen a this type of logic expressed (like in the metrics package or other packages? This could help us refine if we're getting the syntax right or not, especially if users are writing their own negations.
  2. Do you know what the generic SQL expression would look like for databases that don't natively support is distinct from?

@owenprough-sift
Copy link

2. Do you know what the generic SQL expression would look like for databases that don't natively support `is distinct from`?

Probably something like what Joel had in the first post:

( ( a = b ) or ( ( a is null ) and ( b is null ) ) ) /* use as many/few parens as make you feel safe, I reckon */

@joellabes
Copy link
Contributor Author

  1. Are there two or three instances where you've seen a this type of logic expressed (like in the metrics package or other packages? This could help us refine if we're getting the syntax right or not, especially if users are writing their own negations.

No, but that's a 100% fair question before putting it into Core. I desperately hope that my code is no longer in the metrics package 😂 (@callum-mcdata?)

I imagine negations would be

select a, b, 
not ({{ dbt.is_distinct_from('a', 'b') }}) as are_the_same
from {{ ref('the_table') }}

and joins would be

select *
from a
inner join b on not ({{ dbt.is_distinct_from('a.c', 'b.c') }})
and not ({{ dbt.is_distinct_from('a.d', 'b.d') }})
  1. Do you know what the generic SQL expression would look like for databases that don't natively support is distinct from?

Yeah agreed with Owen! mo' brackets le' problems, as they say

@dbeatty10
Copy link
Contributor

dbeatty10 commented Feb 23, 2023

Names of the features within the standard

Part 2 of the SQL standard (ISO/IEC 9075-2) added these IDs and names in SQL:1999 and SQL:2003, respectively ¹:

  • a is distinct from b
    • T151: "DISTINCT predicate"
    • equivalent to saying "a is not equal to b and at least one of them is not null"
  • a is not distinct from b
    • T152: "DISTINCT predicate with negation"
    • equivalent to saying "a and b are either equal or both null"

Should we flip it?

All the examples we've looked at so far are actually for T152 rather than T151!

So I'm wondering, maybe we want to actually implement the equivalent of {{ dbt.is_not_distinct_from('a', 'b') }} rather than {{ dbt.is_distinct_from('a', 'b') }}? To know one way or the other, I think we'll want to see how null-safe comparisons are used most often.

Pondering names

There are only two hard things in Computer Science: cache invalidation, naming things, and three-valued logic.

I can't say I fully grok it all, but I think the source of this issue we're trying to solve is that a = b and a != b aren't intuitive to humans because of SQL's three-valued logic.

I'm guessing that referring to T152 with one of the alternatives below is borderline blasphemous for some reason. Can someone explain to me like I'm five why they are a bad idea?

  • {{ dbt.is('a', 'b') }}
  • {{ dbt.same('a', 'b') }}
  • {{ dbt.equals('a', 'b') }}
  • {{ dbt.eq('a', 'b') }}
  • etc.

Shortening the name is inspired by SQLite's is and MySQL/MariaDB's <=> spaceship operator.

Using is might(?) actually have some synergy with optional feature F571, "Truth value tests" from the SQL standard.

Let's find some examples

Let's examine some examples from the wild to confirm/deny if T151 or T152 semantics are most common. A good place to search would be the GitHub repos of the packages listed at https://hub.getdbt.com. Some (untested!) case-insensitive ideas for regular expressions to use with grep:

  • (.*)=(.*)or(.*)is null(.*)is null(.*)
  • (.*)coalesce(.*)=(.*)coalesce(.*)

@callum-mcdata
Copy link
Contributor

Oh dang - I would have loved this functionality about 6 weeks ago when I was thinking about how we could support null-safe joins. More information in this issue here..

The use case for the metrics package was specifically around joining the outputs of two unique metrics queries where the values within a particular column might be null. Users were running into issues when the null values of dimension_a from metric_a weren't able to join to the null values from dimension_a from metric_b. This led to "misreporting" of numbers because the metric associated with the null values wasn't included in the join.

@dbeatty10
Copy link
Contributor

It could be nice to implement this macro to make it easy to express the logic in cases like dbt-labs/dbt-adapters#159, dbt snapshots, etc.

Refinement needed # 1

The biggest outstanding decision is naming.

Pros: equals(a, b) makes a lot more intuitive sense to me than is_not_distinct_from. It's also a lot easier to write than more explicit alternatives like null_safe_equals.

Cons: I'm unaware of the strongest arguments against the simple & intuitive naming, but would welcome that insight!

Summary: I'm attracted to something simple with clear intent like equals rather than is_not_distinct_from.

Refinement needed # 2

The other outstanding decision is whether to implement both the NULL-aware comparisons of equality and inequality or just equality.

(I'm thinking we wouldn't want to do only inequality as originally requested because the SQL is quite a bit more complicated and equality might be more common.)

If we support both, then the syntax and rendered SQL would be like the following:
{{ dbt.equals(a, b) }}

( (a = b) or (a is null and b is null) )

{{ dbt.not_equals(a, b) }}

not ( (a = b) or (a is null and b is null) )

Versus if we only do equality, then inequality would need to be expressed like this:

not {{ dbt.equals(a, b) }}

not ( (a = b) or (a is null and b is null) )

Pros: Doing both might make them easier to discover and use. The implementation of not equals follows trivially from equals. Opportunity for adapters to optimize both implementations if needed.

Cons: Two different macros to maintain, document, and test. Two different ways for the user to express inequality.

Summary: As long as there is a macro for expressing equality, it doesn't seem necessary that we offer a macro for inequality.

@joellabes
Copy link
Contributor Author

Question 1

I am in two minds:

  • coherent, simple naming is good (a point for dbt.equals or similar, although I would like it to make clear that it's null-aware, which is a demerit from straight dbt.equals)
  • for these primitives that are already named, reusing the hard work of the SQL Standards team is good (a point for dbt.is_distinct_from)

The good news is that my friends at Snowflake have split the difference, with what I assume is a wrapper over the top of is distinct from: equal_null.

So I would vote for either that, or something like safe_equals in the same vein as safe_add, safe_divide...

Question 2

I agree with you, I don't think we need both - people can prefix a not if they need to.

@dbeatty10
Copy link
Contributor

Naming

Most frequently, you'll find me on "team standards" which would have me throwing my weight behind dbt.is_not_distinct_from as the name for null-safe equals comparison. But it's unintuitive enough that I don't think it will solve our problem here. Same goes for equal_null.

After thinking on this quite a bit, equals feels best. Intuitive. Simple. Easy.

I'd expect it to be rare for people to be surprised about its behavior. Rather, it's simple enough to be used without reading or consulting the documentation. Can't imagine a sweeter spot than that.

Implementation

Due to the trickiness of SQL's three-valued logic, the implementation needs to be like this:

{% macro equals(actual, expected) %}
{# -- actual is not distinct from expected #}
case when {{ actual }} = {{ expected }} or ({{ actual }} is null and {{ expected }} is null)
    then 0
    else 1
end = 0
{% endmacro %}

Since postgres supports is [not] distinct from, it can be used to show why the case-when is necessary to force either a true or a false value (rather than stubbornly propagating null):

Example
with x as (

    select cast(1 as integer) as i union all
    select cast(2 as integer) as i union all
    select cast(null as integer) as i

),

permutations as (

    select
        x1.i as x1_i,
        x2.i as x2_i 
    from x as x1, x as x2
    order by x1.i, x2.i

)

select
    -- values to compare
    x1_i,
    x2_i,

    -- logic for equals macro
    x1_i = x2_i or (x1_i is null and x2_i is null) as equals_logic,

    -- compliant syntax for null-safe equality
    x1_i is not distinct from x2_i as is_not_distinct_from,

    -- case/when logic for equals macro
    case when x1_i = x2_i or (x1_i is null and x2_i is null)
        then 0
        else 1
    end = 0 as case_equals_macro,

    -- negation of logic for equals macro
    not (x1_i = x2_i or (x1_i is null and x2_i is null)) as not_equals_logic,

    -- compliant syntax for null-safe inequality
    x1_i is distinct from x2_i as is_distinct_from,

    -- negation of case/when logic for equals macro
    not case when x1_i = x2_i or (x1_i is null and x2_i is null)
     then 0
     else 1
    end = 0 as case_equals_macro

from permutations
;

Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Nov 30, 2023
@dbeatty10 dbeatty10 removed the stale Issues that have gone stale label Nov 30, 2023
@dbeatty10 dbeatty10 removed their assignment Nov 30, 2023
@dbeatty10
Copy link
Contributor

Implementation

In #7776, we implemented a null-safe equals macro for the purposes of supporting test cases:

{% macro equals(expr1, expr2) -%}
case when (({{ expr1 }} = {{ expr2 }}) or ({{ expr1 }} is null and {{ expr2 }} is null))
then 0
else 1
end = 0
{% endmacro %}

Its logic is tested here (which uses these fixtures).

It is only used within functional tests in pytest; it is not currently available to the adapter outside of the pytest environment.

We could resolve this ticket by:

  1. Lifting-and-shift this macro definition into this folder.
  2. Changing the macro name to be {% macro default__equals(expr1, expr2) -%} (similar to this)
  3. Adding the appropriate dispatch definition (similar to this)

@Fleid Fleid added the help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors label Mar 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors Team:Adapters Issues designated for the adapter area of the code utils Cross-database building blocks
Projects
None yet
Development

No branches or pull requests

6 participants