[CT-2137] [Feature] Cross-database implementation of is_distinct_from #6997

joellabes · 2023-02-17T02:55:04Z

Is this your first time submitting a feature request?

I have read the expectations for open source contributors
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

If I had a dollar for every time I had done

select a, b, 
  a = b or (a is null and b is null) as are_the_same
from table

or

select a, b, 
  coalesce(a, -1) = coalesce(b, -1) as are_the_same --very dangerous! do not do!
from table

,
I would have a lot of money.

Turns out that there is a solution for this in the sql standard!

We should add this as a cross-db macro - a speedy Google looks like most of our key adapters support it natively except for Redshift

Describe alternatives you've considered

Doing nothing.

Who will this benefit?

This seems useful for two reasons:

By creating and documenting it, more people might stumble over it
I would have given my left leg for this when building the first version of the metrics package, when I was trying to do joins that were potentially null

Are you interested in contributing this feature?

Yep! But it'll be my first cross-adapter contribution so I'll need some handholding

Anything else?

No response

dbeatty10 · 2023-02-17T23:49:00Z

Thanks for opening @joellabes !

What kind of syntax are you imagining, something like this?

{{ dbt.is_distinct_from(a, b) }}

Would you imagine it being being paired with is_not_distinct_from too? Or just rely on the user to do the negation?

For the Redshift implementation (or any other non-conforming database), we'd want to be diligent with the implementation.
As you noted as dangerous!, we wouldn't want to do this implementation, but one that works correctly for all possible input values. (I suspect the SQL in that link would be incorrect when one of width & height is 0 and the other is NULL.)

The truth table here would be handy for integration tests.

Side note: both of the SQL examples you gave would be awesome for a SQL linter to alert about!

joellabes · 2023-02-20T02:27:39Z

What kind of syntax are you imagining, something like this?
{{ dbt.is_distinct_from(a, b) }}

Yeah I think that sounds right to me!

Would you imagine it being being paired with is_not_distinct_from too? Or just rely on the user to do the negation?

I imagined that the user would BYON (bring your own negation)

Side note: both of the SQL examples you gave would be awesome for a SQL linter to alert about!

As in it would say "hey did you know there is a better way to do this"? That would be very cool.

dbeatty10 · 2023-02-20T14:25:27Z

Two follow-up questions:

Are there two or three instances where you've seen a this type of logic expressed (like in the metrics package or other packages? This could help us refine if we're getting the syntax right or not, especially if users are writing their own negations.
Do you know what the generic SQL expression would look like for databases that don't natively support is distinct from?

owenprough-sift · 2023-02-22T20:38:42Z

2. Do you know what the generic SQL expression would look like for databases that don't natively support `is distinct from`?

Probably something like what Joel had in the first post:

( ( a = b ) or ( ( a is null ) and ( b is null ) ) ) /* use as many/few parens as make you feel safe, I reckon */

joellabes · 2023-02-22T21:46:28Z

Are there two or three instances where you've seen a this type of logic expressed (like in the metrics package or other packages? This could help us refine if we're getting the syntax right or not, especially if users are writing their own negations.

No, but that's a 100% fair question before putting it into Core. I desperately hope that my code is no longer in the metrics package 😂 (@callum-mcdata?)

I imagine negations would be

select a, b, 
not ({{ dbt.is_distinct_from('a', 'b') }}) as are_the_same
from {{ ref('the_table') }}

and joins would be

select *
from a
inner join b on not ({{ dbt.is_distinct_from('a.c', 'b.c') }})
and not ({{ dbt.is_distinct_from('a.d', 'b.d') }})

Do you know what the generic SQL expression would look like for databases that don't natively support is distinct from?

Yeah agreed with Owen! mo' brackets le' problems, as they say

dbeatty10 · 2023-02-23T02:26:49Z

Names of the features within the standard

Part 2 of the SQL standard (ISO/IEC 9075-2) added these IDs and names in SQL:1999 and SQL:2003, respectively ¹:

a is distinct from b
- T151: "DISTINCT predicate"
- equivalent to saying "a is not equal to b and at least one of them is not null"
a is not distinct from b
- T152: "DISTINCT predicate with negation"
- equivalent to saying "a and b are either equal or both null"

Should we flip it?

All the examples we've looked at so far are actually for T152 rather than T151!

So I'm wondering, maybe we want to actually implement the equivalent of {{ dbt.is_not_distinct_from('a', 'b') }} rather than {{ dbt.is_distinct_from('a', 'b') }}? To know one way or the other, I think we'll want to see how null-safe comparisons are used most often.

Pondering names

There are only two hard things in Computer Science: cache invalidation, naming things, and three-valued logic.

I can't say I fully grok it all, but I think the source of this issue we're trying to solve is that a = b and a != b aren't intuitive to humans because of SQL's three-valued logic.

I'm guessing that referring to T152 with one of the alternatives below is borderline blasphemous for some reason. Can someone explain to me like I'm five why they are a bad idea?

{{ dbt.is('a', 'b') }}
{{ dbt.same('a', 'b') }}
{{ dbt.equals('a', 'b') }}
{{ dbt.eq('a', 'b') }}
etc.

Shortening the name is inspired by SQLite's is and MySQL/MariaDB's <=> spaceship operator.

Using is might(?) actually have some synergy with optional feature F571, "Truth value tests" from the SQL standard.

Let's find some examples

Let's examine some examples from the wild to confirm/deny if T151 or T152 semantics are most common. A good place to search would be the GitHub repos of the packages listed at https://hub.getdbt.com. Some (untested!) case-insensitive ideas for regular expressions to use with grep:

(.*)=(.*)or(.*)is null(.*)is null(.*)
(.*)coalesce(.*)=(.*)coalesce(.*)

callum-mcdata · 2023-02-23T02:48:57Z

Oh dang - I would have loved this functionality about 6 weeks ago when I was thinking about how we could support null-safe joins. More information in this issue here..

The use case for the metrics package was specifically around joining the outputs of two unique metrics queries where the values within a particular column might be null. Users were running into issues when the null values of dimension_a from metric_a weren't able to join to the null values from dimension_a from metric_b. This led to "misreporting" of numbers because the metric associated with the null values wasn't included in the join.

dbeatty10 · 2023-05-11T19:41:19Z

It could be nice to implement this macro to make it easy to express the logic in cases like dbt-labs/dbt-adapters#159, dbt snapshots, etc.

Refinement needed # 1

The biggest outstanding decision is naming.

Pros: equals(a, b) makes a lot more intuitive sense to me than is_not_distinct_from. It's also a lot easier to write than more explicit alternatives like null_safe_equals.

Cons: I'm unaware of the strongest arguments against the simple & intuitive naming, but would welcome that insight!

Summary: I'm attracted to something simple with clear intent like equals rather than is_not_distinct_from.

Refinement needed # 2

The other outstanding decision is whether to implement both the NULL-aware comparisons of equality and inequality or just equality.

(I'm thinking we wouldn't want to do only inequality as originally requested because the SQL is quite a bit more complicated and equality might be more common.)

If we support both, then the syntax and rendered SQL would be like the following:
{{ dbt.equals(a, b) }}

( (a = b) or (a is null and b is null) )

{{ dbt.not_equals(a, b) }}

not ( (a = b) or (a is null and b is null) )

Versus if we only do equality, then inequality would need to be expressed like this:

not {{ dbt.equals(a, b) }}

not ( (a = b) or (a is null and b is null) )

Pros: Doing both might make them easier to discover and use. The implementation of not equals follows trivially from equals. Opportunity for adapters to optimize both implementations if needed.

Cons: Two different macros to maintain, document, and test. Two different ways for the user to express inequality.

Summary: As long as there is a macro for expressing equality, it doesn't seem necessary that we offer a macro for inequality.

joellabes · 2023-05-16T04:41:38Z

Question 1

I am in two minds:

coherent, simple naming is good (a point for dbt.equals or similar, although I would like it to make clear that it's null-aware, which is a demerit from straight dbt.equals)
for these primitives that are already named, reusing the hard work of the SQL Standards team is good (a point for dbt.is_distinct_from)

The good news is that my friends at Snowflake have split the difference, with what I assume is a wrapper over the top of is distinct from: equal_null.

So I would vote for either that, or something like safe_equals in the same vein as safe_add, safe_divide...

Question 2

I agree with you, I don't think we need both - people can prefix a not if they need to.

dbeatty10 · 2023-06-01T02:19:37Z

Naming

Most frequently, you'll find me on "team standards" which would have me throwing my weight behind dbt.is_not_distinct_from as the name for null-safe equals comparison. But it's unintuitive enough that I don't think it will solve our problem here. Same goes for equal_null.

After thinking on this quite a bit, equals feels best. Intuitive. Simple. Easy.

I'd expect it to be rare for people to be surprised about its behavior. Rather, it's simple enough to be used without reading or consulting the documentation. Can't imagine a sweeter spot than that.

Implementation

Due to the trickiness of SQL's three-valued logic, the implementation needs to be like this:

{% macro equals(actual, expected) %}
{# -- actual is not distinct from expected #}
case when {{ actual }} = {{ expected }} or ({{ actual }} is null and {{ expected }} is null)
    then 0
    else 1
end = 0
{% endmacro %}

Since postgres supports is [not] distinct from, it can be used to show why the case-when is necessary to force either a true or a false value (rather than stubbornly propagating null):

Example

with x as (

    select cast(1 as integer) as i union all
    select cast(2 as integer) as i union all
    select cast(null as integer) as i

),

permutations as (

    select
        x1.i as x1_i,
        x2.i as x2_i 
    from x as x1, x as x2
    order by x1.i, x2.i

)

select
    -- values to compare
    x1_i,
    x2_i,

    -- logic for equals macro
    x1_i = x2_i or (x1_i is null and x2_i is null) as equals_logic,

    -- compliant syntax for null-safe equality
    x1_i is not distinct from x2_i as is_not_distinct_from,

    -- case/when logic for equals macro
    case when x1_i = x2_i or (x1_i is null and x2_i is null)
        then 0
        else 1
    end = 0 as case_equals_macro,

    -- negation of logic for equals macro
    not (x1_i = x2_i or (x1_i is null and x2_i is null)) as not_equals_logic,

    -- compliant syntax for null-safe inequality
    x1_i is distinct from x2_i as is_distinct_from,

    -- negation of case/when logic for equals macro
    not case when x1_i = x2_i or (x1_i is null and x2_i is null)
     then 0
     else 1
    end = 0 as case_equals_macro

from permutations
;

github-actions · 2023-11-30T01:47:20Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

dbeatty10 · 2023-11-30T02:09:10Z

Implementation

In #7776, we implemented a null-safe equals macro for the purposes of supporting test cases:

dbt-core/tests/adapter/dbt/tests/adapter/utils/base_utils.py

Lines 5 to 10 in 5488dfb

    
           {% macro equals(expr1, expr2) -%} 
        
           case when (({{ expr1 }} = {{ expr2 }}) or ({{ expr1 }} is null and {{ expr2 }} is null)) 
        
               then 0 
        
               else 1 
        
           end = 0 
        
           {% endmacro %}

Its logic is tested here (which uses these fixtures).

It is only used within functional tests in pytest; it is not currently available to the adapter outside of the pytest environment.

We could resolve this ticket by:

Lifting-and-shift this macro definition into this folder.
Changing the macro name to be {% macro default__equals(expr1, expr2) -%} (similar to this)
Adding the appropriate dispatch definition (similar to this)

joellabes added enhancement New feature or request triage labels Feb 17, 2023

github-actions bot changed the title ~~[Feature] Cross-database implementation of is_distinct_from~~ [CT-2137] [Feature] Cross-database implementation of is_distinct_from Feb 17, 2023

jtcohen6 added Team:Adapters Issues designated for the adapter area of the code utils Cross-database building blocks labels Feb 17, 2023

dbeatty10 self-assigned this Feb 17, 2023

dbeatty10 added awaiting_response and removed triage labels Feb 17, 2023

github-actions bot added triage and removed awaiting_response labels Feb 20, 2023

dbeatty10 added awaiting_response and removed triage labels Feb 20, 2023

github-actions bot added triage and removed awaiting_response labels Feb 22, 2023

dbeatty10 added awaiting_response and removed triage labels Feb 22, 2023

github-actions bot added triage and removed awaiting_response labels Feb 22, 2023

dbeatty10 added awaiting_response and removed triage labels Feb 23, 2023

github-actions bot added triage and removed awaiting_response labels Feb 23, 2023

dbeatty10 removed the triage label Feb 23, 2023

dbeatty10 added the awaiting_response label Feb 23, 2023

github-actions bot added triage and removed awaiting_response labels Feb 23, 2023

dbeatty10 added awaiting_response and removed triage labels Feb 23, 2023

dbeatty10 mentioned this issue May 11, 2023

[CT-2563] [Bug] Incremental updates using unique_key result in duplicates if fields in the unique_key are null dbt-labs/dbt-adapters#159

Open

2 tasks

github-actions bot added triage and removed awaiting_response labels May 11, 2023

dbeatty10 removed the triage label May 11, 2023

github-actions bot added the triage label May 16, 2023

dbeatty10 mentioned this issue May 20, 2023

[Bug] dbt adapter tests for utils unreliable #7670

Closed

2 tasks

dbeatty10 removed the triage label Jun 1, 2023

This was referenced Jun 4, 2023

Cross-database null-safe equals comparison via equals dbt-labs/docs.getdbt.com#3469

Open

Fix not equals comparison to be null-safe for adapters/utils tests #7776

Merged

github-actions bot added the stale Issues that have gone stale label Nov 30, 2023

dbeatty10 removed the stale Issues that have gone stale label Nov 30, 2023

dbeatty10 removed their assignment Nov 30, 2023

dbeatty10 mentioned this issue Jan 29, 2024

[Feature] Postgres 15+ unique index NULLS NOT DISTINCT #9472

Open

3 tasks

Fleid added the help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors label Mar 1, 2024

dbeatty10 mentioned this issue Jul 11, 2024

Update generate_surrogate_key(...) to return NULL when some columns are NULL dbt-labs/dbt-utils#935

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-2137] [Feature] Cross-database implementation of is_distinct_from #6997

[CT-2137] [Feature] Cross-database implementation of is_distinct_from #6997

joellabes commented Feb 17, 2023

dbeatty10 commented Feb 17, 2023

joellabes commented Feb 20, 2023

dbeatty10 commented Feb 20, 2023

owenprough-sift commented Feb 22, 2023

joellabes commented Feb 22, 2023

dbeatty10 commented Feb 23, 2023 •

edited

Loading

callum-mcdata commented Feb 23, 2023

dbeatty10 commented May 11, 2023

joellabes commented May 16, 2023

dbeatty10 commented Jun 1, 2023

github-actions bot commented Nov 30, 2023

dbeatty10 commented Nov 30, 2023

[CT-2137] [Feature] Cross-database implementation of is_distinct_from #6997

[CT-2137] [Feature] Cross-database implementation of is_distinct_from #6997

Comments

joellabes commented Feb 17, 2023

Is this your first time submitting a feature request?

Describe the feature

Describe alternatives you've considered

Who will this benefit?

Are you interested in contributing this feature?

Anything else?

dbeatty10 commented Feb 17, 2023

joellabes commented Feb 20, 2023

dbeatty10 commented Feb 20, 2023

owenprough-sift commented Feb 22, 2023

joellabes commented Feb 22, 2023

dbeatty10 commented Feb 23, 2023 • edited Loading

Names of the features within the standard

Should we flip it?

Pondering names

Let's find some examples

callum-mcdata commented Feb 23, 2023

dbeatty10 commented May 11, 2023

Refinement needed # 1

Refinement needed # 2

joellabes commented May 16, 2023

Question 1

Question 2

dbeatty10 commented Jun 1, 2023

Naming

Implementation

github-actions bot commented Nov 30, 2023

dbeatty10 commented Nov 30, 2023

Implementation

dbeatty10 commented Feb 23, 2023 •

edited

Loading