
[CT-1229] [Feature] Cross-database cast macro #84

Closed
3 tasks done
Tracked by #9976

dbeatty10 opened this issue Sep 22, 2022 · 5 comments · Fixed by #173 or #55
Labels: enhancement (New feature or request)

dbeatty10 (Contributor) commented Sep 22, 2022

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Per a conversation with @jtcohen6, we are interested in creating a cast() macro meant to behave like the CAST operator described in the SQL standard.

Usage likely to be something like:

{{ cast("some expression", type_string()) }}

which some databases might render as:

cast(some expression as TEXT)

Describe alternatives you've considered

  • Not providing this abstraction at all
  • Providing a GitHub gist containing this abstraction that end users can copy-paste into their own dbt packages/projects

Who will this benefit?

Like all cross-database macros, we anticipate this being most relevant to dbt package maintainers.

It would also be useful in implementing this:

Are you interested in contributing this feature?

Yep!

Anything else?

N/A

@dbeatty10 dbeatty10 added enhancement New feature or request triage labels Sep 22, 2022
@dbeatty10 dbeatty10 self-assigned this Sep 22, 2022
@dbeatty10 dbeatty10 removed the triage label Sep 22, 2022
@github-actions github-actions bot changed the title [Feature] Cross-database cast macro [CT-1229] [Feature] Cross-database cast macro Sep 22, 2022
dbeatty10 (Contributor, Author) commented:

Option X

Calling it for a known, explicit type would look like this:

{{ cast(expression, type_integer()) }}

Calling it for a variable data type would look like this:

{{ cast(expression, api.Column.translate_type(data_type)) }}

The default implementation might look like this:

-- core/dbt/include/global_project/macros/utils/cast.sql
{% macro cast(expression, data_type) %}
    {{ adapter.dispatch('cast', 'dbt') (expression, data_type) }}
{% endmacro %}

{% macro default__cast(expression, data_type) -%}
    cast({{ expression }} as {{ data_type }})
{%- endmacro %}
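For illustration (the column name is hypothetical, and this assumes type_integer() resolves to integer on the target warehouse), the explicit-type call would compile to something like:

-- {{ cast("order_total", type_integer()) }} renders as:
cast(order_total as integer)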

Option Y

Calling it for an explicit type would look like this:

{{ cast(expression, "integer") }}

Calling it for a variable data type would look like this:

{{ cast(expression, data_type) }}

The default implementation would look like:

-- core/dbt/include/global_project/macros/utils/cast.sql
{% macro cast(expression, data_type) %}
    {{ adapter.dispatch('cast', 'dbt') (expression, data_type) }}
{% endmacro %}

{% macro default__cast(expression, data_type) -%}
    cast({{ expression }} as {{ api.Column.translate_type(data_type) }})
{%- endmacro %}
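Either way, the macro goes through adapter.dispatch, so any adapter can replace the default. A minimal sketch of what an override could look like (the myadapter prefix is hypothetical, and the CONVERT-based syntax is just one illustration, borrowed from T-SQL, of a dialect preferring something other than standard CAST):

{% macro myadapter__cast(expression, data_type) -%}
    {# T-SQL-style syntax: CONVERT(data_type, expression) #}
    convert({{ data_type }}, {{ expression }})
{%- endmacro %}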

jtcohen6 (Contributor) commented:

@dbeatty10 Love the way you're thinking about this.

I like Option Y. It's elegant, and it should return the right result given all inputs. The only case where it will be incorrect is if the adapter's type translation code is wrong, by asserting that two types are aliases when they're not identical.

The only downside I can think of is surprise for end users. Imagine the following scenario on BigQuery:

# user writes this code
{{ cast(some_column, "integer") }}
# expects to see
cast(some_column as integer)
# actually sees (in logs / compiled code)
cast(some_column as int64)

Which is ... actually correct! Just potentially surprising, that there's an implicit type translation happening. (To be fair, BigQuery now supports integer as an alias for int64, so the original value would have worked too.)

I don't think we need a "back-out" to avoid that. It would be a bug in the adapter's Column implementation, and should be treated as such.
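For readers unfamiliar with the translation step being discussed: in dbt-core's base Column class, translate_type() is roughly a lookup over a per-adapter TYPE_LABELS mapping. A simplified sketch of that mechanism (see core/dbt/adapters/base/column.py for the real code; the BigQuery remark in the comment is the INTEGER-to-INT64 case):

class Column:
    # Adapters override this mapping; a BigQuery-style adapter would
    # map "INTEGER" to "INT64", producing the rendering shown above.
    TYPE_LABELS = {
        "STRING": "TEXT",
        "TIMESTAMP": "TIMESTAMP WITHOUT TIME ZONE",
        "FLOAT": "FLOAT",
        "INTEGER": "INT",
    }

    @classmethod
    def translate_type(cls, dtype: str) -> str:
        # Unknown types pass through unchanged.
        return cls.TYPE_LABELS.get(dtype.upper(), dtype)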

dbeatty10 (Contributor, Author) commented:

Documenting a little more research here.

  1. dbt.safe_cast()
  2. Column.literal()
  3. Other related issues

dbt.safe_cast()

We already have a safe_cast macro:
https://github.com/dbt-labs/dbt-core/blob/a02db03f458aa19529fe3a28f62efa6f78ed3c87/core/dbt/include/global_project/macros/utils/safe_cast.sql#L5-L9
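For reference, the default implementation at that link is essentially a plain cast (paraphrased here; adapters with a native safe/try cast, such as dbt-bigquery, override it):

{% macro default__safe_cast(field, type) %}
    {# Most databases have no native safe cast, so fall back to cast #}
    cast({{ field }} as {{ type }})
{% endmacro %}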

Column.literal()

Also, there is a literal() method on the Column class:
https://github.com/dbt-labs/dbt-core/blob/0ef9931d197c996ccd4e3caf641729176eceda99/core/dbt/adapters/base/column.py#L104-L105

At least a couple of adapters override this method, namely dbt-bigquery and dbt-spark.

It seems like the safer default implementation would have been cast({} as {}). Then adapters like dbt-postgres, etc. could override with {}::{} if desired.
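A sketch of that contrast (the first method mirrors dbt-core's base Column; the second is the hypothetical cast-based default suggested above, not existing code):

from typing import Any

class Column:
    data_type: str

    def literal(self, value: Any) -> str:
        # Current default: Postgres-style shorthand.
        return "{}::{}".format(value, self.data_type)

class CastStyleColumn(Column):
    def literal(self, value: Any) -> str:
        # Hypothetical safer default: standard SQL CAST, usable as-is
        # by dialects (e.g. T-SQL) that lack the :: operator.
        return "cast({} as {})".format(value, self.data_type)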

It appears to have supported dbt snapshots back when they were called "archives" by casting a string 'null' into a NULL value of the applicable timestamp data type:

{{ timestamp_column.literal('null') }} as tmp_valid_to

(NULL is used by dbt snapshots to indicate "not yet"/ "the distant future" / "never" / "infinity".)

Based on its provenance, it's possible that Column.literal() is no longer in use and could be a candidate for deprecation.

Other related issues

See also:

alison985 commented:
@dbeatty10 I'm working on implementing this functionality in my dbt project right now, bridging Redshift and SQL Server for a few very specific examples.

Could you please explain this more? "Calling it for a variable data type would look like this:" Why/when would I use a variable data type that isn't something like type_integer()? Why would I ever want to use an API call to get a type, e.g. api.Column.translate_type? Which also makes me wonder: what are the risks of that API changing over time?

One thing I would highlight for you and @jtcohen6 is that strings and integers are easy, while anything involving timestamps and timezones is HARD. Timestamps are a better test concept for anything cross-SQL-dialect. I was in charge of scheduling software for two years of my life; please believe me when I say timezones may be the most bug-prone, infuriating area of programming. Knowing that, I did this analysis last year for Redshift, because each column below does something different. 😭 😭 😭

Ideally, we want it to align to the output of current_timestamp() on Redshift (which should be using UTC in each cluster). That would save us a lot of logic in diff functions between created_s3_at and created_redshift_at over time. That said, as long as we can identify what the difference is and it is consistent, we can handle it.
I evaluated the subtleties of time in Redshift the other day. For some insight into the considerations:

select 
    current_setting('timezone')
    , current_date                                      --current_date() errors out

    --No milliseconds and no timezone
    , getdate()                                         --getdate does not exist

    --milliseconds no timezone (because UTC is the currently set timezone)
    , sysdate                                           --timestamp format
    , current_timestamp at time zone 'UTC' as ct_utc
    , current_timestamp::timestamptz at time zone 'UTC' as ct_tz_utc
    , sysdate::timestamptz at time zone 'UTC' as tz_utc
    , sysdate::timestamptz at time zone 'utc' as tz_utc_lowercase

    --With milliseconds and timezone
    , current_timestamp at time zone 'UTC'::timestamptz as ct_utc_tz
    , now()                                             
    , sysdate at time zone 'UTC'        as utc
    , current_timestamp                 as ct 
    , current_timestamp::timestamptz    as ct_tz
    , sysdate::timestamptz                              --timestamptz format


    -- UTC to CDT
        -- In CDT with timezone in the value
        , sysdate::timestamptz at time zone 'CDT'::timestamptz as tz_cdt_tz

        -- In CDT no timezone in the value
        , sysdate::timestamptz at time zone 'cdt' as tz_cdt_lowercase

    -- CDT to UTC
        -- In CDT time(the next day) with timezone in the value
        , sysdate at time zone 'cdt' as cdt 
        , sysdate at time zone 'CDT'::timestamptz as cdt_tz

From this evaluation, I picked the current_timestamp() generated by Redshift, with the default(s) set to UTC, as what to standardize on.

(Now, to throw a wrench in the whole thing, Redshift is deprecating current_timestamp() in favor of getdate(). Because of course when you want a timestamp you'd call a date function... 🤦 )

For your two options above, my first impression was that Option X was better, because type_timestamp() and current_timestamp and convert_timezone and now (note: this isn't the main branch but the tsql-utils pinned version) already have overrides in other database adapters. But maybe Option Y is better, because putting "integer" in the model file is more declarative, offers more control, takes fewer keystrokes, and is more accessible to dbt beginners. I think the answer to my question above would help me and further clarify the ticket.

alison985 commented:
My very specific cases at the moment are:

A) Cast a null value to a timestamp to make sure the table creates that field with the right data type; a sketch of a dispatched approach follows the notes below. (I seem to remember BigQuery being the worst when having to deal with this.)

  • By default, my brain types null::timestamp. However, SQL Server doesn't support ::.
  • Redshift and SQL Server can handle cast(null as timestamp)
  • But timezones...
  • And a wrench: there's a SQL Server error that says "A table can only have one timestamp column. Because table 'apple' already has one, the column 'modified_dw_at' cannot be added." What it really wants is datetime, not timestamp.

B) Cast the current_timestamp (or a field with a value that's already a timestamp) to a timestamp. Think coalesce() and concat() cases, or, in the future, timestamp to timestamptz.

  • Redshift: current_timestamp::timestamp or cast(current_timestamp as timestamp) or cast(getdate() as timestamp)
  • SQL Server: cast(getdate() as datetime)

C) Cast a timestamp to a timestamp that either maintains or adds its timezone.
D) Cast a timestamp without a timezone to a timestamp with a different timezone than the original, undeclared one.

Note that:

  • {{ dbt_date.date_part() }} failed as a cast hack for me, at least at one point.
  • I'm keeping the cast() in the model itself for now.
  • {{ dbt_date.now('UTC') }} doesn't work because it assumes current_timestamp and doesn't have a dispatch, hence no override. There's also a 'dict object' has no attribute 'now' error.
  • I'm currently using env_var() in vars(), based on an if statement for which dialect I'm in. #KnowTheRulesBeforeYouBreakThem
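A hedged sketch of how cases A and B could look with the proposed macro (the column alias is hypothetical, and the SQL Server behavior assumes a tsql-style override of type_timestamp(), which is not confirmed here): combining cast() with dbt's dispatched type_timestamp() would let one model line compile correctly on both warehouses:

-- In a model, on any adapter:
{{ cast("null", dbt.type_timestamp()) }} as modified_dw_at

-- Compiles on Redshift (default type_timestamp) to:
cast(null as timestamp)

-- Compiles on SQL Server, assuming a tsql override of type_timestamp()
-- returning datetime (per the error quoted above), to:
cast(null as datetime)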

@mikealfare mikealfare transferred this issue from dbt-labs/dbt-core Feb 13, 2024
mirnawong1 added a commit to dbt-labs/docs.getdbt.com that referenced this issue Apr 16, 2024
[Preview](https://docs-getdbt-com-git-dbeatty10-patch-3-dbt-labs.vercel.app/reference/dbt-jinja-functions/cross-database-macros)

> [!NOTE]
> I didn't make much of an attempt at versioning this thoughtfully. So please update this as-needed to bring it in line with expectations.

## What are you changing in this pull request and why?

dbt-labs/dbt-adapters#84 was implemented in commit dbt-labs/dbt-adapters@5a50be7 within PR dbt-labs/dbt-adapters#55.

So this docs PR adds it to the listing of [cross-database macros](https://docs.getdbt.com/reference/dbt-jinja-functions/cross-database-macros).

## Checklist
- [x] Review the [Content style guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md) so my content adheres to these guidelines.