Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-3243] [Bug] Unrecognized data types break get_column_schema_from_query macro (and contracts) #8877

Closed
2 tasks done
rlh1994 opened this issue Oct 23, 2023 · 3 comments · Fixed by #8887
Closed
2 tasks done
Labels
bug Something isn't working model_contracts

Comments

@rlh1994
Copy link
Contributor

rlh1994 commented Oct 23, 2023

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Currently the get_column_schema_from_query macro, which is used as part of contracts, but also is available for any users to call (although not documented) will fail when it encounters an unknown data type code. This has already been raised as a specific case for postgres uuid type in dbt-labs/dbt-postgres#54 and a fix is in place for both this and the ip_address type in #8357. However this issue is more generic to support any potential non-defined type rather than specifics.

My concern is that if postgres adds any more types in the future this issue will persist and a new PR is going to need to be raised and merged in before this macro can be used or a contract can be enforced on the model. This may also impact other warehouses although I am not sure exactly how their column type mappings are implemented. Already in #8357 someone has mentioned the money type has the same issue.

In addition to failing, it does so in a very unhelpful way, with the error message literally just being the type code from the warehouse, if the behaviour won't change it would be nice to have a more specific error message.

Expected Behavior

I see a few options that may be possible in the case of getting a type code that is not in the type dictionary that dbt is using.

  1. When no match is returned simply return the type code itself, this would allow the macro to complete and contracts to be enforced, however the user would need to define the contract using the type code instead of the type name. The dbt docs could list all supported (postgres?) types and detail what to do for non-supported types. This could lead to issues later if those types are added later and the user upgrades. It would be more work, but if contract checks supported both type codes and type names (auto converting from name to code as codes would always be available) this would solve that issue, but would require changing some of the contract code I suspect?
  2. Return something like "unknown", however this would go against the idea of contracts and causes the same later supported issues.
  3. The macro would still fail, but the data_type_code_to_name method would produce a more useful error message. This is not ideal as it will still fail if any new types are in a table, but at least it is more obvious why it is failing.

Steps To Reproduce

  1. Create a new dbt project, using a postgres connection
  2. Create a new model with the following content:
    {% set uuid_col = get_column_schema_from_query( "select * from (select 'f76838a4-7411-4f04-a960-4fc77fd3107b'::uuid as test_col) a") %}
    
    select 1
    
  3. Do a dbt run, get the 2950 error despite no contract being used in the package.

Relevant log output

09:05:33  Running with dbt=1.6.3
09:05:33  Registered adapter: postgres=1.6.3
09:05:34  Found 1 model, 0 sources, 0 exposures, 0 metrics, 349 macros, 0 groups, 0 semantic models
09:05:34  
09:05:34  Concurrency: 1 threads (target='postgres')
09:05:34  
09:05:34  1 of 1 START sql table model dbt_ryan.uuid_read_test ........................... [RUN]
09:05:34  Unhandled error while executing 
2950
09:05:34  1 of 1 ERROR creating sql table model dbt_ryan.uuid_read_test .................. [ERROR in 0.12s]
09:05:34  
09:05:34  Finished running 1 table model in 0 hours 0 minutes and 0.39 seconds (0.39s).
09:05:34  
09:05:34  Completed with 1 error and 0 warnings:
09:05:34  
09:05:34    2950
09:05:34  
09:05:34  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

Environment

- OS: MacOS
- Python: 3.8.13
- dbt: 1.6.3

Which database adapter are you using with dbt?

postgres

Additional Context

I think this was somewhat discussed in the original PR #6986 but it was decided to progress anyway.

For context the reason we use this macro in our packages explicitly is to get all the columns in a table so that we can rename some, but without having to manually list out all names (because postgres doesn't support exclude syntax, and we don't necessarily know all the columns in the table at this point). https://github.com/snowplow/dbt-snowplow-web/blob/e4ae6b297728abfb2dce5673d540e556b21f68df/models/base/scratch/default/snowplow_web_base_events_this_run.sql#L62

@rlh1994 rlh1994 added bug Something isn't working triage labels Oct 23, 2023
@github-actions github-actions bot changed the title [Bug] New data types break get_column_schema_from_query macro (and contracts) [CT-3243] [Bug] New data types break get_column_schema_from_query macro (and contracts) Oct 23, 2023
@dbeatty10 dbeatty10 changed the title [CT-3243] [Bug] New data types break get_column_schema_from_query macro (and contracts) [CT-3243] [Bug] Unrecognized data types break get_column_schema_from_query macro (and contracts) Oct 23, 2023
@dbeatty10 dbeatty10 self-assigned this Oct 23, 2023
@dbeatty10
Copy link
Contributor

@rlh1994 Thanks for opening this issue, linking to other relevant issues, and laying out multiple potential options!

Potential solution

For dbt-postgres, we could replace this:

@classmethod
def data_type_code_to_name(cls, type_code: int) -> str:
return string_types[type_code].name

With this:

    @classmethod
    def data_type_code_to_name(cls, type_code: int) -> str:
        if type_code in string_types:
            return string_types[type_code].name
        else:
            return f"unknown type_code {type_code}"

Trade-offs

👍 On the plus side:

  • will not raise any error, so an execution of dbt won't break when there is an unrecognized data type code from the python driver

👎 On the negative side:

  • When the type_code is not recognized, then Column.data_type would return an invalid data type like unknown type_code 2950 instead of the actual data type name (due to this code)

In your particular use-case within snowplow_web_base_events_this_run.sql, it looks like you only need the Column name, so the negative side wouldn't affect you 🎉

Reprex

The following reproduction case builds on the example you provided and shows that the potential solution above appears to work with enforced contracts as well.

Click to toggle a reproducible example

models/my_model.sql

{{ config(materialized="table") }}

select
    cast('12.34' as money) as money_col_22,
    cast('a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11' as uuid) as uuid_col_23,
    cast('192.168.100.128/25' as cidr) as cidr_col_24,

models/_models.yml

models:
  - name: my_model
    config:
      contract:
        enforced: true
    columns:
      - name: money_col_22
        data_type: money
      - name: uuid_col_23
        data_type: uuid
      - name: cidr_col_24
        data_type: cidr
dbt run -s my_model

@rlh1994
Copy link
Contributor Author

rlh1994 commented Oct 24, 2023

That works well enough for me, I hadn't looked into how the contracts actually validate but the fact that it still works using the type name is great! I even tested it with a custom user defined type and it works perfectly in contracts!

@alison985
Copy link

I've only skimmed this but it's makes me think of the problem with Redshift Spectrum objects (mostly as sources). Redshift Spectrum objects, at the vast majority of the time, don't give field data types until you materialize them in a different object. You literally can't get the DDL of a Spectrum object. If I'm stuck with having to use a Spectrum object I do this: create table materialize_apple as select * from apple limit 100 and then inspect that table's DDL.

This may or may not be a case to consider here. cc: @dataders

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working model_contracts
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants