Schema_on_Change parameter causes model to fail #486

Open
Christy1984 opened this issue Dec 18, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@Christy1984

Christy1984 commented Dec 18, 2024

Describe the bug

I am running a model with the following configuration (note: file_format is Iceberg):

{{ config(
    materialized='incremental',
    incremental_strategy='append',  
    on_schema_change='append_new_columns',
    partition_by=['extract_date'],
    pre_hook="""
    {{ delete_if_table_exists(this, delete_date=var('run_date', 'date_sub(current_date(), 1)')) }}
    """
) }}

When running this I get the error message:

Glue adapter: Glue returned `error` for statement None for code SqlWrapper2.execute('''describe lindy_raw.raw_pwh_actors_tmp''', use_arrow=False, location='s3://sb1-prod-dev-transaction-prod-dev-data'), AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `lindy_raw`.`raw_pwh_actors_tmp` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 1 pos 9;

I assume this worked in 1.8.6 because the parameter was ignored entirely, but it is now taken into account.

When I delete on_schema_change='append_new_columns' from the configuration it works again, but I get a new error; see issue 487.

Christy1984 added the bug label Dec 18, 2024
@bhawesh490

Hi, any update on this?

@Christy1984
Author

Christy1984 commented Feb 10, 2025

1. Issue with Temporary Table Creation when using on_schema_change

When running with on_schema_change, one of the first steps is to generate a temporary table. This occurs in the following adapter macros:

{% macro glue__create_table_as(temporary, relation, sql) -%}
  {% if temporary -%}  {# This needs to be changed: works if {% if temporary == true -%}, input is too flexible #}
    {{ create_temporary_view(relation, sql) }}
  {%- else -%}
    {%- set file_format = config.get('file_format', validator=validation.any[basestring]) -%}
    {%- set table_properties = config.get('table_properties', default={}) -%}

    {%- set create_statement_string -%}
      {% if file_format in ['delta', 'iceberg'] -%}
        create or replace table
      {%- else -%}
        create table
      {% endif %}
    {%- endset %}

    {{ create_statement_string }} {{ relation }}  {# This needs to contain a schema reference #}
    {% set contract_config = config.get('contract') %}
    {% if contract_config.enforced %}
      {{ get_assert_columns_equivalent(sql) }}
      {# This does not enforce constraints and needs to be addressed #}
      {# We'll need to modify the query because with CREATE TABLE AS SELECT, columns are not explicitly specified #}
    {% endif %}
    
    {{ glue__file_format_clause() }}
    {{ partition_cols(label="partitioned by") }}
    {{ clustered_cols(label="clustered by") }}
    {{ set_table_properties(table_properties) }}
    {{ glue__location_clause() }}
    {{ comment_clause() }}
    as
    {{ sql }}
  {%- endif %}
{%- endmacro -%}

Temporary Physical Table Creation:

{% macro glue__create_temporary_view(relation, sql) -%}
  create or replace table glue_catalog.{{ relation.include(schema=true) }} as  {# Added glue_catalog and schema reference #}
    {{ sql }}
{% endmacro %}

Loading Data from the Temporary Table:

{% macro get_insert_into_sql(source_relation, target_relation) %}
    {%- set dest_columns = adapter.get_columns_in_relation(target_relation) -%}
    {%- set dest_cols_csv = dest_columns | map(attribute='name') | join(', ') -%}
    insert into table {{ target_relation }}
    select {{ dest_cols_csv }}
    from glue_catalog.{{ source_relation.schema }}.{{ source_relation }}  {# Added glue_catalog and schema reference #}
{% endmacro %}

Problem Explanation:

A key issue lies in the condition {% if temporary -%}: it results in the creation of only a temporary view, which Iceberg cannot use to compare columns, so identifying new columns fails.

When a physical temporary table is created instead, the insert into statement still references the temporary view, causing conflicts. This logic needs to be revisited so that handling is consistent depending on whether the relation is temporary or permanent.
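
Below is a minimal sketch of how the temporary-relation macro could branch on the file format, so Iceberg gets a physical staging table while other formats keep a view. The glue_catalog prefix and relation.include(schema=true) mirror the modified macros above; the default view syntax in the else branch is an assumption about current adapter behaviour.

{% macro glue__create_temporary_view(relation, sql) -%}
  {%- set file_format = config.get('file_format', default='parquet') -%}
  {%- if file_format == 'iceberg' -%}
    {# Iceberg cannot compare columns against a temporary view, so
       materialize a physical staging table in glue_catalog instead #}
    create or replace table glue_catalog.{{ relation.include(schema=true) }} as
      {{ sql }}
  {%- else -%}
    {# other formats keep the lightweight temporary view behaviour #}
    create or replace temporary view {{ relation }} as
      {{ sql }}
  {%- endif -%}
{%- endmacro %}

If the temporary relation becomes a physical table, get_insert_into_sql (and any cleanup/drop step) must reference the same glue_catalog-qualified name, otherwise the insert still points at a view that does not exist.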


2. Execution Fails on First Run When Adding a New Column but Works on Subsequent Runs

When adding a new column, the first run fails immediately after the column is added. This seems to occur because the ALTER TABLE ADD COLUMN command does not return metadata in the subsequent call:

Glue adapter: GlueCursor fetchall results=[]

In contrast, calls like SHOW TABLES IN glue_catalog.lindy1_raw LIKE 'xxx' return:

GlueCursor fetchall results=['namespace', 'tableName', 'isTemporary']

This indicates that after adding a column, metadata needs to be refreshed to reflect the changes. This likely requires a new fetch or refresh operation in the Python adapter, which cannot be solved through macros alone.

On subsequent runs, since both the metadata and table structure are now consistent, execution proceeds without issues.
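
As an interim, macro-level workaround one could issue a Spark SQL REFRESH TABLE right after the ALTER TABLE statement, assuming the Glue interactive session honours it; this is only a sketch (the macro name is hypothetical), and a proper fix would still need the refresh/fetch in the Python adapter as noted above.

{% macro glue__refresh_relation_metadata(relation) %}
  {# hypothetical helper: ask Spark to reload cached table metadata after
     ALTER TABLE ADD COLUMNS so the next describe/fetch sees the new column #}
  {% call statement('refresh_relation') %}
    refresh table glue_catalog.{{ relation.schema }}.{{ relation.identifier }}
  {% endcall %}
{% endmacro %}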


3. Adding New Columns Only at the End of a Table (Iceberg Constraint)

The following macro needs to be extended to address Iceberg’s requirement that new columns can only be added at the end of a table:

{% macro default__alter_relation_add_remove_columns(relation, add_columns, remove_columns) %}

To meet this constraint, use the syntax:

ALTER TABLE table_name ADD COLUMN new_column data_type AFTER existing_column;

This will ensure that columns are appended correctly per Iceberg’s requirements.
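
A sketch of what the corresponding adapter override could look like is shown below; the glue__ macro name follows the usual dispatch convention, and the assumption that each column object exposes name and data_type attributes is illustrative rather than confirmed.

{% macro glue__alter_relation_add_remove_columns(relation, add_columns, remove_columns) %}
  {% if add_columns %}
    {%- set existing_columns = adapter.get_columns_in_relation(relation) -%}
    {# track the current last column so every new column is appended at the end #}
    {%- set ns = namespace(last_column=existing_columns[-1].name) -%}
    {% for column in add_columns %}
      {% call statement('add_column_' ~ loop.index) %}
        alter table {{ relation }}
          add column {{ column.name }} {{ column.data_type }} after {{ ns.last_column }}
      {% endcall %}
      {%- set ns.last_column = column.name -%}
    {% endfor %}
  {% endif %}
  {% if remove_columns %}
    {% for column in remove_columns %}
      {% call statement('drop_column_' ~ loop.index) %}
        alter table {{ relation }} drop column {{ column.name }}
      {% endcall %}
    {% endfor %}
  {% endif %}
{% endmacro %}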


4. Optional: Support for Data Type Changes in Iceberg (If Supported)

In many cases, changing data types doesn’t result in data loss. Where Iceberg supports such changes, dbt-glue should accommodate them.

This can be achieved by maintaining a dictionary mapping which data type changes are allowed. For instance:

supported_type_changes = {
  "int": ["bigint", "float"],
  "float": ["double"],
  "string": ["text", "varchar"]
}

In a second phase, this could involve implementing add/drop logic to handle more complex data type changes based on user preferences (even if data is lost). I can provide a detailed specification for this if needed.
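
A minimal sketch of how such a mapping could gate type changes inside a macro (the helper name and the return-based check are hypothetical; the mapping simply mirrors the example above):

{% macro glue__is_safe_type_change(from_type, to_type) %}
  {# hypothetical helper: returns true only for widening changes listed in the mapping #}
  {%- set supported_type_changes = {
        "int": ["bigint", "float"],
        "float": ["double"],
        "string": ["text", "varchar"]
      } -%}
  {{ return(to_type | lower in supported_type_changes.get(from_type | lower, [])) }}
{% endmacro %}

The schema-change logic could then emit an alter column ... type statement only when this check passes, and fall back to add/drop (or raise an error) otherwise.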
