[CT-2075] [Feature] Add column aliasing (identifiers) for source tables #6929

Fleid · 2023-02-09T21:55:28Z

Follow-up to the conversation in dbt-labs/dbt-bigquery#365

How about dbt starts supporting column level alias / identifier in sources?
That way, if a field is called my_original_column_name in the actual source, it is always replaced by my_new_column_name in every dbt context.

sources:
  - name: my_source_A
    tables:
    - name: a_table_in_A
      columns:
        - name: my_new_column_name
          identifier: my_original_column_name
          ...

From there, my_new_column_name is replaced by my_original_column_name in every DDL/DML/DQL statement issued to the database where the source table is ref'd (used in FROM or JOIN).

In dbt land, the name of that column is always my_new_column_name. This includes when dbt generates database objects based on the original schema (see the original issue: persisting test results in dbt__test_audit). As far as dbt knows, this column was always named my_new_column_name.

Describe alternatives you've considered

Alternatives are described in the original issue.

Who will this benefit?

Defining tests on sources that surface system columns triggers the creation of database objects that use reserved keywords. This would allow dbt to bypass that.

jtcohen6 · 2023-02-10T08:17:09Z

@Fleid Love to see you opening dbt-core issues ;)

Should we do this? Is this preferable to a "staging model" pattern, where users write a bit of SQL (instead of yaml) to rename (& clean) all the columns in the raw source table? That staging model is a lightweight wrapper (materialized as a view, or ephemerally), but then you can also test it, instead of testing the source table directly.

Maybe that feels like overkill if you just need to rename one column, but I don't think so. This is also a way of providing yourself with an extra abstraction layer, for handling more significant changes in the raw source table than just a rename / enabling dependency inversion in the case they switch ingestion providers.

How would we do this? I'm guessing this would require a new macro, or an override of the {{ source() }} macro (!), which looks up the source's defined columns in the manifest / graph and returns the source as a subquery. Something like:

select * from {{ source('my_source_A', 'a_table_in_A') }}

Which currently compiles to (imagining BigQuery syntax):

select * from `my_source_A`.`a_table_in_A`

But would instead compile to:

select * from (
    select
        my_original_column_name as my_new_column_name,
        ...
    from `my_source_A`.`a_table_in_A`
)
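
To make the compile step above concrete, here is a minimal sketch in plain Python of how such a subquery could be rendered from a source's column definitions. It is not dbt's implementation: the function name `render_aliased_source` and the shape of the `columns` list (dicts with `name` and an optional `identifier`, mirroring the proposed YAML) are assumptions made for illustration only.

```python
def render_aliased_source(relation: str, columns: list[dict]) -> str:
    """Render a subquery that renames physical columns to their dbt names.

    `columns` is a hypothetical structure mirroring the proposed YAML:
    each entry has a `name` (the dbt-facing name) and, optionally, an
    `identifier` (the physical column name in the source table).
    """
    select_items = []
    for col in columns:
        # Fall back to the dbt-facing name when no identifier is given.
        physical = col.get("identifier", col["name"])
        if physical == col["name"]:
            select_items.append(f"        {physical}")
        else:
            select_items.append(f"        {physical} as {col['name']}")
    select_list = ",\n".join(select_items)
    return f"(\n    select\n{select_list}\n    from {relation}\n)"


sql = render_aliased_source(
    "`my_source_A`.`a_table_in_A`",
    [{"name": "my_new_column_name", "identifier": "my_original_column_name"}],
)
print("select * from " + sql)
```

The sketch does no identifier quoting. A real implementation would need adapter-specific quoting, which matters in particular for the reserved-keyword case raised in the original issue.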

Fleid · 2023-02-10T17:16:28Z

I am a bit uneasy about this whole idea to be honest (from the top, not your specific proposal that I find elegant).

In my mental model, sources are made to surface (or maybe in the future generate...) the EL part of ELT. The more we put T capabilities in them, the more you should think about moving that logic into a model.

I like putting restrictions/constraints on the ergonomics, so as a user you can't end in that bad place. We can do that in YAML land because we control the schema of it. We can't in SQL land.

Anyway happy to leave this in refinement for now. We (I) will have to revisit the topic for when we explore "managed sources" aka managing external stuff from dbt.

jtcohen6 · 2023-02-10T19:11:51Z

Agreed with the last point! Where I could see this being useful is around the "managed" sources that require some proactive definition & operation from dbt. Even then, I've really hesitated about encoding transformation/business logic into sources because they ought to be an un-opinionated loading step. But I remember several folks asking for this capability in the dbt-external-tables package (before I abdicated my maintainership of it late last year), e.g. dbt-labs/dbt-external-tables#140

alison985 · 2023-07-07T20:13:00Z

First, I want to say that what @jtcohen6 described (in their initial response in this thread) would save soooooooo much work. I can understand not wanting to do transforms in YML, but a) it's already supported for tables and b) if it is limited to supporting only field name substitutions, then it still keeps logic out of the YML layer. (Probably a regex to remove special characters from the string value?) From the perspective of having 100+ tables just from one data source, making all those models just to alias columns is such a complete waste of time. Then you have to maintain the table structure in YML for the tests and in SQL to pass the fields between transformation phases. Especially at the speed that software develops, handling all the db schema change management is extremely negative ROI. 😢

Second, I would also like to raise a cross-database and/or cross-dialect use case.

For reasons, I have the same ~100 tables in two different database dialects with very different field names. I have tests, etc. already defined for one set of field names. I don't want to copy and paste that YML file, change the field names, copy and paste all the models, change those field names, and then have 4 files to keep in sync (original YML, original SQL, point-in-time YML copy, point-in-time SQL copy). There is also the additional parsing time (though dbt ~1.5 fixes the bug where it would always re-parse everything instead of just the diff 👏 ). I need a way to alias column names to avoid all that non-DRYness.

Now, arguably, this could be done via inheritance (see #6527), but:

- supporting column level identifiers may be easier/faster to implement
- it would support use cases like what @Fleid described originally, related to reserved keywords
- the incremental solution for YML inheritance from jtcohen6 here still requires a lot of re-typing. I just want to flip field names by connection/db/schema location. That conversation focuses on doc blocks through transformation phases, and other semi-related items.
- it's not bad to have multiple ways to do things
- it may have other use cases that haven't been brought up yet

Another use case: needing to support case-sensitivity of column names because they're in a legacy database that uses capitals and it's being ported to a database that lower-cases column names. Even if you only have to support the legacy database names during the migration, supporting column name aliases saves so much time while sticking to a DRY philosophy.

github-actions · 2024-01-04T01:46:55Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions · 2024-01-11T01:47:36Z

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

dkrapohl · 2024-12-23T19:12:53Z

Hello. I do have a scenario where column aliasing would be helpful for source models -- headerless csv file loading into S3 data lakes. Fivetran's SFTP connector is one example. They generate column names themselves with pattern "column_0", "column_1", etc. There is no option to modify these names. Documentation of it is https://fivetran.com/docs/connectors/files#headerlessfilesoptional

Having the ability to alias could allow us to not only address the columns by meaningful names but also help with portability when/if we migrate to a different acquisition method. I can make a macro to set the column name and alias to be identical making migration fairly trivial.
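
As a rough illustration of the headerless-file case, the sketch below builds the name/identifier pairs for positional columns and renders the matching select list. It is plain Python rather than a dbt macro, and the helper names (`positional_aliases`, `select_list`) and the example column names are hypothetical; only the `column_0`, `column_1` pattern comes from the comment above.

```python
def positional_aliases(names: list[str], prefix: str = "column_") -> list[dict]:
    """Pair each meaningful name with its positional physical column.

    The i-th name maps to `<prefix><i>`, starting at 0, matching the
    "column_0", "column_1" pattern described for headerless files.
    """
    return [
        {"name": name, "identifier": f"{prefix}{i}"}
        for i, name in enumerate(names)
    ]


def select_list(columns: list[dict]) -> str:
    """Render `identifier as name` pairs, one per line."""
    return ",\n".join(f"{c['identifier']} as {c['name']}" for c in columns)


# Hypothetical column names for a headerless CSV with two fields.
columns = positional_aliases(["customer_id", "order_date"])
print(select_list(columns))
```

If the acquisition method later changes and the file arrives with real headers, only the mapping would need to change (name and identifier become identical) while downstream references stay the same.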

Fleid added enhancement New feature or request triage labels Feb 9, 2023

github-actions bot changed the title ~~[Feature] Add column aliasing (identifiers) for source tables~~ [CT-2075] [Feature] Add column aliasing (identifiers) for source tables Feb 9, 2023

Fleid mentioned this issue Feb 9, 2023

[CT-1415] [Feature] Add adapter-specific aliasing for Google pseudo columns dbt-labs/dbt-bigquery#365

Closed

3 tasks

jtcohen6 added Refinement Maintainer input needed and removed triage labels Feb 10, 2023

Fleid self-assigned this Feb 11, 2023

alison985 mentioned this issue Jul 7, 2023

[CT-2149] [Feature] Add the possibility to map a single dbt entity to multiple databases #7021

Closed

3 tasks

adam-campbell-mfe mentioned this issue Oct 25, 2023

Snowflake External Tables - Ability to add in custom col_expression dbt-labs/dbt-external-tables#140

Closed

github-actions bot added the stale Issues that have gone stale label Jan 4, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 11, 2024

vvcb mentioned this issue Mar 30, 2024

Add stg model for Allergies table OHDSI/dbt-synthea#28

Merged
