Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2075] [Feature] Add column aliasing (identifiers) for source tables #6929

Closed
Fleid opened this issue Feb 9, 2023 · 7 comments
Closed
Assignees
Labels
enhancement New feature or request Refinement Maintainer input needed stale Issues that have gone stale

Comments

@Fleid
Copy link
Contributor

Fleid commented Feb 9, 2023

Follow up of the conversation in dbt-labs/dbt-bigquery#365

How about dbt starts supporting column level alias / identifier in sources?
So that if a field is called my_original_column_name in the actual source, it will always be replaced by my_new_column_name in every dbt contexts.

sources:
  - name: my_source_A
    tables:
    - name: a_table_in_A
      columns:
        - name: my_new_column_name
           identifier: my_original_column_name
          ...

From there, my_new_column_name is replaced by my_original_column_name in every DDL/DML/DQL statements issued to the database where the source table is ref'd (used in FROM or JOIN).

In dbt land, the name of that column is always my_new_column_name. This includes when dbt generates database objects based of the original schema (see original issue, persisting test results in dbt__test_audit). As far as dbt knows, this column was always named my_new_column_name.

Describe alternatives you've considered

Alternatives are described in the original issue.

Who will this benefit?

Defining tests on sources that surface system columns, trigger the creation of database objects using reserved keywords. This would allow dbt to bypass that.

@Fleid Fleid added enhancement New feature or request triage labels Feb 9, 2023
@github-actions github-actions bot changed the title [Feature] Add column aliasing (identifiers) for source tables [CT-2075] [Feature] Add column aliasing (identifiers) for source tables Feb 9, 2023
@jtcohen6
Copy link
Contributor

@Fleid Love to see you opening dbt-core issues ;)

Should we do this? Is this preferable to a "staging model" pattern, where users write a bit of SQL (instead of yaml) to rename (& clean) all the columns in the raw source table? That staging model is a lightweight wrapper (materialized as a view, or ephemerally), but then you can also test it, instead of testing the source table directly.

Maybe that feels like overkill if you just need to rename one column, but I don't think so. This is also a way of providing yourself with an extra abstraction layer, for handling more significant changes in the raw source table than just a rename / enabling dependency inversion in the case they switch ingestion providers.


How would we do this? I'm guessing this would require a new macro, or override to the {{ source() }} macro (!), which looks up the source's defined columns in the manifest / graph, and returns it as a subquery. Something like:

select * from {{ source('my_source_A, 'a_table_in_A') }}

Which currently compiles to (imagining BigQuery syntax):

select * from `my_source_A`.`a_table_in_A`

But would instead compile to:

select * from (
    select
        my_original_column_name as my_new_column_name,
        ...
    from `my_source_A`.`a_table_in_A`
)

@jtcohen6 jtcohen6 added Refinement Maintainer input needed and removed triage labels Feb 10, 2023
@Fleid
Copy link
Contributor Author

Fleid commented Feb 10, 2023

I am a bit uneasy about this whole idea to be honest (from the top, not your specific proposal that I find elegant).

In my mental model, sources are made to surface (or maybe in the future generate...) the EL part of ELT. The more we put T capabilities in them, the more you should think about moving that logic into a model.

I like putting restrictions/constraints on the ergonomics, so as a user you can't end in that bad place. We can do that in YAML land because we control the schema of it. We can't in SQL land.

Anyway happy to leave this in refinement for now. We (I) will have to revisit the topic for when we explore "managed sources" aka managing external stuff from dbt.

@jtcohen6
Copy link
Contributor

Agreed with the last point! Where I could see this being useful is around the "managed" sources that require some proactive definition & operation from dbt. Even then, I've really hesitated about encoding transformation/business logic into sources because they ought to be an un-opinionated loading step. But I remember several folks asking for this capability in the dbt-external-tables package (before I abdicated my maintainership of it late last year), e.g. dbt-labs/dbt-external-tables#140

@Fleid Fleid self-assigned this Feb 11, 2023
@alison985
Copy link

First, I want to say that what @jtcohen6 described(in their initial response in this thread) would save soooooooo much work. I can understand not wanting to do transforms in YML, but it's a) already supported for tables and b) if, in some way, it is limited to only supporting field name substitutions then it still keeps logic out of the YML layer. (Probably a regex to remove special characters from the string value?) From the perspective of having 100+ tables just from one data source, making all those models just to alias columns is such a complete waste of time. Then you have to maintain the table structure in YML for the tests and the SQL to pass the field between transformation phases. Especially at the speed that software develops, handling all the db schema change management is extremely negative ROI. 😢

Second, I would also like to raise a cross-database and/or cross-dialect use cases.

For reasons, I have the same ~100 tables in two different database dialects with very different field names. I have tests, etc. already defined for one set of field names. I don't want to copy and paste that YML file, change the field names, copy and paste all the models, change those field names, and then have 4 files to keep in sync(original YML, original SQL, point-in-time YML copy, point in time SQL copy). Also, the additional parsing time (though dbt ~1.5 fixes the bug where it would always re-parse everything instead of just the diff 👏 ). I need a way to alias column names to avoid all that non-DRYness.

Now, arguably, this could be done via inheritance(see #6527), but:

  1. supporting column level identifiers: may be easier/faster to implement
  2. would support use cases like what @Fleid described originally, related to reserved keywords
  3. the incremental solution for YML inheritance from jtcohen6 here still requires a lot of re-typing. I just want to flip field names by connection/db/schema location. That conversation focuses on doc blocks through transformation phases, and other semi-related items.
  4. it's not bad to have multiple ways to do things
  5. may have other use cases that haven't been brought up yet

Another use case thoughts: Needing to support case-sensitivity of column names because they're in a legacy database that uses capitals and it's being ported to a database that lower-cases column names. Even if you only have to support the legacy database names during the migration, supporting column name aliases saves so much time while sticking to a DRY philosophy.

Copy link
Contributor

github-actions bot commented Jan 4, 2024

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Jan 4, 2024
Copy link
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 11, 2024
@dkrapohl
Copy link

Hello. I do have a scenario where column aliasing would be helpful for source models -- headerless csv file loading into S3 data lakes. Fivetran's SFTP connector is one example. They generate column names themselves with pattern "column_0", "column_1", etc. There is no option to modify these names. Documentation of it is https://fivetran.com/docs/connectors/files#headerlessfilesoptional

Having the ability to alias could allow us to not only address the columns by meaningful names but also help with portability when/if we migrate to a different acquisition method. I can make a macro to set the column name and alias to be identical making migration fairly trivial.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Refinement Maintainer input needed stale Issues that have gone stale
Projects
None yet
Development

No branches or pull requests

4 participants