-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CT-2075] [Feature] Add column aliasing (identifiers) for source tables #6929
Comments
@Fleid Love to see you opening Should we do this? Is this preferable to a "staging model" pattern, where users write a bit of SQL (instead of yaml) to rename (& clean) all the columns in the raw source table? That staging model is a lightweight wrapper (materialized as a view, or ephemerally), but then you can also test it, instead of testing the source table directly. Maybe that feels like overkill if you just need to rename one column, but I don't think so. This is also a way of providing yourself with an extra abstraction layer, for handling more significant changes in the raw source table than just a rename / enabling dependency inversion in the case they switch ingestion providers. How would we do this? I'm guessing this would require a new macro, or override to the select * from {{ source('my_source_A, 'a_table_in_A') }} Which currently compiles to (imagining BigQuery syntax): select * from `my_source_A`.`a_table_in_A` But would instead compile to: select * from (
select
my_original_column_name as my_new_column_name,
...
from `my_source_A`.`a_table_in_A`
) |
I am a bit uneasy about this whole idea to be honest (from the top, not your specific proposal that I find elegant). In my mental model, sources are made to surface (or maybe in the future generate...) the EL part of ELT. The more we put T capabilities in them, the more you should think about moving that logic into a model. I like putting restrictions/constraints on the ergonomics, so as a user you can't end in that bad place. We can do that in YAML land because we control the schema of it. We can't in SQL land. Anyway happy to leave this in |
Agreed with the last point! Where I could see this being useful is around the "managed" sources that require some proactive definition & operation from dbt. Even then, I've really hesitated about encoding transformation/business logic into sources because they ought to be an un-opinionated loading step. But I remember several folks asking for this capability in the |
First, I want to say that what @jtcohen6 described(in their initial response in this thread) would save soooooooo much work. I can understand not wanting to do transforms in YML, but it's a) already supported for tables and b) if, in some way, it is limited to only supporting field name substitutions then it still keeps logic out of the YML layer. (Probably a regex to remove special characters from the string value?) From the perspective of having 100+ tables just from one data source, making all those models just to alias columns is such a complete waste of time. Then you have to maintain the table structure in YML for the tests and the SQL to pass the field between transformation phases. Especially at the speed that software develops, handling all the db schema change management is extremely negative ROI. 😢 Second, I would also like to raise a cross-database and/or cross-dialect use cases. For reasons, I have the same ~100 tables in two different database dialects with very different field names. I have tests, etc. already defined for one set of field names. I don't want to copy and paste that YML file, change the field names, copy and paste all the models, change those field names, and then have 4 files to keep in sync(original YML, original SQL, point-in-time YML copy, point in time SQL copy). Also, the additional parsing time (though dbt ~1.5 fixes the bug where it would always re-parse everything instead of just the diff 👏 ). I need a way to alias column names to avoid all that non-DRYness. Now, arguably, this could be done via inheritance(see #6527), but:
Another use case thoughts: Needing to support case-sensitivity of column names because they're in a legacy database that uses capitals and it's being ported to a database that lower-cases column names. Even if you only have to support the legacy database names during the migration, supporting column name aliases saves so much time while sticking to a DRY philosophy. |
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days. |
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers. |
Hello. I do have a scenario where column aliasing would be helpful for source models -- headerless csv file loading into S3 data lakes. Fivetran's SFTP connector is one example. They generate column names themselves with pattern "column_0", "column_1", etc. There is no option to modify these names. Documentation of it is https://fivetran.com/docs/connectors/files#headerlessfilesoptional Having the ability to alias could allow us to not only address the columns by meaningful names but also help with portability when/if we migrate to a different acquisition method. I can make a macro to set the column name and alias to be identical making migration fairly trivial. |
Follow up of the conversation in dbt-labs/dbt-bigquery#365
How about dbt starts supporting column level alias / identifier in sources?
So that if a field is called
my_original_column_name
in the actual source, it will always be replaced bymy_new_column_name
in every dbt contexts.From there,
my_new_column_name
is replaced bymy_original_column_name
in every DDL/DML/DQL statements issued to the database where the source table is ref'd (used in FROM or JOIN).In dbt land, the name of that column is always
my_new_column_name
. This includes when dbt generates database objects based of the original schema (see original issue, persisting test results indbt__test_audit
). As far as dbt knows, this column was always namedmy_new_column_name
.Describe alternatives you've considered
Alternatives are described in the original issue.
Who will this benefit?
Defining tests on sources that surface system columns, trigger the creation of database objects using reserved keywords. This would allow dbt to bypass that.
The text was updated successfully, but these errors were encountered: