Support persist_docs config for columns for all core databases #1722
Thanks @shivam-darmora |
So I was curious what it would take to do this and found it was beyond my ability to contribute, but I did some spelunking that I'd like to document for anybody picking this up. The first thing to note is that BigQuery Data Definition Language statements do accept options on each column in the column schema, and those options can include a description. It looks like dbt is using BigQuery API functionality instead of DDL statements to build tables in two ways:
This change is beyond my skill level because it might prompt a refactor. Should we use DDL statements across the board? Should we use the Table method across the board? I really don't know. There are also other ways that dbt interacts with BQ tables beyond just creating new ones, and we would need to account for column descriptions in those as well. It looks like dbt's native Column class is sometimes converted to the google.cloud.bigquery.schema.SchemaField class for BigQuery API calls. The init function for this class does have a parameter for description, but we're not currently passing it. This SchemaField init function is called when:
To determine if descriptions could work with these methods, we need to answer a few questions:
Based on my spelunking, the answers are roadblocks:
In closing, adding column descriptions looks like a PITA but hopefully somebody reads this, says "actually that doesn't sound too bad", and actually goes for it 🙏 🙏 🙏 |
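To make the SchemaField point above concrete, here is a minimal Python sketch of carrying a description through that kind of conversion. The Column dataclass and the to_schema_field helper are hypothetical stand-ins, not dbt's actual classes. The output dict uses the field keys from BigQuery's REST table schema (name, type, mode, description), and the sketch avoids importing the client library so that it runs on its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    """Hypothetical stand-in for dbt's native Column class."""
    name: str
    dtype: str
    description: Optional[str] = None


def to_schema_field(column: Column) -> dict:
    """Build a REST-style schema field dict, keeping the description if set."""
    field = {"name": column.name, "type": column.dtype, "mode": "NULLABLE"}
    if column.description:
        # This is the piece the comment above says is not being passed today.
        field["description"] = column.description
    return field


columns = [
    Column("id", "INT64", "Primary key"),
    Column("created_at", "TIMESTAMP"),  # undocumented column: no description key
]
schema = [to_schema_field(c) for c in columns]
```

Whether the description would survive every code path that builds a schema is exactly the open question raised above; this only shows the conversion step.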
Hey @mferryRV - thanks for this very thoughtful and well-researched comment! My ultimate takeaway is indeed "actually that doesn't sound too bad" :p Couple of things to note: dbt does execute DDL to create tables and views on BigQuery! This DDL looks something like:
Instead, dbt could build a DDL query that looks more like:
This example will create a table with descriptions for both the specified columns and the table itself. The really challenging thing here is that all of the columns in the table need to be specified in the table schema. We could not, for instance, supply the schema for the This sort of leaves us with two options:
Neither of these options sounds super pleasant IMO. The first option is really tractable for us to implement, but puts a pretty big burden on dbt users! There are some places where I think enumerating a full schema for a model is a good idea, but I hesitate to require something that verbose just to get column descriptions working. The other broad approach is to create the table, then to iterate through the list of columns that contain descriptions in the We can totally add a Last, you can ignore Agate in this context - I don't think it's going to play much of a role in the implementation here. Thanks! Super happy to follow up on this - let me know if there's anything else I can clarify, or if there's anything important I missed here :) |
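For anyone who wants to see why the all-columns-in-the-schema constraint is a burden, here is a rough Python sketch that assembles a BigQuery-style CREATE TABLE statement with column and table descriptions. This is an illustration only, not the DDL dbt generates: build_create_table_ddl, quote_string, and the table and column names are all made up, and the string escaping is deliberately simplistic. Note that the builder has to refuse any column without a data type, which is the part users would have to supply by hand.

```python
from typing import List, Optional, Tuple


def quote_string(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_create_table_ddl(
    table: str,
    columns: List[Tuple[str, Optional[str], Optional[str]]],
    table_description: str,
    select_sql: str,
) -> str:
    """Assemble the DDL; columns holds (name, data_type, description) tuples."""
    lines = []
    for name, data_type, description in columns:
        if data_type is None:
            # Every column needs a type once a column list is given.
            raise ValueError(f"column {name!r} needs a data type")
        line = f"  {name} {data_type}"
        if description:
            line += f" OPTIONS(description={quote_string(description)})"
        lines.append(line)
    column_sql = ",\n".join(lines)
    return (
        f"create or replace table `{table}` (\n{column_sql}\n)\n"
        f"OPTIONS(description={quote_string(table_description)})\n"
        f"as (\n{select_sql}\n)"
    )


ddl = build_create_table_ddl(
    "my-project.analytics.orders",  # hypothetical table name
    [("id", "INT64", "Order id"), ("amount", "NUMERIC", None)],
    "One row per order",
    "select 1 as id, 2 as amount",
)
print(ddl)
```

Leaving the type off any column raises an error here, which mirrors the trade-off described above: either users enumerate every column with a type, or dbt has to work the types out itself.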
Thanks for digging into this, Drew! My feeling is that creating the table and then iterating through the columns to add descriptions introduces the feature without making the functionality less intuitive. For example, I'd hate to see errors like: Iterating through after the table already exists would make for more intuitive errors - the table still gets built and the secondary concern of column documentation is addressed later. It also feels like it might be faster than interpreting the column type for every column. |
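As a rough illustration of the create-then-iterate idea, here is a small Python sketch that emits one statement per documented column once the table exists. It assumes Postgres/Redshift-style `comment on column` syntax; the function names and the relation name are made up, and nothing here is dbt's actual implementation. Undocumented columns are skipped, so a missing description never blocks the table build.

```python
from typing import Dict, List


def quote_literal(value: str) -> str:
    """Single-quote a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def column_comment_statements(relation: str, descriptions: Dict[str, str]) -> List[str]:
    """Return one COMMENT ON COLUMN statement per column that has a description."""
    return [
        f"comment on column {relation}.{name} is {quote_literal(text)};"
        for name, text in descriptions.items()
        if text  # skip columns with an empty description
    ]


statements = column_comment_statements(
    "analytics.orders",  # hypothetical relation
    {"id": "Order id", "status": "Customer's latest status", "amount": ""},
)
for statement in statements:
    print(statement)
```

Because these statements run after the model is built, a failure in one of them would surface as a documentation problem rather than as a failed table build.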
I think I'm with you here @mferryRV! Requiring users to specify all of the columns in a table is a good feature in its own right, but it's definitely something that should be opted into (see #1570). I think we can implement this differently on different databases. BigQuery will probably let us update the column comments with a single API call, whereas we might need to dispatch one |
Hi, has anyone made any progress on this topic? We are also looking into how to provide column-level descriptions for the dbt models that we deploy in BigQuery. Thanks |
Hi @bodschut - no progress to date that I'm aware of. My current thinking is that we should:
I think that would work, and it appears to be a best-of-both-worlds approach as described above. I think that a proof-of-concept like this would entail ~80% of the work required to implement the eventual pull request, so if anyone here is keen to dig in, I'd be happy to support you however I can! |
@drewbanin I am interested in getting this implemented for Snowflake as well so there is a chance in the next few weeks to months I might give it a stab. It would be SUPER cool to have our Worksheets 4.0 (Numeracy) be able to have tight integration with column and table comments (or some higher-order construct of internal documentation). |
Thanks @snowflakeseitz - I updated the title and labels on this issue. We should support this on every database we can. I really agree that tight editor integrations would be.... tight! Snowflake makes this pretty tractable -- we can specify comments for multiple columns all at once. BigQuery is going to require usage of the REST API (as far as I understand), and Redshift is an MPP database. We're currently operating with partial support for the Adding this to the next feature release regardless (codenamed Octavius Catto). |
@drewbanin I made some progress on the Snowflake connector side and got my docs added as column descriptions. I would love to set up some time to review before putting in any more effort |
Thanks @snowflakeseitz! Going to check this PR out in more detail again today! For everyone else - check out this issue which better tracks database-specific implementations for both relation-level and column-level docs persistence: #1573 We'd like to get all of these in for the next release, 0.17.0, so if anyone is interested in picking up support for BQ/pg/Redshift, let us know in the relevant issue! |
persist docs is shipping for all core plugins (at the relation and column level, where possible/applicable) in v0.17.0 🎉 Thanks all for your help with making this happen! |
Describe the feature
Add column descriptions to the BigQuery DDL statement that creates a table, taking them from the model's column definitions in the relevant .yml file. Basically, this is an addition to https://docs.getdbt.com/docs/bigquery-configs#section-persisting-model-descriptions that persists the column descriptions as well.
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#creating_a_new_table
Though one challenge I see here is that dbt might need to derive the type for each column in the result table, not sure if that's being checked right now.