
Support persist_docs config for columns for all core databases #1722

Closed
darmorashivam opened this issue Sep 4, 2019 · 13 comments

Comments

@darmorashivam

darmorashivam commented Sep 4, 2019

Describe the feature

Add column descriptions to the BigQuery DDL statement that creates a table, using the model's column definitions from the relevant .yml file. Basically, an extension of https://docs.getdbt.com/docs/bigquery-configs#section-persisting-model-descriptions to cover column descriptions as well.

https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#creating_a_new_table

One challenge I see here is that dbt might need to derive the type of each column in the result table; I'm not sure whether that happens today.

@darmorashivam darmorashivam added enhancement New feature or request triage labels Sep 4, 2019
@drewbanin drewbanin removed the triage label Sep 4, 2019
@drewbanin
Contributor

Thanks @shivam-darmora

@mferryRV

mferryRV commented Nov 3, 2019

So I was curious what it would take to do this and found it was beyond my ability to contribute, but I did some spelunking that I'd like to document for anybody picking this up.

The first thing to note is that BigQuery Data Definition Language statements do support specifying a column schema, and each column in that list can carry an options clause with a description. "That means that CREATE SQL statements could include column descriptions, but it would require us to change how the DDL statement is compiled for BigQuery," I said to myself, just before searching the codebase for 20 minutes and not finding any DDL statements.

It looks like dbt is using BigQuery API functionality instead of DDL statements to build tables in a few ways:

  • For views, dbt uses the Table method documented here and just submits SQL. This Table function actually takes an optional schema parameter that is built of many SchemaFields, each of which can contain a column description! While we don't use this today, it suggests that views could have column descriptions if we pass that information into the create_view() function in the BigQuery connector.
  • For non-partitioned materialised tables, dbt submits a SQL query job along with a JobConfig. Based on the JobConfig documentation, we can't pass schema information into the query job, so we would have to make a second API call to upload column descriptions once the table has been created. Bummer.
  • For day-partitioned materialised tables, dbt appears to use the Table API again like it did for views, so it might also take schemas like views could, but I'm not as confident about that one.
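To make the view case concrete: the Table API's schema is a list of SchemaField objects, and the real google.cloud.bigquery.SchemaField constructor does accept a description argument. Here is a minimal pure-Python sketch of how user-supplied descriptions from schema.yml could be attached when building that list; SchemaField is modeled with a namedtuple, and build_schema is a hypothetical helper, not dbt's actual code:

```python
from collections import namedtuple

# Stand-in for google.cloud.bigquery.SchemaField; the real class also
# takes `description` in its constructor. This namedtuple just models
# the shape of the data.
SchemaField = namedtuple("SchemaField", ["name", "field_type", "description"])

def build_schema(columns, descriptions):
    """Build a schema list, attaching any user-supplied descriptions.

    columns: list of (name, type) pairs known to dbt
    descriptions: {column_name: description} parsed from schema.yml
    """
    return [
        SchemaField(name, field_type, descriptions.get(name, ""))
        for name, field_type in columns
    ]

schema = build_schema(
    [("id", "INT64"), ("name", "STRING")],
    {"id": "abc 123"},
)
# A list like this could then be passed as the `schema` argument when
# the connector constructs the Table object for a view.
```

Columns without a documented description simply get an empty description, so partial documentation works fine with this approach.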

This change is beyond my skill level because it might prompt a refactor. Should we use DDL statements across the board? Should we use the Table method across the board? I really don't know.

There are also other ways that dbt interacts with BQ tables beyond just creating new ones, and we would need to account for column descriptions in those as well. It looks like dbt's native Column class is sometimes converted to the google.cloud.bigquery.schema.SchemaField class for BigQuery API calls. The init function for this class does have a parameter for description, but we're not currently calling it.

This SchemaField init function is called when dbt converts its Column class for BigQuery API calls, and when query results are read back into an agate table. To determine if descriptions could work with these methods, we need to answer a few questions, and based on my spelunking, the answers are roadblocks:

  • The BigQueryColumn class does not init with a description, though it could if the dbt.adapters.base.Column definition contained a description, which it apparently does not. It seems like this would require a change to the Column class, so the feature is no longer a BigQuery-specific addition and I wouldn't feel comfortable contributing without some discussion of how column descriptions work on other platforms.
  • The agate table is a dataframe defined by the agate Python library and its table method does not allow for column descriptions. Adding descriptions here would require a different data structure, and again that change would not be BigQuery-specific. That said, I don't know what dbt does with a dataframe read in from BigQuery, but I imagine it's focused on the contents of the table not the contents of the column description, so changing this might not be useful.

In closing, adding column descriptions looks like a PITA but hopefully somebody reads this, says "actually that doesn't sound too bad", and actually goes for it 🙏 🙏 🙏

@drewbanin
Contributor

Hey @mferryRV - thanks for this very thoughtful and well-researched comment! My ultimate takeaway is indeed "actually that doesn't sound too bad" :p

Couple of things to note:
The use of the table API is a vestige of BigQuery's historically poor support for date partitioned tables. Before column partitioning was released, dbt needed to create an ingestion-time partitioned table (using the table API) and then create partitions with a date decorator for each date specified in the model. We intend to remove this bit of functionality in a future release - column partitioning is unreservedly better for dbt's needs.

dbt does execute DDL to create tables and views on BigQuery! This DDL looks something like:

create or replace table dbt_dbanin.debug_comments as
(
  select
    1 as id,
    'drew' as name
);

Instead, dbt could build a DDL query that looks more like:

create or replace table dbt_dbanin.debug_comments (
  id int64 options(description="abc 123"),
  name string options(description="def 456")
)
options (
  description = "my table description"
)
as
(
  select
    1 as id,
    'drew' as name
);

This example will create a table with descriptions for both the specified columns and the table itself. The really challenging thing here is that all of the columns in the table need to be specified in the table schema. We could not, for instance, supply the schema for the id column but omit the name column in the example above.
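Rendering that second DDL shape is mechanical once every column's type is known, which is exactly why the full-column-list requirement bites. A sketch of such a renderer (the function and parameter names are hypothetical, not dbt's):

```python
def render_create_table(relation, columns, select_sql, table_description=None):
    """Render a `create or replace table ... as (...)` statement with
    column-level options(description=...) clauses.

    columns: list of (name, type, description) triples. Every column in
    the final table must appear here, because BigQuery requires the full
    column list whenever any column option is supplied.
    """
    col_lines = ",\n".join(
        f'  {name} {col_type} options(description="{desc}")' if desc
        else f"  {name} {col_type}"
        for name, col_type, desc in columns
    )
    opts = ""
    if table_description:
        opts = f'\noptions (\n  description = "{table_description}"\n)'
    return (
        f"create or replace table {relation} (\n{col_lines}\n){opts}"
        f"\nas (\n{select_sql}\n)"
    )

ddl = render_create_table(
    "dbt_dbanin.debug_comments",
    [("id", "int64", "abc 123"), ("name", "string", None)],
    "  select 1 as id, 'drew' as name",
    table_description="my table description",
)
```

Note the renderer still has no way to fill in a type the user didn't provide, which is the crux of the two options below.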

This sort of leaves us with two options:

  1. Require users to specify every column and their types in schema.yml files for models which should contain column-level comments
  2. Do some SQL magic to "guess" the types of the specified columns and then merge those column schemas with any user-provided entries from schema.yml files

Neither of these options sounds super pleasant IMO. The first option is really tractable for us to implement, but puts a pretty big burden on dbt users! There are some places where I think enumerating a full schema for a model is a good idea, but I hesitate to require something that verbose just to get column descriptions working.

The other broad approach is to create the table, then to iterate through the list of columns that contain descriptions in the schema.yml file. I believe we can fetch the table schema from BigQuery, modify it to include user-supplied comments, then push those descriptions back to BigQuery. Other databases support alter table comment on columns id to '....' syntax - this approach is somewhat analogous and would provide us a path forwards for implementing similar functionality on other databases.

We can totally add a comment or description field to the Column class, or we could implement it specifically for BigQueryColumns - either approach is appropriate here I think. Long term, I'd like to support this functionality on all databases where it's tractable.

Last, you can ignore Agate in this context - I don't think it's going to play much of a role in the implementation here.

Thanks! Super happy to follow up on this - let me know if there's anything else I can clarify, or if there's anything important I missed here :)

@mferryRV

mferryRV commented Nov 19, 2019

Thanks for digging into this, Drew!

My feeling is that creating the table and then iterating through the columns to add descriptions introduces the feature without making the functionality less intuitive. For example, I'd hate to see errors like "Sorry, you need to specify column descriptions for all columns" or "Sorry, we guessed the column type wrong and now it doesn't work".

Iterating through after the table already exists would make for more intuitive errors - the table still gets built and the secondary concern of column documentation is addressed later. It also feels like it might be faster than interpreting the column type for every column.

@drewbanin
Contributor

I think I'm with you here @mferryRV! Requiring users to specify all of the columns in a table is a good feature in its own right, but it's definitely something that should be opted into (#1570).

I think we can implement this differently on different databases. BigQuery will probably let us update the column comments with a single API call, whereas we might need to dispatch one alter table ... comment on column .... is '...' query per column. Such is life!
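For the databases that take one statement per column, that dispatch boils down to generating one comment statement per documented column. A sketch using the standard `comment on column ... is '...'` form (the exact syntax varies by database, and the helper name is hypothetical):

```python
def column_comment_statements(relation, descriptions):
    """Yield one `comment on column` statement per documented column.

    descriptions: {column_name: description} from schema.yml. Columns
    without descriptions are simply skipped, so the table builds fine
    even when only some columns are documented.
    """
    for column, description in descriptions.items():
        escaped = description.replace("'", "''")  # basic SQL quote escaping
        yield f"comment on column {relation}.{column} is '{escaped}'"

stmts = list(column_comment_statements(
    "analytics.users",
    {"id": "surrogate key", "name": "the user's name"},
))
```

Because each statement stands alone, a failure to apply one comment can be surfaced as a warning without rolling back the table build, which matches the "more intuitive errors" goal above.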

@bodschut
Contributor

Hi

Did someone make any progress on this topic? We are also looking into how to provide column level descriptions to our dbt models that we deploy in BigQuery.

thanks
Bob

@drewbanin
Contributor

Hi @bodschut - no progress to-date that I'm aware of. My current thinking is that we should:

  1. Create the table without column comments
  2. Fetch the table schema from the BigQuery API
  3. Update the table schema with descriptions from the corresponding .yml file
  4. Save the table schema to persist these changes back to BigQuery

I think that would work, and it appears to be a best-of-both worlds approach as described above. I think that a proof-of-concept like this would entail ~80% of the work required to implement the eventual pull request, so if anyone here is keen to dig in, I'd be happy to support you however I can!
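The four steps above can be sketched end to end. Here the warehouse interaction is faked with plain dicts; the real flow would fetch and save the schema through the BigQuery API (e.g. the client's get_table/update_table calls), and the function name is hypothetical:

```python
def persist_column_docs(fetched_schema, yml_descriptions):
    """Steps 2-4 of the plan: take a schema fetched from the warehouse
    (step 2), merge in descriptions from the model's .yml file (step 3),
    and return the schema to save back (step 4). Step 1, creating the
    table, happens beforehand and needs no column comments at all.

    fetched_schema: list of {"name": ..., "type": ..., "description": ...}
    yml_descriptions: {column_name: description}
    """
    merged = []
    for field in fetched_schema:
        new = dict(field)  # leave the fetched schema untouched
        if field["name"] in yml_descriptions:
            new["description"] = yml_descriptions[field["name"]]
        merged.append(new)
    return merged

schema = persist_column_docs(
    [{"name": "id", "type": "INT64", "description": ""},
     {"name": "name", "type": "STRING", "description": ""}],
    {"id": "abc 123"},
)
```

Because the types come from the fetched schema rather than from the user, no column enumeration or type guessing is required, which is what makes this the best-of-both-worlds approach.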

@sfc-gh-aseitz
Contributor

@drewbanin I am interested in getting this implemented for Snowflake as well so there is a chance in the next few weeks to months I might give it a stab.

It would be SUPER cool to have our Worksheets 4.0 (Numeracy) be able to have tight integration with column and table comments (or some higher-order construct of internal documentation).

@drewbanin drewbanin changed the title Provide a way to add description to columns in BigQuery tables. Provide a way to add description to columns in tables. Feb 21, 2020
@drewbanin
Contributor

Thanks @snowflakeseitz - I updated the title and labels on this issue. We should support this on every database we can. I really agree that tight editor integrations would be.... tight!

Snowflake makes this pretty tractable -- we can specify comments for multiple columns all at once. BigQuery is going to require usage of the rest API (as far as I understand), and Redshift is an MPP database.

We're currently operating with partial support for the persist_docs config across the core-supported databases. I'd like to just take the time to round that feature out and support this everywhere we can! @snowflakeseitz if you're interested in sending through the Snowflake piece of this, we'd certainly love to merge that PR :)

Adding this to the next feature release regardless (codenamed Octavius Catto).

@drewbanin drewbanin added this to the Octavius Catto milestone Feb 21, 2020
@mjhoshea

@drewbanin
This is how I have tackled it: I create the table via dbt and then, as part of an Airflow DAG, fetch the schema and run a script to update it. Let me know how I can help get this feature into dbt :)

@drewbanin drewbanin changed the title Provide a way to add description to columns in tables. Support perist_docs for columns for all databases Apr 6, 2020
@drewbanin drewbanin changed the title Support perist_docs for columns for all databases Support persist_docs config for columns for all core databases Apr 6, 2020
@sfc-gh-aseitz
Contributor

@drewbanin I made some progress on the Snowflake connector side and got my docs added to column descriptions. I would love to set up some time to review before making any more effort.

@drewbanin
Contributor

Thanks @snowflakeseitz! Going to check this PR out in more detail again today!

For everyone else - check out this issue which better tracks database-specific implementations for both relation-level and column-level docs persistence: #1573

We'd like to get all of these in for the next release, 0.17.0, so if anyone is interested in picking up support for BQ/pg/Redshift, let us know in the relevant issue!

@drewbanin drewbanin modified the milestones: Octavius Catto, dbt-next Apr 29, 2020
@drewbanin
Contributor

persist docs is shipping for all core plugins (at the relation and column level, where possible/applicable) in v0.17.0 🎉

Thanks all for your help with making this happen!
