Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support persist docs on BigQuery: columns #2335

Closed
drewbanin opened this issue Apr 16, 2020 · 9 comments · Fixed by #2402
Closed

Support persist docs on BigQuery: columns #2335

drewbanin opened this issue Apr 16, 2020 · 9 comments · Fixed by #2402
Labels
bigquery enhancement New feature or request

Comments

@drewbanin
Copy link
Contributor

Child issue of #1573

To do:

  • Support persisting docs for columns in tables and views

If the project specifies a column that does not exist in the destination table, the comment on column statement may fail. This is the desired behavior.

BigQuery does not support adding column-level descriptions in DDL unless all columns in the relation are specified. Instead, we should pursue the following approach:

  1. Create a relation without column comments
  2. Fetch the relation schema from the BigQuery API
  3. Update the relation schema with descriptions from the corresponding .yml file
  4. Save the relation schema to persist these changes back to BigQuery
@rpedela-recurly
Copy link

Just like in the generated docs, could the column descriptions also include the schema tests? We are using the schema tests as documentation as well as tests, and I think it would be helpful to show whether a column is UNIQUE, NOT NULL, etc in BigQuery too.

@drewbanin
Copy link
Contributor Author

I think I'd recommend using the dbt documentation website for that purpose. I'm super on board with persisting column descriptions into the database, but I'd hesitate to add a bunch of free-form text generated by dbt at the same time.

What I can imagine is providing tools to generate a list of tests for a column as a string, then include that string in the stated column description. I'm also open to figuring out a way to add this to every column comment so that it happens automatically, but that's out of scope for the feature described in this issue IMO

@jml
Copy link

jml commented Apr 25, 2020

In case it helps, here's the code we use for this:

from typing import Dict, List, Sequence

from google.cloud import bigquery
from google.cloud.bigquery.schema import SchemaField

def update_schema_documentation(
    client: bigquery.Client,
    table_name: str,
    table_description: str,
    field_descriptions: Dict[str, str],
) -> None:  # pragma: no cover
    """Update the documentation for a BigQuery table.

    Known limitations:

    - Does not support nested field definitions

    Feel free to fix these limitations!
    """
    table = client.get_table(table_name)
    table.description = table_description
    table.schema = _document_schema(table_name, table.schema, field_descriptions)
    client.update_table(table, ["description", "schema"])


def _document_schema(
    table_name: str, schema: Sequence[SchemaField], field_descriptions: Dict[str, str]
) -> List[SchemaField]:
    """Create a documented version of the given BigQuery schema."""
    existing_fields = set()
    new_schema = []
    for field in schema:
        description = field_descriptions.get(field.name)
        new_schema.append(
            SchemaField(
                name=field.name,
                field_type=field.field_type,
                mode=field.mode,
                description=description,
                fields=field.fields,
            )
        )
        existing_fields.add(field.name)
    undescribed_fields = existing_fields - set(field_descriptions)
    # TODO: Also raise an exception if we have described fields that don't exist.
    if undescribed_fields:
        # TODO: Raise a more specific exception so we don't have to pass table_name in.
        raise RuntimeError(f"Unexpected fields defined in {table_name}: {undescribed_fields}")
    return new_schema

Tests:

def test_document_schema_undocumented_fields() -> None:
    # _document_schema raises RuntimeError if we are missing documentation for
    # a field.
    with pytest.raises(RuntimeError):
        _document_schema("foo", [SchemaField("name", "INTEGER")], {})


def test_document_schema_empty() -> None:
    # We successfully document an empty schema by returning the same thing.
    assert _document_schema("foo", [], {}) == []


def test_document_schema_adds_docs() -> None:
    # _document_schema returns a new schema with the given descriptions added.
    new_schema = _document_schema(
        "table",
        [SchemaField("foo", "INTEGER"), SchemaField("bar", "STRING")],
        {"foo": "Full of fooness", "bar": "As you'd expect"},
    )
    assert new_schema == [
        SchemaField("foo", "INTEGER", "NULLABLE", "Full of fooness"),
        SchemaField("bar", "STRING", "NULLABLE", "As you'd expect"),
    ]

@drewbanin
Copy link
Contributor Author

thanks @jml - that's super helpful! Are you able to submit a PR with a change like this? It would be super welcomed!

@jml
Copy link

jml commented Apr 28, 2020

I'd genuinely love to, but I doubt I'll have an opportunity to do so any time soon.

@oliverxchen
Copy link

I need this feature, so I'll take a stab at it. But I'm new to the codebase so might be slow... if anybody else wants to work on this please do!

@drewbanin drewbanin added this to the Octavius Catto milestone May 6, 2020
@drewbanin
Copy link
Contributor Author

hey @oliverxchen - we're wrapping up all of the code for v0.17.0, and we anticipate releasing a "release candidate" in the next couple of days! To that end, I submitted #2402 which adds column-level docs persistence for BigQuery.

Just wanted to give you a quick heads up in case this was something you're still chipping away at. Let me know if there's another issue that catches your eye - would be happy to help out with a contribution however I can!

@oliverxchen
Copy link

@drewbanin great, thank you for the feature and up coming release! Sorry I couldn't get this in, was still trying to understand the code. I'll look for other issues I can try to help on

@mathieu-delaunay
Copy link

Hello guys,
Sorry to bother you, but I cannot make persist_docs configuration work on dbt with BigQuery to have persistent column descriptions while I suppose I am using the right syntax:
persist_docs={ "relation": true, "columns": true, "schema": true },
It is working fine for the table descriptions, but not the columns. Is there a known issue about that? Or did I miss something in the documentation?
Thanks a lot for your help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bigquery enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants