Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bigquery catalog generation (#830) #857

Merged
merged 25 commits into from
Jul 18, 2018

Conversation

beckjake
Copy link
Contributor

@beckjake beckjake commented Jul 17, 2018

Add catalog/manifest generation and tests for BigQuery, fixes #830 . This one was easy, Drew already did the hard work in the issue comments!

Note that the commit history looks terrible because of some ugly branching and merging, but the diff is right.

@beckjake beckjake requested a review from drewbanin July 17, 2018 14:54
@@ -287,3 +287,63 @@ def test__snowflake__run_and_generate(self):

self.verify_catalog(expected_catalog)
self.verify_manifest()

@attr(type='bigquery')
def test__bigquery__run_and_generate(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should add another test for a table with nested records. i'm not sure how to do that in the model SQL but i think this is potentially a difference in how BQ works vs. other warehouses.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great point @cmcarthur!

I usually test with an array of structs, eg:

select
  [
    struct(
      'Drew' as first_name,
      'Banin' as last_name
    ),
    struct(
      'Connor' as first_name,
      'McArthur' as last_name
    )
  ] as people

I don't think we'll be able to use a CSV file to construct nested or repeated records.

Regarding behavior, I think there are two options:

BigQuery exposes nested fields (ie. the structs above) as:

people             RECORD (REPEATED)
people.first_name  STRING
people.last_name   STRING

We have a function to flatten the nested fields here: https://github.com/fishtown-analytics/dbt/blob/development/dbt/schema.py#L103

The other option to to just show a single field, but render out a complete type. For people above, that type would be:

ARRAY<STRUCT<first_name STRING, last_name STRING>>

There is new logic to build this type name in my bq-incremental-and-archive branch, over here: https://github.com/fishtown-analytics/dbt/blob/feature/bq-incremental-and-archive/dbt/schema.py#L170

I think the first option is probably preferable, but I'm open to discussion!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I've implemented this test and fixed the code sucessfully - good thing you asked, it definitely did not work and I fixed it. I went with @drewbanin's flatten() suggestion.

@drewbanin
Copy link
Contributor

@beckjake :shipit:

@beckjake beckjake merged commit e5bc9c0 into dev/isaac-asimov Jul 18, 2018
@beckjake beckjake deleted the bigquery-catalog-generation branch July 18, 2018 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants