(#1576) use the information schema on BigQuery #1795
Conversation
@drewbanin, I've run the doc generation, but it fails because we have some uppercase letters in our dataset names. The SQL query that is reported as failing (a sample of which is shown below) contains dataset names (in all CTEs) that are both lowercased and quoted:

    (
    with tables as (
        select
            project_id as table_database,
    ...
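For illustration (the dataset name below is made up): BigQuery dataset IDs are case-sensitive, so lowercasing a name before quoting it yields an identifier that no longer matches the real dataset:

    # Hypothetical illustration -- not dbt code. BigQuery dataset IDs are
    # case-sensitive, so lowercasing before quoting breaks the reference.
    actual_dataset = 'My_Dataset'                     # name as created in BigQuery
    rendered = '`{}`'.format(actual_dataset.lower())  # what the catalog SQL rendered
    assert rendered == '`my_dataset`'                 # != `My_Dataset` -> query fails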
Really good point @mr2dark! These quoting/capitalization configs can be tricky for us to reconcile in dbt. I just pushed a quick fix that will always quote the project and dataset name, but I don't think that's exactly correct here either. This is certainly tractable for us, but I'll need to spend a little more time here to get it exactly right. If you're so inclined, feel free to grab the latest commit and let me know if it happens to work for you :)
@drewbanin Nothing affected the resulting catalog query that time. It works for me now with the latest commit, and it takes about 15 seconds.
Force-pushed from 230c177 to 4c624d0
test/integration/base.py (Outdated)

@@ -1157,6 +1157,13 @@ def __eq__(self, other):
        return isinstance(other, float)


class AnyString:
    """Any string. Use this in assertEqual() calls to assert that it is a float.
... to assert that it is a str.
This is awesome! I outlined some structural changes I'd recommend to the adapter to bring it in line with the others and let us use a common get_catalog everywhere. But just moving the catalog into SQL is so nice!
        'database': True,
        'schema': True
    }
))
The existing get_catalog's call to _get_cache_schemas should do this, and if it doesn't, we should fix it for BigQuery! You might have to override the BigQueryRelation.information_schema method in some way.
The big difference from the other get_catalog implementations here is that the BigQuery information schema is (usually) addressed with:

    `project-id`.`dataset`.INFORMATION_SCHEMA.COLUMNS

ie. the information_schema is affixed to a dataset (schema), not a project (database). The exception is SCHEMATA, which is addressed at the project level:

    `project-id`.INFORMATION_SCHEMA.SCHEMATA

Do you still think we should override get_cache_schemas? I think I might also need to override SchemaSearchMap to return information schema Relations that BigQuery is happy with.
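For reference, a minimal sketch of those two addressing forms (the helper name and signature are invented for illustration; this is not dbt's actual API):

    def information_schema_path(project, dataset, view):
        """Render a BigQuery INFORMATION_SCHEMA reference (hypothetical helper).

        SCHEMATA hangs off the project; every other view hangs off a dataset.
        """
        if view == 'SCHEMATA':
            return '`{}`.INFORMATION_SCHEMA.SCHEMATA'.format(project)
        return '`{}`.`{}`.INFORMATION_SCHEMA.{}'.format(project, dataset, view)

    # information_schema_path('project-id', 'dataset', 'COLUMNS')
    #   -> `project-id`.`dataset`.INFORMATION_SCHEMA.COLUMNS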
Hmm, ok. In that case, how about just implementing _get_cache_schemas for now as:

    for database, schema in manifest.get_used_schemas():
        yield self.Relation.create(
            database=database,
            schema=schema,
            quote_policy={
                'database': True,
                'schema': True
            }
        )

In the long run, I think we should probably extend BigQuery's Relation subclass to fully account for this quirky interpretation of information_schema, but we can do that at a later date.
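For context, a rough sketch (the base class name is assumed, and this is illustrative rather than the final implementation) of where that override would sit on the BigQuery adapter:

    class BigQueryAdapter(BaseAdapter):  # base class name assumed
        def _get_cache_schemas(self, manifest):
            # Force-quote both identifiers so mixed-case dataset names
            # survive rendering (see the failure report above).
            for database, schema in manifest.get_used_schemas():
                yield self.Relation.create(
                    database=database,
                    schema=schema,
                    quote_policy={'database': True, 'schema': True},
                )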
just a couple of great minds, thinking alike.
kwargs = {'information_schemas': information_schemas}
table = self.execute_macro(GET_CATALOG_MACRO_NAME,
                           kwargs=kwargs,
                           release=True)
The base adapter already does this in get_catalog.
    col.name: col.name.replace('__', ':') for col in table.columns
})

return self._catalog_filter_table(table, manifest)
    def _catalog_filter_table(self, table, manifest):
        # BigQuery doesn't allow ":" chars in column names -- remap them here.
        table = table.rename(column_names={
            col.name: col.name.replace('__', ':') for col in table.columns
        })
        return super()._catalog_filter_table(table, manifest)
        relation_type == 'table',
    )
    return zip(column_names, column_values)

def get_catalog(self, manifest):
Instead of overriding BaseAdapter.get_catalog, we should use its existing behavior and make BigQuery behave more like the other adapters do! I've outlined my thoughts on that below (above, in GitHub's rendering).
Looks great. Is it feasible to use the information schema for listing relations, too? (A separate PR, for sure!)
yep! #1275 :D
Fixes #1576

Work in progress. Use the BigQuery INFORMATION_SCHEMA to fetch the catalog. I actually cheat here and use __TABLES__ (not INFORMATION_SCHEMA.TABLES) because the information in __TABLES__ is a superset of the data in TABLES (ie. row_count and size_bytes).

Couple of things to verify here:
- __TABLES__ -- is that appropriate for us to use? (a sketch of querying it follows below)
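For reference, a minimal sketch (using the google-cloud-bigquery client; the project and dataset names are placeholders) of the extra metadata __TABLES__ exposes beyond INFORMATION_SCHEMA.TABLES:

    from google.cloud import bigquery

    client = bigquery.Client(project='project-id')  # placeholder project
    query = """
        select table_id, row_count, size_bytes  -- row_count/size_bytes are absent from TABLES
        from `project-id`.`dataset`.__TABLES__
    """
    for row in client.query(query):  # iterating waits for the query job to finish
        print(row.table_id, row.row_count, row.size_bytes)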
To document:

Date-sharded tables can be addressed using dbt sources by replacing the date shard suffix with a * in the source specification. When such a source is referenced from a model, dbt will expand the wildcard to match all date shards of the table. Additionally, the auto-generated dbt documentation website will correctly collect statistics about all of the date shards in the table. An example source spec is sketched below.
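To illustrate, a hedged example of such a source specification (the names are hypothetical; this follows the usual dbt sources YAML shape):

    version: 2

    sources:
      - name: my_bigquery_source       # hypothetical source name
        schema: my_dataset             # the dataset holding the shards
        tables:
          - name: events               # how models refer to the source
            identifier: "events_*"     # matches events_20190101, events_20190102, ...

A model would then select from it with {{ source('my_bigquery_source', 'events') }}, and dbt expands the * to cover every shard.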