-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding namespace fides_meta support for BigQuery datasets #5294
Adding namespace fides_meta support for BigQuery datasets #5294
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Skipped Deployment
|
fides Run #10153
Run Properties:
|
Project |
fides
|
Branch Review |
refs/pull/5294/merge
|
Run status |
Passed #10153
|
Run duration | 00m 39s |
Commit |
0fc2077db2 ℹ️: Merge 7bd8cd31312e237d19ea44be442ef7d0bd831167 into 5036e58d3c8ed111fd58387352b2...
|
Committer | Adrian Galvan |
View all properties for this run ↗︎ |
Test results | |
---|---|
Failures |
0
|
Flaky |
0
|
Pending |
0
|
Skipped |
0
|
Passing |
4
|
Upgrade your plan to view test results. | |
View all changes introduced in this branch ↗︎ |
from fides.api.schemas.base_class import FidesSchema | ||
|
||
|
||
class BigQueryNamespaceMeta(FidesSchema): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm only using this in a limited scope. It'd be nice to use this to validate datasets when we link them to a specific connection config but I'm removing this from scope for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, that's an interesting idea. somewhat related to my comment on the plus PR here
def __init__( | ||
self, | ||
node: ExecutionNode, | ||
namespace_meta: Optional[BigQueryNamespaceMeta] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updating the init to include an optional namespace_meta
object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we make this support a bit more generic? following the pattern we've taken on the D&D side, it feels like we could support (a less strongly-typed) namespace_meta
attribute on the SQLQueryConfig
generically, and rely on the implementation/subclass to make use of the namespace_meta
as it sees fit, i.e. in the datasource-specific way.
the fact that you've already typed the Dataset.fides_meta.namespace
as a generic Dict
should support this pattern pretty well.
what do you think? maybe it doesn't need to be something we cover now, although i'd kinda like to see it, since i feel like it will only get more cumbersome to implement if we don't update it now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is valid, I went ahead and made this change. Let me know what you think of my implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@galvana nice clean implementation here, this looks quite good to me! i do have a comment trying to push the envelope a bit on making this a bit more generic such that we can more easily start building the support for this namespace meta for other datasources, which we know we'll have to do.
it's not a must-have if you'd prefer to get this increment in place as you have it, but it's a tweak that i don't think should be too difficult to make right now, and it'll ensure our subsequent implementations for this feature follow a similar pattern. let me know what you think!
src/fides/api/schemas/connection_configuration/connection_secrets_bigquery.py
Show resolved
Hide resolved
for all BigQuery queries to specify the dataset being queried. | ||
""" | ||
|
||
project_id: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i think this makes sense, though i will say it's a bit surprising to me - since the BQ dataset does need to be associated with a project, in reality. i know we may not need that for DSR processing, but that seems to be a bit of an impl detail, rather than a fact about the dataset itself - and i feel like the dataset definition should try to describe the dataset itself as accurately as possible.
that being said, i realize that requiring this may break backward compatibility, and in general leave things less flexible, so i'm not strongly recommending we change it. i'm good with it as it is, ultimately - just wanted to throw in my 2 cents and see what you think on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I think it'd be better for this to be explicit 👍
def __init__( | ||
self, | ||
node: ExecutionNode, | ||
namespace_meta: Optional[BigQueryNamespaceMeta] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we make this support a bit more generic? following the pattern we've taken on the D&D side, it feels like we could support (a less strongly-typed) namespace_meta
attribute on the SQLQueryConfig
generically, and rely on the implementation/subclass to make use of the namespace_meta
as it sees fit, i.e. in the datasource-specific way.
the fact that you've already typed the Dataset.fides_meta.namespace
as a generic Dict
should support this pattern pretty well.
what do you think? maybe it doesn't need to be something we cover now, although i'd kinda like to see it, since i feel like it will only get more cumbersome to implement if we don't update it now
from fides.api.schemas.base_class import FidesSchema | ||
|
||
|
||
class BigQueryNamespaceMeta(FidesSchema): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, that's an interesting idea. somewhat related to my comment on the plus PR here
namespace_meta: Optional[BigQueryNamespaceMeta] = None | ||
|
||
if raw_meta := SQLConnector.get_namespace_meta(db, node.address.dataset): | ||
namespace_meta = BigQueryNamespaceMeta(**raw_meta) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right so what i'm thinking is that instead of "casting" the raw_meta here, we'd cast it within the scope of the BigQueryQueryConfig
, such that we could initialize a generic namespace_meta
attribute generically on the SQLConnector
base class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated it to this
db: Session = Session.object_session(self.configuration)
return BigQueryQueryConfig(
node, SQLConnector.get_namespace_meta(db, node.address.dataset)
)
The SQLLikeQueryConfig
super class validates the meta dict based on the namespace_meta_schema
defined at the BigQueryQueryConfig
class SQLLikeQueryConfig(QueryConfig[T], ABC):
"""
Abstract query config for SQL-like languages (that may not be strictly SQL).
"""
namespace_meta_schema: Optional[Type[NamespaceMeta]] = None
def __init__(self, node: ExecutionNode, namespace_meta: Optional[Dict] = None):
super().__init__(node)
self.namespace_meta: Optional[NamespaceMeta] = None
if namespace_meta is not None:
if self.namespace_meta_schema is None:
raise MissingNamespaceSchemaException(
f"{self.__class__.__name__} must define a namespace_meta_schema when namespace_meta is provided."
)
try:
self.namespace_meta = self.namespace_meta_schema.model_validate(
namespace_meta
)
except ValidationError as exc:
raise ValueError(f"Invalid namespace_meta: {exc}")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @adamsachs. Here's my attempt at making some of the namespace meta logic a bit more generic.
for all BigQuery queries to specify the dataset being queried. | ||
""" | ||
|
||
project_id: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I think it'd be better for this to be explicit 👍
def __init__( | ||
self, | ||
node: ExecutionNode, | ||
namespace_meta: Optional[BigQueryNamespaceMeta] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is valid, I went ahead and made this change. Let me know what you think of my implementation
namespace_meta: Optional[BigQueryNamespaceMeta] = None | ||
|
||
if raw_meta := SQLConnector.get_namespace_meta(db, node.address.dataset): | ||
namespace_meta = BigQueryNamespaceMeta(**raw_meta) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated it to this
db: Session = Session.object_session(self.configuration)
return BigQueryQueryConfig(
node, SQLConnector.get_namespace_meta(db, node.address.dataset)
)
The SQLLikeQueryConfig
super class validates the meta dict based on the namespace_meta_schema
defined at the BigQueryQueryConfig
class SQLLikeQueryConfig(QueryConfig[T], ABC):
"""
Abstract query config for SQL-like languages (that may not be strictly SQL).
"""
namespace_meta_schema: Optional[Type[NamespaceMeta]] = None
def __init__(self, node: ExecutionNode, namespace_meta: Optional[Dict] = None):
super().__init__(node)
self.namespace_meta: Optional[NamespaceMeta] = None
if namespace_meta is not None:
if self.namespace_meta_schema is None:
raise MissingNamespaceSchemaException(
f"{self.__class__.__name__} must define a namespace_meta_schema when namespace_meta is provided."
)
try:
self.namespace_meta = self.namespace_meta_schema.model_validate(
namespace_meta
)
except ValidationError as exc:
raise ValueError(f"Invalid namespace_meta: {exc}")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice touch up @galvana! this looks good to go from my POV 👍
@@ -832,8 +844,9 @@ def _generate_table_name(self) -> str: | |||
|
|||
table_name = self.node.collection.name | |||
if self.namespace_meta: | |||
table_name = f"{self.namespace_meta.dataset_id}.{table_name}" | |||
if project_id := self.namespace_meta.project_id: | |||
bigquery_namespace_meta = cast(BigQueryNamespaceMeta, self.namespace_meta) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is to satisfy mypy, yeah? i feel like there should be a way to get the self.namespace_meta
more properly typed using generics, which feels a bit cleaner. i.e., you've already actually cast this to the right strongly-typed class in the constructor, so it just seems odd you need to 'cast' it again here.
but i also realize that generics can get a bit verbose and convoluted, especially in python...so all good keeping it as you've got it here. probably just my java-trained instincts getting the best of me anyway 😅
fides Run #10154
Run Properties:
|
Project |
fides
|
Branch Review |
main
|
Run status |
Passed #10154
|
Run duration | 00m 41s |
Commit |
7cd8f46d9a: Adding namespace fides_meta support for BigQuery datasets (#5294)
|
Committer | Adrian Galvan |
View all properties for this run ↗︎ |
Test results | |
---|---|
Failures |
0
|
Flaky |
0
|
Pending |
0
|
Skipped |
0
|
Passing |
4
|
Upgrade your plan to view test results. | |
View all changes introduced in this branch ↗︎ |
Closes PROD-2735
Description Of Changes
Updates the
BigQueryConnector
to use a namespaced query config. This allows queries to specify the correct project and dataset IDs without relying on the dataset value from theconnectionconfig
.Code Changes
Default BigQuery Dataset
BigQueryNamespaceMeta
schema (project_id
anddataset_id
)BigQueryConnector
andBigQueryQueryConfig
Steps to Confirm
Pre-Merge Checklist
CHANGELOG.md