Adding namespace fides_meta support for BigQuery datasets #5294

Merged on Sep 25, 2024 (16 commits)

@galvana (Contributor) commented Sep 17, 2024

Closes PROD-2735

Description Of Changes

Updates the BigQueryConnector to use a namespaced query config. This allows queries to specify the correct project and dataset IDs without relying on the dataset value from the connection config.

Code Changes

  • Updated the dataset label for the BigQuery connector form to read Default BigQuery Dataset
  • Added the BigQueryNamespaceMeta schema (project_id and dataset_id)
  • Updated BigQueryConnector and BigQueryQueryConfig
  • Updated the BigQuery tests to run access and erasure requests with:
    • Connection-level dataset ID (previous functionality)
    • Dataset-level project and dataset IDs (new functionality)
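The two modes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `NamespaceMetaSketch` and `qualified_table_name` are hypothetical names, a dataclass stands in for the real `BigQueryNamespaceMeta` schema, and the `project.dataset.table` prefixing order is an assumption based on the `_generate_table_name` excerpt later in the thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NamespaceMetaSketch:
    """Hypothetical stand-in for BigQueryNamespaceMeta (not the real schema)."""

    dataset_id: str
    project_id: Optional[str] = None


def qualified_table_name(
    collection_name: str, namespace_meta: Optional[NamespaceMetaSketch] = None
) -> str:
    """Prefix the table with dataset and project IDs when a namespace is set."""
    if namespace_meta is None:
        # Previous functionality: leave the name unqualified and rely on the
        # dataset configured at the connection level.
        return collection_name
    # New functionality: qualify the table with the dataset-level IDs.
    table_name = f"{namespace_meta.dataset_id}.{collection_name}"
    if namespace_meta.project_id:
        table_name = f"{namespace_meta.project_id}.{table_name}"
    return table_name
```

With no namespace the collection name passes through unchanged, so existing connection-level configurations would keep working under this sketch.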

Steps to Confirm

  • list any manual steps for reviewers to confirm the changes

Pre-Merge Checklist

  • All CI Pipelines Succeeded
  • Documentation:
  • Issue Requirements are Met
  • Update CHANGELOG.md

vercel bot commented Sep 17, 2024

1 skipped deployment: fides-plus-nightly, status Ignored, updated Sep 25, 2024 4:26pm (UTC)

cypress bot commented Sep 17, 2024

fides Run #10153

Run status: Passed #10153
Project: fides
Branch Review: refs/pull/5294/merge
Run duration: 00m 39s
Commit: 0fc2077db2 (Merge 7bd8cd31312e237d19ea44be442ef7d0bd831167 into 5036e58d3c8ed111fd58387352b2...)
Committer: Adrian Galvan

Test results: Failures 0, Flaky 0, Pending 0, Skipped 0, Passing 4

@galvana galvana added the run unsafe ci checks Runs fides-related CI checks that require sensitive credentials label Sep 18, 2024
from fides.api.schemas.base_class import FidesSchema


class BigQueryNamespaceMeta(FidesSchema):
@galvana (Contributor Author) commented:

I'm only using this in a limited scope. It'd be nice to use this to validate datasets when we link them to a specific connection config but I'm removing this from scope for now.

@adamsachs (Contributor) replied:

nice, that's an interesting idea. somewhat related to my comment on the plus PR here

def __init__(
    self,
    node: ExecutionNode,
    namespace_meta: Optional[BigQueryNamespaceMeta] = None,

@galvana (Contributor Author) commented:

Updating the init to include an optional namespace_meta object

@adamsachs (Contributor) replied:

could we make this support a bit more generic? following the pattern we've taken on the D&D side, it feels like we could support (a less strongly-typed) namespace_meta attribute on the SQLQueryConfig generically, and rely on the implementation/subclass to make use of the namespace_meta as it sees fit, i.e. in the datasource-specific way.

the fact that you've already typed the Dataset.fides_meta.namespace as a generic Dict should support this pattern pretty well.

what do you think? maybe it doesn't need to be something we cover now, although i'd kinda like to see it, since i feel like it will only get more cumbersome to implement if we don't update it now

@galvana (Contributor Author) replied:

This is valid, I went ahead and made this change. Let me know what you think of my implementation

src/fides/api/service/connectors/sql_connector.py (outdated, resolved)
tests/fixtures/bigquery_fixtures.py (outdated, resolved)
@galvana galvana marked this pull request as ready for review September 18, 2024 23:27
@galvana galvana requested a review from adamsachs September 18, 2024 23:27
@adamsachs (Contributor) left a comment

@galvana nice clean implementation here, this looks quite good to me! i do have a comment trying to push the envelope a bit on making this a bit more generic such that we can more easily start building the support for this namespace meta for other datasources, which we know we'll have to do.

it's not a must-have if you'd prefer to get this increment in place as you have it, but it's a tweak that i don't think should be too difficult to make right now, and it'll ensure our subsequent implementations for this feature follow a similar pattern. let me know what you think!

for all BigQuery queries to specify the dataset being queried.
"""

project_id: Optional[str] = None

@adamsachs (Contributor) commented:

ok, i think this makes sense, though i will say it's a bit surprising to me - since the BQ dataset does need to be associated with a project, in reality. i know we may not need that for DSR processing, but that seems to be a bit of an impl detail, rather than a fact about the dataset itself - and i feel like the dataset definition should try to describe the dataset itself as accurately as possible.

that being said, i realize that requiring this may break backward compatibility, and in general leave things less flexible, so i'm not strongly recommending we change it. i'm good with it as it is, ultimately - just wanted to throw in my 2 cents and see what you think on this.

@galvana (Contributor Author) replied:

You're right, I think it'd be better for this to be explicit 👍
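
One way to make the project explicit, as discussed above, is to require both IDs when the namespace is parsed. The sketch below is an assumption about what "explicit" could look like, not the change that was merged: `StrictNamespaceMeta` and `from_dict` are hypothetical names, and a dataclass stands in for the real schema class.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class StrictNamespaceMeta:
    """Hypothetical namespace schema where both IDs are required."""

    project_id: str
    dataset_id: str

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "StrictNamespaceMeta":
        # Reject namespaces that omit (or leave empty) either identifier.
        missing = [key for key in ("project_id", "dataset_id") if not raw.get(key)]
        if missing:
            raise ValueError(f"Missing namespace keys: {', '.join(missing)}")
        return cls(project_id=raw["project_id"], dataset_id=raw["dataset_id"])
```

Datasets without any namespace would still be handled by the connection-level default, so requiring `project_id` here only affects datasets that opt in to the namespace.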

src/fides/api/service/connectors/query_config.py (outdated, resolved)
requirements.txt (outdated, resolved)

namespace_meta: Optional[BigQueryNamespaceMeta] = None

if raw_meta := SQLConnector.get_namespace_meta(db, node.address.dataset):
    namespace_meta = BigQueryNamespaceMeta(**raw_meta)

@adamsachs (Contributor) commented:

right so what i'm thinking is that instead of "casting" the raw_meta here, we'd cast it within the scope of the BigQueryQueryConfig, such that we could initialize a generic namespace_meta attribute generically on the SQLConnector base class

@galvana (Contributor Author) replied Sep 24, 2024:

I updated it to this

db: Session = Session.object_session(self.configuration)
return BigQueryQueryConfig(
    node, SQLConnector.get_namespace_meta(db, node.address.dataset)
)

The SQLLikeQueryConfig superclass validates the meta dict against the namespace_meta_schema defined on BigQueryQueryConfig

class SQLLikeQueryConfig(QueryConfig[T], ABC):
    """
    Abstract query config for SQL-like languages (that may not be strictly SQL).
    """

    namespace_meta_schema: Optional[Type[NamespaceMeta]] = None

    def __init__(self, node: ExecutionNode, namespace_meta: Optional[Dict] = None):
        super().__init__(node)
        self.namespace_meta: Optional[NamespaceMeta] = None

        if namespace_meta is not None:
            if self.namespace_meta_schema is None:
                raise MissingNamespaceSchemaException(
                    f"{self.__class__.__name__} must define a namespace_meta_schema when namespace_meta is provided."
                )
            try:
                self.namespace_meta = self.namespace_meta_schema.model_validate(
                    namespace_meta
                )
            except ValidationError as exc:
                raise ValueError(f"Invalid namespace_meta: {exc}")
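
To show how a subclass might plug into the pattern above, here is a self-contained sketch that uses only the standard library. All class names (`NamespaceMetaBase`, `BigQueryMetaSketch`, `QueryConfigSketch`, `BigQueryConfigSketch`) are hypothetical; dataclasses stand in for the real pydantic-style schemas, and a plain `TypeError` stands in for `MissingNamespaceSchemaException`.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Type


@dataclass
class NamespaceMetaBase:
    """Hypothetical base for datasource-specific namespace metadata."""


@dataclass
class BigQueryMetaSketch(NamespaceMetaBase):
    project_id: str
    dataset_id: str


class QueryConfigSketch:
    # Subclasses declare which schema validates their namespace dict.
    namespace_meta_schema: Optional[Type[NamespaceMetaBase]] = None

    def __init__(self, namespace_meta: Optional[Dict[str, Any]] = None) -> None:
        self.namespace_meta: Optional[NamespaceMetaBase] = None
        if namespace_meta is None:
            return
        if self.namespace_meta_schema is None:
            raise TypeError(f"{type(self).__name__} must define namespace_meta_schema")
        try:
            # Missing or unexpected keys make the dataclass raise TypeError.
            self.namespace_meta = self.namespace_meta_schema(**namespace_meta)
        except TypeError as exc:
            raise ValueError(f"Invalid namespace_meta: {exc}") from exc


class BigQueryConfigSketch(QueryConfigSketch):
    namespace_meta_schema = BigQueryMetaSketch
```

Under this arrangement the connector can pass the raw dict straight through, and each datasource only has to declare its own schema class.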

@galvana (Contributor Author) left a comment

Thanks for the review @adamsachs. Here's my attempt at making some of the namespace meta logic a bit more generic.

@adamsachs (Contributor) left a comment

nice touch up @galvana! this looks good to go from my POV 👍

@@ -832,8 +844,9 @@ def _generate_table_name(self) -> str:

table_name = self.node.collection.name
if self.namespace_meta:
table_name = f"{self.namespace_meta.dataset_id}.{table_name}"
if project_id := self.namespace_meta.project_id:
bigquery_namespace_meta = cast(BigQueryNamespaceMeta, self.namespace_meta)

@adamsachs (Contributor) commented:

nit: this is to satisfy mypy, yeah? i feel like there should be a way to get the self.namespace_meta more properly typed using generics, which feels a bit cleaner. i.e., you've already actually cast this to the right strongly-typed class in the constructor, so it just seems odd you need to 'cast' it again here.

but i also realize that generics can get a bit verbose and convoluted, especially in python...so all good keeping it as you've got it here. probably just my java-trained instincts getting the best of me anyway 😅
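
The generics approach suggested above might look something like the sketch below. It is only an illustration of the idea, not code from this PR: `BaseMeta`, `BigQueryMeta`, `GenericQueryConfig`, and `BigQueryGenericConfig` are hypothetical names, and the real query config classes take an `ExecutionNode` rather than a plain collection name.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar


@dataclass(frozen=True)
class BaseMeta:
    """Hypothetical base class for namespace metadata."""


@dataclass(frozen=True)
class BigQueryMeta(BaseMeta):
    project_id: str
    dataset_id: str


# The type variable ties each config subclass to its own metadata type.
M = TypeVar("M", bound=BaseMeta)


class GenericQueryConfig(Generic[M]):
    def __init__(self, collection_name: str, namespace_meta: Optional[M] = None) -> None:
        self.collection_name = collection_name
        self.namespace_meta: Optional[M] = namespace_meta


class BigQueryGenericConfig(GenericQueryConfig[BigQueryMeta]):
    def generate_table_name(self) -> str:
        # A type checker sees Optional[BigQueryMeta] here, so no cast() is needed.
        meta = self.namespace_meta
        if meta is None:
            return self.collection_name
        return f"{meta.project_id}.{meta.dataset_id}.{self.collection_name}"
```

The trade-off is the one noted in the comment: the type parameter removes the `cast`, at the cost of threading `Generic[M]` through the class hierarchy.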

@galvana galvana merged commit 7cd8f46 into main Sep 25, 2024
14 checks passed
@galvana galvana deleted the PROD-2735-add-namespace-meta-to-bigquery-datasets branch September 25, 2024 16:26
cypress bot commented Sep 25, 2024

fides Run #10154

Run status: Passed #10154
Project: fides
Branch Review: main
Run duration: 00m 41s
Commit: 7cd8f46d9a (Adding namespace fides_meta support for BigQuery datasets (#5294))
Committer: Adrian Galvan

Test results: Failures 0, Flaky 0, Pending 0, Skipped 0, Passing 4

Labels: run unsafe ci checks (Runs fides-related CI checks that require sensitive credentials)
2 participants