Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow BigQuery to default on project name #2908

Merged
merged 12 commits into from
Dec 3, 2020
Merged

Allow BigQuery to default on project name #2908

merged 12 commits into from
Dec 3, 2020

Conversation

max-sixty
Copy link
Contributor

@max-sixty max-sixty commented Nov 24, 2020

resolves #2828

Description

As discussed in the issue.

I couldn't quite get my head around the tests; seems like there's lots of nesting there — what's the easiest way to test this?

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot cla-bot bot added the cla:yes label Nov 24, 2020
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice start @max-sixty!

I ran into an issue testing this out locally.

what's the easiest way to test this?

Fair question. We don't use oauth for our integration tests, since both CircleCI + Azure DevOps connect via service account. I'll think about if we have any way of mocking this.

def get_bigquery_defaults(cls):
""" Returns (credentials, project_id) """
# Cached, because the underlying implementation shells out.
return google.auth.default(scopes=cls.SCOPE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I test this locally, cls.get_bigquery_defaults() seems to return

(<google.oauth2.credentials.Credentials object at 0x112630ca0>, None)

resulting in the database remaining None below, and returning the following error from the google cloud client:

OSError: Project was not passed and could not be determined from the environment.

I don't think this is an issue on my end, but curious to hear what might be different, if you've managed to get this working locally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What auth are you using? I'm using ADC — i.e. running gcloud auth login --update-adc.

What do you get from:

In [1]: from google.auth import default

In [2]: default()
Out[2]: (<google.oauth2.credentials.Credentials at 0x10a19e850>, '{foo_project}')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, this was definitely on me:

In [1]: from google.auth import default

In [2]: default()
Out[2]: (<google.oauth2.credentials.Credentials at 0x10a19e850>, None)

I originally set up local gcloud oauth a while ago, so I updated the auth as you recommended, and the second arg now returns my default project.

Even so, I'm still running into this issue when dbt tries to get information about the None database by running list_None:

google.api_core.exceptions.BadRequest: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/None/datasets?maxResults=10000: Invalid project ID 'None'. Project IDs must contain 6-63 lowercase letters, digits, or dashes. Some project IDs also include domain name separated by a colon. IDs must start with a letter and may not end with a dash.

I think we may need to fill in the missing database/project earlier on—ideally as a __post_init__ on BigQueryCredentials—because dbt needs the database value in more places than just its connection client. This would also be relevant in case users want to make use of the Jinja context variable target.database/target.project.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, good point. I pushed a change. Generally I would have that sort of function at the module level; but put them as methods as it seems to be fairly class-focused atm. Let me know which you prefer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's list_None? What's the test you're running locally to test this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the first things dbt does at the start of a compile/run is grab metadata to populate its relational cache. It will log something to the effect of:

Acquiring new bigquery connection "list_[project_name]".

This was showing up for me as:

Acquiring new bigquery connection "list_None".

But now it looks good! target.database/target.project are also working as expected

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great — thanks for checking @jtcohen6 !

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great! Could you add a changelog entry, and add yourself to the list of contributors?

Looking at #2805, we added some unit tests to mock the BQ connection and ensure that the types and required fields all lined up. We might do well to add the same here, to ensure that dbt is happy with a profile that's missing a database/project.

I don't think we have a way to integration-test this, given our current CI setup. Although it's hardly a substitute for automated testing, I'll start using this default-project profile setup for my local connections to BigQuery.

Also — though I appreciate that you separated out the one-line fix in #2907 — I'd be fine with your pulling that in here, since it more or less touches the same code.

def get_bigquery_defaults(cls):
""" Returns (credentials, project_id) """
# Cached, because the underlying implementation shells out.
return google.auth.default(scopes=cls.SCOPE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the first things dbt does at the start of a compile/run is grab metadata to populate its relational cache. It will log something to the effect of:

Acquiring new bigquery connection "list_[project_name]".

This was showing up for me as:

Acquiring new bigquery connection "list_None".

But now it looks good! target.database/target.project are also working as expected

database: str
# Most DBs have this as required, but BigQuery is Optional, and mypy
# doesn't seem to allow overriding the type in `BigQueryCredentials`
database: Optional[str]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also set database to Optional[str] in dbt-spark, which inherits/replaces the base Credentials dataclass, too. I wonder why mypy doesn't mind it there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point, thanks @jtcohen6 . I think I'm confusing errors — the error it issued was actually Attributes without a default cannot follow attributes with one. Let's try the latest push

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also does repo actually run mypy?
I see a lot of type annotations in the code, but also a lot of errors when running mypy? Mostly in the test path

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW — if these were in the mypy configs explicitly, then mypy wouldn't raise errors in an editor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

""" Returns (credentials, project_id) """
# This method is copied from ` BigQueryConnectionManager`, because it's
# required in both classes.
# We could move this & the scopes to the module level.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I would have that sort of function at the module level; but put them as methods as it seems to be fairly class-focused atm. Let me know which you prefer.

Good point. @gshank @kwigley does either of you have a preference here?

Copy link
Contributor

@kwigley kwigley Dec 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to move this to module level, there's no state involved and it's shared 👍 go for it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool done!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like there's just a few small pep8 errors:

plugins/bigquery/dbt/adapters/bigquery/connections.py:48:1: E302 expected 2 blank lines, found 1
plugins/bigquery/dbt/adapters/bigquery/connections.py:50:4: W291 trailing whitespace
plugins/bigquery/dbt/adapters/bigquery/connections.py:52:1: W293 blank line contains whitespace
plugins/bigquery/dbt/adapters/bigquery/connections.py:53:52: W291 trailing whitespace

max-sixty and others added 2 commits December 2, 2020 10:41
@max-sixty
Copy link
Contributor Author

@jtcohen6
Copy link
Contributor

jtcohen6 commented Dec 2, 2020

Is this failure a flake or causal? https://app.circleci.com/pipelines/github/fishtown-analytics/dbt/1600/workflows/12269ee7-f010-40a0-bcb5-ba2a957d19fe/jobs/33567

Looks like this was just a fluke. Re-running from failed now

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @max-sixty! Thanks for delving into our testing flows as well. Just needs a changelog entry and then this is ready to ship

@max-sixty
Copy link
Contributor Author

Great, done!

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very cool! thanks for the contribution :)

@jtcohen6 jtcohen6 merged commit e7c2422 into dbt-labs:dev/kiyoshi-kuromiya Dec 3, 2020
@max-sixty
Copy link
Contributor Author

Cheers for the guidance @jtcohen6 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use default google cloud project if not supplied?
3 participants