-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add job_id
, location
, project
, and query_id
properties on RowIterator
#1733
Conversation
…on `RowIterator` These can be used to recover the original job metadata when `RowIterator` is the result of a `QueryJob`.
google/cloud/bigquery/table.py
Outdated
@@ -1723,7 +1766,7 @@ def to_arrow_iterable( | |||
|
|||
bqstorage_download = functools.partial( | |||
_pandas_helpers.download_arrow_bqstorage, | |||
self._project, | |||
self._bqstorage_project, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we now have two possible project values, does this mean a RowIterator
can correspond to different projects for storage APIs compared to other APIs?
@property | ||
def _bqstorage_project(self) -> Optional[str]: | ||
"""GCP Project ID where BQ Storage API will bill to (if applicable).""" | ||
client = self.client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what's the difference between self.client.project
and self._project
. When the client isn't of storage type, does self.client.project
still reflect the storage project id?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference in other contexts is usually described as "billing project" versus "data project". I've renamed this private property to reflect that. This is easiest to understand in the context of public datasets.
I have permission to query tables in the bigquery-public-data
project and even download the data in bulk with the BigQuery Storage Read API, but I don't have permission to run queries in that project or start a BigQuery Storage Read API session from that project. Instead, I have start the query or bq storage read session in my own project and reference the tables in the other project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow I see! so self._project
is the project for the data/table, and self.client.project
a.k.a. billing project is the user's own project. So I guess this billing project doesn't just apply to storage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, though for queries we're using jobs.getQueryResults
, so the project will always be the billing project since that's where temp results for queries are stored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tabledata.list
is free, but technically we should be overriding the x-goog-user-project header so that the quota is charged to the billing project. (I don't think we're actually doing that, though)
Co-authored-by: Lingqing Gan <lingqing.gan@gmail.com>
@@ -2113,6 +2113,38 @@ def test_constructor_with_dict_schema(self): | |||
] | |||
self.assertEqual(iterator.schema, expected_schema) | |||
|
|||
def test_job_id_missing(self): | |||
rows = self._make_one() | |||
self.assertIsNone(rows.job_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious about testing philosophy.
For this collection of tests that are pretty much the same (i.e. all the tests that .assertIsNone()) is there a benefit in your mind to having each test broken out separately versus providing a parameterized list of inputs and expected outputs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I matched the existing pattern in this test. I do think we could probably combine a lot of these into a single parametrized test.
I worry slightly about the required getattr(property_name)
being too complex for new contributors to understand. In general tests should avoid anything even as complex as an if statement. https://testing.googleblog.com/2014/07/testing-on-toilet-dont-put-logic-in.html In this case, getattr
is a grey area. I wouldn't necessarily count it as "logic"
…on `RowIterator` (googleapis#1733) * feat: add `job_id`, `location`, `project`, and `query_id` properties on `RowIterator` These can be used to recover the original job metadata when `RowIterator` is the result of a `QueryJob`. * rename bqstorage_project to billing project * Update google/cloud/bigquery/table.py Co-authored-by: Lingqing Gan <lingqing.gan@gmail.com> --------- Co-authored-by: Lingqing Gan <lingqing.gan@gmail.com>
These can be used to recover the original job metadata when
RowIterator
is the result of aQueryJob
. This will be especially important after #1722 wherequery_and_wait()
won't return an actualQueryJob
object.Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Towards internal issue 305260214 and related to #589
🦕