ENH: Convert read_gbq() function to use google-cloud-python #25
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master      #25       +/-   ##
===========================================
- Coverage   72.56%   28.97%   -43.59%
===========================================
  Files           4        4
  Lines        1578     1491       -87
===========================================
- Hits         1145      432      -713
- Misses        433     1059      +626
Continue to review full report at Codecov.
Force-pushed from 1381dd7 to a763cf0
pandas_gbq/gbq.py
Outdated
@@ -767,6 +770,116 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None,

    return final_df


def from_gbq(query, project_id=None, index_col=None, col_order=None,
This should simply replace read_gbq; changing the top-level API is a non-starter.
Sure, I can rename mine to read_gbq and add in the query length/bytes processed info. Want me to delete the old read_gbq and related code, or just rename it?
This needs to pass all of the original tests.
Force-pushed from 94065f3 to 2f93f7b
Can you show the output of running the test suite?
pandas_gbq/gbq.py
Outdated
    Parameters
    ----------
    query : str
        SQL-Like Query to return data values
    project_id : str
    project_id : str (optional)
how is this optional?
project (str) – the project which the client acts on behalf of. Will be passed when creating a dataset / job. If not passed, falls back to the default inferred from the environment.
Hm, the doc still says that the project will be inferred from the environment (https://googlecloudplatform.github.io/google-cloud-python/stable/bigquery/client), but I don't think it does anymore in the latest version. Thus, project_id is now required again, which is incredibly annoying as most of the time you're probably OK with the query being associated with the same project. Thoughts on whether we should allow the user to specify a default project env variable or other method? (~/.bigqueryrc looks like it can hold your default project_id, but I don't know if that's deprecated and/or a command-line-only implementation: https://cloud.google.com/bigquery/bq-command-line-tool)
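For illustration, one possible fallback would look like this (just a sketch; the `GOOGLE_CLOUD_PROJECT` variable and the `Client` call reflect general google-cloud conventions, not anything in this PR):

```python
import os
from google.cloud import bigquery

def make_client(project_id=None):
    # Fall back to an environment variable when no project_id is given
    # (GOOGLE_CLOUD_PROJECT is a conventional choice; purely illustrative).
    project_id = project_id or os.environ.get("GOOGLE_CLOUD_PROJECT")
    if project_id is None:
        raise ValueError("A BigQuery project_id is required")
    return bigquery.Client(project=project_id)
```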
pandas_gbq/gbq.py
Outdated
        Google BigQuery Account project ID.
    index_col : str (optional)
        Name of result column to use for index in results DataFrame
    col_order : list(str) (optional)
        List of BigQuery column names in the desired order for results
        DataFrame
    reauth : boolean (default False)
Is this now not needed? If so, then simply mark it as deprecated (and raise a warning if it's passed).
Yeah, not sure how to implement this or whether such behavior can be replicated in the new API (or is even desired). If any folks have thoughts, let me know. Otherwise, if I can't, I'll do as you suggest and raise a warning if reauth is passed.
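A minimal sketch of the suggested deprecation path (the trimmed signature and the choice of warning class are assumptions, not this PR's code):

```python
import warnings

def read_gbq(query, project_id=None, reauth=False, **kwargs):
    # Warn if the deprecated 'reauth' flag is passed; it would have no effect
    # in a google-cloud-bigquery based implementation.
    if reauth:
        warnings.warn(
            "'reauth' is deprecated and currently has no effect",
            FutureWarning, stacklevel=2)
    # ... rest of the query logic ...
```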
pandas_gbq/gbq.py
Outdated
    dialect : {'legacy', 'standard'}, default 'legacy'
        'legacy' : Use BigQuery's legacy SQL dialect.
        'standard' : Use BigQuery's standard SQL (beta), which is
        compliant with the SQL 2011 standard. For more information
        see `BigQuery SQL Reference
        <https://cloud.google.com/bigquery/sql-reference/>`__
why the change here?
From what I can gather from the discussion here (googleapis/google-cloud-python#2765), passing an arbitrary JSON of configuration settings isn't supported in the way it was in the previous Python API. As such, we might as well make the passing of configuration settings a little easier with a dict like so, but I'm happy to consider alternatives.
FYI: this is one reason I'd prefer to use google-auth library directly. #26
@tswast @jasonqng You can get a Credentials object from the JSON with google.oauth2.service_account.Credentials.from_service_account_info(json.loads(key)).
Forgive me for not PRing this sort of thing in directly - I'm totally jammed at the moment (forgive me @jreback too...). But happy to help with any questions on this stuff - we've run through a lot of it over here
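Spelled out, that suggestion looks roughly like this (a sketch; the key-file name is a placeholder, not from the thread):

```python
import json
from google.cloud import bigquery
from google.oauth2 import service_account

# Read the service-account key file contents as a JSON string
# ("service_account_key.json" is a placeholder path).
with open("service_account_key.json") as f:
    key_json = f.read()

credentials = service_account.Credentials.from_service_account_info(
    json.loads(key_json))
client = bigquery.Client(project=credentials.project_id,
                         credentials=credentials)
```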
@MaximilianR Solved the auth with a key in the latest commit. Is this other concern about the config settings still an open issue to be resolved in this PR?
I think this is solved now, thank you @jasonqng
@jasonqng we have to be careful about back-compat here.
You need to update the …
@jreback Yeah, the back-compatibility issues with the authentication are partly why I suggested writing it as a new function, but hopefully we can replicate a form of the pop-up authentication -> refresh token flow with the new API (https://googlecloudplatform.github.io/google-cloud-python/stable/google-cloud-auth.html#user-accounts-3-legged-oauth-2-0-with-a-refresh-token). I might need some help with that if others are more familiar with it. Almost everything else should carry over, so I'm not too concerned with compatibility otherwise.
pandas_gbq/gbq.py
Outdated
    if dialect not in ('legacy', 'standard'):
        raise ValueError("'{0}' is not valid for dialect".format(dialect))
    if private_key:
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = private_key
Not sure if it is nice to set environment variables; can't you just include a custom reference to the private_key, if it exists, when creating the Client on row 540?
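In other words, something along these lines (a sketch of the suggestion, assuming private_key is the path to a service-account JSON key file):

```python
from google.cloud import bigquery
from google.oauth2 import service_account

def get_client(project_id, private_key=None):
    # Pass the key to the Client directly instead of mutating os.environ.
    if private_key:
        credentials = service_account.Credentials.from_service_account_file(
            private_key)
        return bigquery.Client(project=project_id, credentials=credentials)
    return bigquery.Client(project=project_id)
```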
Good suggestion, will do.
@thejens Fixed and verified it works with my service account json key.
requirements.txt
Outdated
@@ -2,3 +2,5 @@ pandas
httplib2
google-api-python-client
oauth2client
google-cloud
google-cloud is not the most stable package (I've noticed), so I'd require a specific version.
That's fine, you can pin specifically: ``google-cloud==0.24.0`` (or whatever).
You can also be more specific and only require the BigQuery library with google-cloud-bigquery: https://github.com/GoogleCloudPlatform/google-cloud-python/tree/master/bigquery
+1 to using google-cloud-bigquery. Also, the google-cloud-bigquery package is still beta, meaning there will likely be breaking changes, so we'd have to pin a very specific version number to maintain stability.
Fixed and pinned google-cloud-bigquery to 0.25.0, since 0.26.0 breaks things.
pandas_gbq/gbq.py
Outdated
@@ -603,106 +458,56 @@ def delete_and_recreate_table(self, dataset_id, table_id, table_schema):
        table.create(table_id, table_schema)
        sleep(delay)


def _parse_data(schema, rows):
In test_gbq.py, there are several tests that have gbq._parse_data(...). Could you update test_gbq.py as well? See the tests with the prefix test_should_return_bigquery*(...).
Yup, will tackle this weekend hopefully. Thanks!
I recommend closing this PR. The … We should revisit this after the …
@jreback In light of @tswast's comment, should we just close this or should we go back to building this as a separate function (e.g. …)?
We can easily just pin to a specific version if API stability is a concern, but in general I don't see this as a big deal. No reason to wait for a 1.0; pandas itself is not even 1.0 and lots of people use / depend on it.
pandas_gbq/gbq.py
Outdated
By default "application default credentials" are used. | ||
|
||
If default application credentials are not found or are restrictive, | ||
user account credentials are used. In this case, you will be asked to |
Are you sure you want to remove the user account credentials method? I believe it can be made to work with the google-cloud-bigquery library. I'm working on adding a sample to the Google docs that does the user-auth flow.
GoogleCloudPlatform/python-docs-samples#925
That pull request adds a sample which uses https://pypi.python.org/pypi/google-auth-oauthlib to run a query with user account credentials.
I'll be writing some tests and docs around it, but the code should be easy enough to follow.
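For reference, the user-account (3-legged OAuth) flow in that sample looks roughly like this (a sketch using google-auth-oauthlib; the secrets-file name and project ID are placeholders, not this PR's code):

```python
from google_auth_oauthlib import flow
from google.cloud import bigquery

# "client_secrets.json" is a downloaded OAuth client ID file (placeholder name).
appflow = flow.InstalledAppFlow.from_client_secrets_file(
    "client_secrets.json",
    scopes=["https://www.googleapis.com/auth/bigquery"])
credentials = appflow.run_local_server()  # opens a browser for user consent
client = bigquery.Client(project="my-project", credentials=credentials)
```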
@tswast The latest commit uses the existing GbqConnector to generate appropriate credentials. Let me know if anything is inappropriate with the way I've implemented it.
We still have #23 as an open issue. It would be great to move this forward to address that. I recently added a conda recipe for …
@parthea OK, is it worth waiting on this for 0.2.0 to avoid even more changes?
My initial thought is that the milestone for this PR should be 0.3.0, as thorough testing is required. I think ultimately it depends on how soon we can begin testing this PR and whether we are in a hurry to release 0.2.0. @jasonqng Please could you rebase?
Yeah, this PR is still relevant. #39 moves pandas-gbq to use …
@tswast @parthea @jreback Sorry, been swamped these past few months. Hope to scratch out some time this week to incorporate all comments (and also get it working with queries with large results, which it currently fails with). Just checking, any particular reason to rebase vs. a merge? Happy to do the former, I just haven't done a rebase on any collaborative projects, so this would be the first time. (Haha, worst case scenario, I mess up my branch and I just rewrite and open a new PR branched off the new master.)
@jasonqng Thanks for taking care of this! Rebase is preferred because it will allow you to add commits on top of the latest master, which is much nicer to look at during code review.
Also says which libraries are no longer required, for easier upgrades.
@jreback Could you take a look at the docs changes I made? I've documented both the dependencies for 0.3.0 and what they were before (with notes on how they've changed).
This change LGTM, but since I made some contributions to this one I'd like one of the other maintainers to also review before we merge it.
@@ -781,6 +656,14 @@ def verify_schema(self, dataset_id, table_id, schema):
                              key=lambda x: x['name'])
        fields_local = sorted(schema['fields'], key=lambda x: x['name'])

        # Ignore mode when comparing schemas.
        for field in fields_local:
            if 'mode' in field:
not worth changing, but this could be marginally simpler as field.pop('mode', None)
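That is, the loop over fields_local from the diff above could simply be:

```python
# Drop 'mode' from each field whether or not the key is present.
for field in fields_local:
    field.pop('mode', None)
```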
pandas_gbq/gbq.py
Outdated
            dtype_map.get(field['type'].upper(), object)
            for field in fields
        ]
        print(fields)
Do we want to print all the fields here? If we do: elsewhere, this uses self._print, which can transition to logging at some point.
Oops. Good catch. I added that to debug the "ignore mode when comparing schemas" logic. Removed.
                tableId=table_id,
                body=body).execute()
        except HttpError as ex:
        self.client.load_table_from_file(
This is so much better than the existing method
Yeah, it's technically a change in behavior (kicks off a load job instead of using the streaming API), but I think the change is small enough to be worth it. Load jobs should be much more reliable for the use case of this library.
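For context, a load job from an in-memory buffer looks roughly like this (a sketch against the google-cloud-bigquery client API this PR appears to target; the project, dataset, table names, and data are placeholders):

```python
import io
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
job_config = bigquery.LoadJobConfig()
job_config.source_format = "NEWLINE_DELIMITED_JSON"

# Newline-delimited JSON rows held in memory (placeholder data).
buffer = io.BytesIO(b'{"name": "a", "value": 1}\n{"name": "b", "value": 2}\n')
table_ref = client.dataset("my_dataset").table("my_table")

load_job = client.load_table_from_file(buffer, table_ref, job_config=job_config)
load_job.result()  # blocks until the load job completes, unlike streaming inserts
```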
            rows.append(row_dict)
            row_json = row.to_json(
                force_ascii=False, date_unit='s', date_format='iso')
            rows.append(row_json)
This isn't worse than the last version, but this would be much faster if .to_json was called on the whole table rather than on each row, iterating in Python. CSV might be even faster given the reduced space (and pandas can't use nesting or structs anyway), but potentially wait until Parquet is GA to make the jump.
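For illustration, the whole-frame version of that serialization would look something like this (a sketch with placeholder data, not code from this PR):

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"],
                   "ts": pd.to_datetime(["2017-01-01", "2017-01-02"])})

# One call over the whole frame instead of a Python loop calling
# row.to_json() per row; orient="records" with lines=True yields
# newline-delimited JSON suitable for a BigQuery load job.
body = df.to_json(orient="records", lines=True,
                  force_ascii=False, date_unit="s", date_format="iso")
```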
I'd prefer to keep the current behavior for now and do a subsequent PR for any changes like this for performance improvements. I've filed #96 to track the work for speeding up encoding for the to_gbq() method.
100% re doing that separately
                                 field_type)
        for row_num, entries in enumerate(rows):
            for col_num in range(len(col_types)):
                field_value = entries[col_num]
I don't think I'd realized we were looping over all the values in Python before. This explains a lot of why exporting a query to a file on GCS and then reading from that file is an order of magnitude faster. If we could pass rows directly into DataFrame, that would be much faster, but I'm not sure if that's possible.
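Roughly the idea (a sketch with placeholder column names and rows, not a drop-in change):

```python
import pandas as pd

columns = ["name", "value"]   # from the query schema
rows = [("a", 1), ("b", 2)]   # raw result rows from the API

# Hand the rows to pandas in one call instead of looping over every
# value in Python; dtype coercion can then be done column-wise.
df = pd.DataFrame.from_records(rows, columns=columns)
```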
I've filed #97 to track improving the performance in the read case.
                                 field_type)
        for row_num, entries in enumerate(rows):
            for col_num in range(len(col_types)):
                field_value = entries[col_num]
Because this is being called so many times, you may even get a small speed-up from eliminating the assignment to field_value (but these are all things that are either the same as or better than the existing version).
        self._print('Standard price: ${:,.2f} USD\n'.format(
            bytes_processed * self.query_price_for_TB))
            bytes_billed * self.query_price_for_TB))

        self._print('Retrieving results...')
Presumably this is never going to be relevant because the prior part is blocking?
Yes and no. I think it indeed is less relevant, but actually fetching the rows and constructing the dataframe will take non-zero time, especially for larger result sets.
I took a proper read through; though it needs someone like @jreback to approve, I think this strictly dominates the existing version. There are a couple of extremely small tweaks that we can do in a follow-up if not now. There are also some areas for huge speed-ups - IIUC the code is currently running through each value in Python at the moment. In line with that: we've built a function for exporting to a file on GCS and loading that in, which works much better for > 1-2m rows. We can do a PR for that if people are interested, in addition to speeding up the current path.
@jreback I know there's lots going on in pandas, but it would be super if you could take a glance at this. A few follow-ups are dependent on this merging. Thanks very much.
Sure, will look.
LGTM. Some small doc comments that I would address; better to be over-explanatory in the whatsnew.
docs/source/changelog.rst
Outdated
0.3.0 / 2017-??-??
------------------

- Use the `google-cloud-bigquery <https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html>`__ library for API calls instead of ``google-api-client`` and ``httplib2``. (:issue:`93`)
Needs a more comprehensive note here: show what you used to import / depend on, and what it should be now.
you can link to install.rst as well
@@ -181,14 +158,6 @@ class QueryTimeout(ValueError):
    pass


class StreamingInsertError(ValueError):
mention that this is eliminated in whatsnew
Thanks! A couple of things to clean up before we make a release. I'd like to add a couple of tests for some of the other issues we think this PR might address. Plus, I'm not sure it makes sense to do a release right before the holidays.
Description
I've rewritten the current read_gbq() function using google-cloud-python, which handles the naming of structs and arrays out of the box. For more discussion about this, see #23. However, because google-cloud-python potentially uses different authentication flows and may break existing behavior, I've left the existing read_gbq() function and named this new function from_gbq(). If in the future we are able to reconcile the authentication flows and/or decide to deprecate flows that are not supported in google-cloud-python, we can rename this to read_gbq().

UPDATE: As requested in the comment by @jreback (https://github.com/pydata/pandas-gbq/pull/25/files/a763cf071813c836b7e00ae40ccf14e93e8fd72b#r110518161), I deleted the old read_gbq() and named my new function read_gbq(), deleting all legacy functions and code.

Added a few lines to the requirements file, but I'll leave it to you @jreback to deal with the conda dependency issues which you mentioned in Issue 23.

Let me know if there are any questions or if any tests need to be written. You can confirm that it works by running the following:
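(The snippet from the original description isn't preserved in this copy; a minimal call along these lines, with a placeholder project ID and a public dataset, is only a sketch of the kind of check meant.)

```python
from pandas_gbq import gbq

# Placeholder project ID; any project you can bill queries to will do.
df = gbq.read_gbq(
    "SELECT name, SUM(number) AS total "
    "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
    "GROUP BY name ORDER BY total DESC LIMIT 10",
    project_id="my-project",
    dialect="standard",
)
print(df.head())
```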
Confirmed that col_order and index_col still work (feel free to pull that out into a separate function, since there's now redundant code with read_gbq()), and I removed the type conversion lines, which appear to be unnecessary (google-cloud-python and/or pandas appears to do the necessary type conversion automatically, even if there are nulls; you can confirm by examining the datatypes in the resulting dataframes).