-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BigQuery: Get query results as pandas DataFrame #4354
Conversation
I don't personally like the idea of an optional dependency. I guess I try to think of Python as a typed language as much as possible (and hopefully someday it will be). Maybe instead of actually returning a pandas DataFrame, you could (a) expose whatever currently hidden information is needed to efficiently create one, and (b) add a different module that depends on this one and pandas, and creates a DataFrame. |
I don't particularly like it too much, but in this case, the
(you can name the "extra" anything you like, I just chose |
@@ -1949,6 +1949,18 @@ def result(self, timeout=None, retry=DEFAULT_RETRY): | |||
return self._client.list_rows(dest_table, selected_fields=schema, | |||
retry=retry) | |||
|
|||
def to_dataframe(self): |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
return pd.DataFrame(rows, columns=column_headers) | ||
|
||
def __iter__(self): | ||
return iter(self.result()) |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
bigquery/tests/system.py
Outdated
@@ -1235,6 +1235,30 @@ def test_query_future(self): | |||
row_tuples = [r.values() for r in iterator] | |||
self.assertEqual(row_tuples, [(1,)]) | |||
|
|||
def test_query_iter(self): |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
bigquery/tests/system.py
Outdated
self.assertEqual(row_tuples, [(1,)]) | ||
|
||
def test_query_to_dataframe(self): | ||
import pandas as pd |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
bigquery/tests/unit/test_job.py
Outdated
def test_to_dataframe(self): | ||
import pandas as pd | ||
|
||
begun_resource = self._makeResource() |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
bigquery/tests/unit/test_job.py
Outdated
|
||
self.assertIsInstance(df, pd.DataFrame) | ||
self.assertEqual(len(df), 4) | ||
self.assertEqual(list(df), ['name', 'age']) |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
bigquery/tests/unit/test_job.py
Outdated
self.assertEqual(len(df), 4) | ||
self.assertEqual(list(df), ['name', 'age']) | ||
|
||
def test_to_dataframe_w_empty_results(self): |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
7b4d967
to
edecb93
Compare
@@ -1949,6 +1949,15 @@ def result(self, timeout=None, retry=DEFAULT_RETRY): | |||
return self._client.list_rows(dest_table, selected_fields=schema, | |||
retry=retry) | |||
|
|||
def to_dataframe(self): | |||
import pandas as pd |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
bigquery/tests/system.py
Outdated
@@ -1235,6 +1235,23 @@ def test_query_future(self): | |||
row_tuples = [r.values() for r in iterator] | |||
self.assertEqual(row_tuples, [(1,)]) | |||
|
|||
def test_query_to_dataframe(self): |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
bigquery/tests/unit/test_job.py
Outdated
@@ -2720,6 +2720,70 @@ def test_reload_w_alternate_client(self): | |||
self.assertEqual(req['path'], PATH) | |||
self._verifyResourceProperties(job, RESOURCE) | |||
|
|||
def test_to_dataframe(self): |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
+1 to what @dhermes said. We should provide a helpful message if they use a method that requires pandas (I also doubt the folks who want to use this will install bigquery before pandas), and we should provide the extra for pip. |
@@ -1949,6 +1949,15 @@ def result(self, timeout=None, retry=DEFAULT_RETRY): | |||
return self._client.list_rows(dest_table, selected_fields=schema, | |||
retry=retry) | |||
|
|||
def to_dataframe(self): | |||
import pandas as pd |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
def to_dataframe(self): | ||
import pandas as pd | ||
|
||
iterator = self.result() |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
edecb93
to
52a5e5f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you can see, nox -s cover
will fail unless you:
- Explicitly add a "no cover" pragma around these tests and functions (not my preference)
- Install
pandas
into the text environment (session.install('-e', '.[pandas]')
)
def to_dataframe(self): | ||
"""Create a pandas DataFrame from the query results. | ||
|
||
:rtype: ``pandas.DataFrame`` |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
bigquery/setup.py
Outdated
@@ -58,6 +58,10 @@ | |||
'requests >= 2.18.0', | |||
] | |||
|
|||
EXTRAS_REQUIREMENTS = { | |||
'pandas': ['pandas >= 0.3.0'], |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
So @alixhami, after discussion with @jonparrott we have decided it is "preferred" if all new docstrings are in the Google style (vs. traditional Sphinx). I'm happy to help out / chat over hangouts / send you some code snippets. Can you change the newly added docstring to Google style? |
LGTM, but you need to resolve the coverage issues |
@@ -1965,7 +1965,7 @@ def to_dataframe(self): | |||
ValueError: If the `pandas` library cannot be imported. | |||
|
|||
""" | |||
if pandas is None: | |||
if pandas is None: # pragma: NO COVER |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
@alixhami This is good to merge once CI goes green, right? |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, though I think it could be a little more general. I'd love if this was a method of the iterator we return from QueryJob.results()
and Client.list_rows()
.
@@ -1949,6 +1953,29 @@ def result(self, timeout=None, retry=DEFAULT_RETRY): | |||
return self._client.list_rows(dest_table, selected_fields=schema, | |||
retry=retry) | |||
|
|||
def to_dataframe(self): | |||
"""Create a pandas DataFrame from the query results. |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
column_headers = [field.name for field in query_results.schema] | ||
rows = [row.values() for row in query_results] | ||
|
||
return pandas.DataFrame(rows, columns=column_headers) |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
a00c9da
to
1458da6
Compare
… if it isn't installed
d345624
to
0802ca8
Compare
@alixhami This is failing on Python 2.7, distinction being between |
This is a basic implementation of a method converting query results to a pandas DataFrame. I am hoping for comments on how best to implement this. Here is an example of how to use the current implementation:
df = client.query(QUERY).to_dataframe()
The intent is that pandas would be an optional dependency, and would not be required unless the DataFrame functionality is used. This is not yet handled in my current implementation, which is why CircleCI will fail. I ran the tests locally with pandas installed and they pass.
The base of this PR is #4350