BigQuery: Get query results as pandas DataFrame #4354

alixhami · 2017-11-07T20:36:21Z

This is a basic implementation of a method converting query results to a pandas DataFrame. I am hoping for comments on how best to implement this. Here is an example of how to use the current implementation:
df = client.query(QUERY).to_dataframe()

The intent is that pandas would be an optional dependency, and would not be required unless the DataFrame functionality is used. This is not yet handled in my current implementation, which is why CircleCI will fail. I ran the tests locally with pandas installed and they pass.

~~The base of this PR is #4350~~

jba · 2017-11-07T20:40:54Z

I don't personally like the idea of an optional dependency. I guess I try to think of Python as a typed language as much as possible (and hopefully someday it will be).

Maybe instead of actually returning a pandas DataFrame, you could (a) expose whatever currently hidden information is needed to efficiently create one, and (b) add a different module that depends on this one and pandas, and creates a DataFrame.

dhermes · 2017-11-07T20:59:09Z

> I don't personally like the idea of an optional dependency.

I don't particularly like it either, but in this case the pandas dependency is too heavy to burden all users with. We could give a more helpful error message if the import fails. We should also provide an "extra" that allows installing via:

pip install google-cloud-bigquery[pandas-extra]

(You can name the "extra" anything you like; I just picked pandas-extra out of a hat.)
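The two suggestions in this comment, a clearer error when the import fails and an installable extra, could look roughly like the sketch below. The function name `rows_to_dataframe`, the message text, and the `EXTRAS_REQUIRE` variable are illustrative assumptions, and the extra name is just the placeholder from the comment.

```python
# Minimal sketch of an optional pandas dependency (all names are illustrative).
try:
    import pandas
except ImportError:  # pandas is optional, so a missing install is not fatal here
    pandas = None

# Placeholder extra name taken from the comment above; any name would work.
EXTRA_NAME = "pandas-extra"

# In setup.py this mapping would be passed as setuptools' ``extras_require``.
EXTRAS_REQUIRE = {EXTRA_NAME: ["pandas"]}


def rows_to_dataframe(rows, column_names):
    """Build a DataFrame from row tuples, or explain how to get pandas."""
    if pandas is None:
        raise ValueError(
            "The pandas library is required for this feature. Install it "
            "with: pip install google-cloud-bigquery[{}]".format(EXTRA_NAME)
        )
    return pandas.DataFrame(rows, columns=column_names)
```

Users who never call the DataFrame helper pay no import cost, and those who do get an actionable message instead of a bare `ImportError`.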

bigquery/google/cloud/bigquery/job.py

@@ -1949,6 +1949,18 @@ def result(self, timeout=None, retry=DEFAULT_RETRY):
        return self._client.list_rows(dest_table, selected_fields=schema,
                                      retry=retry)

+    def to_dataframe(self):


bigquery/google/cloud/bigquery/job.py

+        return pd.DataFrame(rows, columns=column_headers)
+
+    def __iter__(self):
+        return iter(self.result())


bigquery/tests/system.py

@@ -1235,6 +1235,30 @@ def test_query_future(self):
        row_tuples = [r.values() for r in iterator]
        self.assertEqual(row_tuples, [(1,)])

+    def test_query_iter(self):


bigquery/tests/system.py

+        self.assertEqual(row_tuples, [(1,)])
+
+    def test_query_to_dataframe(self):
+        import pandas as pd


bigquery/tests/unit/test_job.py

+    def test_to_dataframe(self):
+        import pandas as pd
+
+        begun_resource = self._makeResource()


bigquery/tests/unit/test_job.py

+
+        self.assertIsInstance(df, pd.DataFrame)
+        self.assertEqual(len(df), 4)
+        self.assertEqual(list(df), ['name', 'age'])


bigquery/tests/unit/test_job.py

+        self.assertEqual(len(df), 4)
+        self.assertEqual(list(df), ['name', 'age'])
+
+    def test_to_dataframe_w_empty_results(self):


bigquery/google/cloud/bigquery/job.py

@@ -1949,6 +1949,15 @@ def result(self, timeout=None, retry=DEFAULT_RETRY):
        return self._client.list_rows(dest_table, selected_fields=schema,
                                      retry=retry)

+    def to_dataframe(self):
+        import pandas as pd


bigquery/tests/system.py

@@ -1235,6 +1235,23 @@ def test_query_future(self):
        row_tuples = [r.values() for r in iterator]
        self.assertEqual(row_tuples, [(1,)])

+    def test_query_to_dataframe(self):


bigquery/tests/unit/test_job.py

@@ -2720,6 +2720,70 @@ def test_reload_w_alternate_client(self):
        self.assertEqual(req['path'], PATH)
        self._verifyResourceProperties(job, RESOURCE)

+    def test_to_dataframe(self):


theacodes · 2017-11-09T21:20:36Z

+1 to what @dhermes said. We should provide a helpful message if they use a method that requires pandas (I also doubt the folks who want to use this will install bigquery before pandas), and we should provide the extra for pip.

bigquery/google/cloud/bigquery/job.py

+    def to_dataframe(self):
+        import pandas as pd
+
+        iterator = self.result()


dhermes

As you can see, `nox -s cover` will fail unless you do one of the following:

- Explicitly add a "no cover" pragma around these tests and functions (not my preference)
- Install pandas into the test environment (`session.install('-e', '.[pandas]')`)
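The second option could be sketched as below. The `session.install('-e', '.[pandas]')` call comes from the comment; the other packages, the pytest arguments, and the `RecordingSession` stand-in are assumptions made for illustration. In a real noxfile the function would carry the `@nox.session` decorator, which is left out here so the sketch runs without nox installed.

```python
# Sketch of a coverage session body; in a real noxfile this function would be
# decorated with ``@nox.session``.
def cover(session):
    """Install the package with its pandas extra, then run covered tests."""
    # Test tooling (an assumed set; adjust to the project's actual tools).
    session.install("mock", "pytest", "pytest-cov")
    # Installing the extra pulls in pandas, so the DataFrame tests can run.
    session.install("-e", ".[pandas]")
    session.run("pytest", "--cov=google.cloud.bigquery", "tests/unit")


class RecordingSession(object):
    """Stand-in for a nox session that records calls instead of running them."""

    def __init__(self):
        self.calls = []

    def install(self, *args):
        self.calls.append(("install",) + args)

    def run(self, *args):
        self.calls.append(("run",) + args)


session = RecordingSession()
cover(session)
```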

bigquery/google/cloud/bigquery/job.py

+    def to_dataframe(self):
+        """Create a pandas DataFrame from the query results.
+
+        :rtype: ``pandas.DataFrame``


bigquery/setup.py

@@ -58,6 +58,10 @@
    'requests >= 2.18.0',
 ]

+EXTRAS_REQUIREMENTS = {
+    'pandas': ['pandas >= 0.3.0'],


dhermes · 2017-11-10T21:56:30Z

So @alixhami, after discussion with @jonparrott we have decided it is "preferred" if all new docstrings are in the Google style (vs. traditional Sphinx).

I'm happy to help out / chat over hangouts / send you some code snippets.

Can you change the newly added docstring to Google style?
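For comparison, here is the same docstring written both ways. The helper functions are hypothetical and exist only to carry the docstrings; the Google-style section names (`Args`, `Returns`, `Raises`) are the standard ones.

```python
def column_names_sphinx(schema):
    """Return the column names of a schema.

    :type schema: list of str
    :param schema: Field names, in column order.

    :rtype: list of str
    :returns: A copy of the field names.
    """
    return list(schema)


def column_names_google(schema):
    """Return the column names of a schema.

    Args:
        schema (List[str]): Field names, in column order.

    Returns:
        List[str]: A copy of the field names.

    Raises:
        TypeError: If ``schema`` is not iterable.
    """
    return list(schema)
```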

dhermes · 2017-11-10T22:53:48Z

LGTM, but you need to resolve the coverage issues

bigquery/google/cloud/bigquery/job.py

@@ -1965,7 +1965,7 @@ def to_dataframe(self):
            ValueError: If the `pandas` library cannot be imported.

        """
-        if pandas is None:
+        if pandas is None:  # pragma: NO COVER


dhermes · 2017-11-13T17:35:08Z

@alixhami This is good to merge once CI goes green, right?

alixhami · 2017-11-13T17:43:54Z

@dhermes it should be good to merge since all the comments have now been addressed. @tswast what do you think?

dhermes · 2017-11-13T17:54:39Z

lint failed 😀

tswast

LGTM, though I think it could be a little more general. I'd love it if this were a method of the iterator we return from QueryJob.result() and Client.list_rows().
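A rough sketch of that shape follows: an iterator that carries its schema and offers `to_dataframe()` itself, so any caller holding the iterator can convert it. The class body, the `Field` stand-in, the `frame_args` helper, and the sample rows are assumptions for illustration, not the code that was merged.

```python
import collections

# Hypothetical stand-in for a BigQuery schema field; only ``name`` is used.
Field = collections.namedtuple("Field", ["name"])


class RowIterator(object):
    """Iterates over rows and knows the schema they follow."""

    def __init__(self, schema, rows):
        self.schema = schema
        self._rows = list(rows)

    def __iter__(self):
        return iter(self._rows)

    def frame_args(self):
        """Return (rows, column_headers) in the shape pandas.DataFrame takes."""
        column_headers = [field.name for field in self.schema]
        return list(self), column_headers

    def to_dataframe(self):
        """Create a pandas DataFrame from the rows (requires pandas)."""
        import pandas  # imported lazily so iteration works without pandas

        rows, column_headers = self.frame_args()
        return pandas.DataFrame(rows, columns=column_headers)


iterator = RowIterator(
    [Field("name"), Field("age")],
    [("Ada", 36), ("Grace", 45)],
)
rows, headers = iterator.frame_args()
```

With the conversion living on the iterator, a query job would only need a thin helper that delegates to its result iterator.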

bigquery/google/cloud/bigquery/job.py

@@ -1949,6 +1953,29 @@ def result(self, timeout=None, retry=DEFAULT_RETRY):
        return self._client.list_rows(dest_table, selected_fields=schema,
                                      retry=retry)

+    def to_dataframe(self):
+        """Create a pandas DataFrame from the query results.


bigquery/google/cloud/bigquery/job.py

+        column_headers = [field.name for field in query_results.schema]
+        rows = [row.values() for row in query_results]
+
+        return pandas.DataFrame(rows, columns=column_headers)


theacodes · 2017-12-04T22:04:27Z

This LGTM, @tseaver @dhermes did y'all have any reservations before merging this?

dhermes · 2017-12-04T23:38:58Z

@alixhami This is failing on Python 2.7; the distinction is between `str` and `unicode`.

alixhami added the api: bigquery and do not merge labels Nov 7, 2017

alixhami requested review from tseaver, tswast, theacodes, dhermes and jba November 7, 2017 20:36

alixhami requested a review from lukesneeringer as a code owner November 7, 2017 20:36

googlebot added the cla: yes label Nov 7, 2017

dhermes reviewed Nov 7, 2017

alixhami force-pushed the pandas-integration branch from 7b4d967 to edecb93 Compare November 7, 2017 21:37

tseaver suggested changes Nov 7, 2017

theacodes reviewed Nov 9, 2017

alixhami force-pushed the pandas-integration branch from edecb93 to 52a5e5f Compare November 10, 2017 18:08

dhermes reviewed Nov 10, 2017

dhermes reviewed Nov 11, 2017

alixhami removed the do not merge label Nov 13, 2017

tswast reviewed Nov 13, 2017

tswast mentioned this pull request Nov 14, 2017

ENH: Convert read_gbq() function to use google-cloud-python googleapis/python-bigquery-pandas#25

Merged

tswast reviewed Nov 14, 2017

alixhami force-pushed the pandas-integration branch 2 times, most recently from a00c9da to 1458da6 Compare November 15, 2017 02:29

alixhami added 16 commits November 22, 2017 12:27

imports pandas at module level and raises exception in to_dataframe() if it isn't installed

2b8ca85

adds pandas as extra for installation

5c52dc6

updates docstring to google style

484ab91

adds pandas extra to nox environment

4db3f4b

adds 'no cover' pragma for pandas import errors

a31e79d

adds test for when pandas is None

03b7fd5

fixes lint error

0c7bf88

adds RowIterator class

84994a7

moves to_dataframe() to RowIterator

04f76f5

adds test for pandas handling of basic BigQuery data types

4fd0cc0

moves schema to RowIterator constructor

321b56a

adds tests for column dtypes

da52040

adds test for query results to_dataframe() with nested schema

83d9e3c

updates system test for to_dataframe to check types

10fcd7c

adds to_dataframe() helper to QueryJob

6762f95

updates pandas version to latest version that passes unit tests

0802ca8

alixhami force-pushed the pandas-integration branch from d345624 to 0802ca8 Compare November 22, 2017 20:34

alixhami changed the title ~~BigQuery: Adds QueryJob.to_dataframe()~~ BigQuery: Get query results as pandas DataFrame Nov 27, 2017

theacodes approved these changes Dec 4, 2017

tseaver approved these changes Dec 4, 2017

theacodes merged commit 3511e87 into googleapis:master Dec 4, 2017

alixhami deleted the pandas-integration branch December 4, 2017 22:45

alixhami added a commit that referenced this pull request Dec 4, 2017

Revert "BigQuery: Add ability to get query results as a Pandas dataframe. (#4354)" This reverts commit 3511e87.

19635ba

This was referenced Dec 4, 2017

Revert "BigQuery: Get query results as pandas DataFrame" #4527

Closed

BigQuery: fixes python 2.7 system test error caused by PR 4354 #4528

Merged

tseaver mentioned this pull request Jan 4, 2018

Prep bigquery-0.29.0 release. #4690

Merged

tswast mentioned this pull request Mar 23, 2018

Add monitoring missing dependency: pandas #5100

Closed

parthea pushed a commit that referenced this pull request Sep 22, 2023

testing(videointelligence): retry harder upon 409s [(#4361)](GoogleCloudPlatform/python-docs-samples#4361) fixes #4354

07cf916
