document pandas-gbq vision and roadmap #149

max-sixty · 2018-03-13T21:02:23Z

Both pandas-gbq and google-cloud-bigquery are doing many of the same things, and increasingly so (e.g. .to_dataframe() in google-cloud-bigquery).

Are there different use cases? Can we define those?
Should we focus development on one and wrap the other? Even if not wholly, for a subset of functionality?
Is there some direction from Google? @tswast spends a lot of time on both libraries so he is probably best placed to offer guidance

tswast · 2018-03-13T21:15:52Z

Yeah, there's overlap. There are some things that pandas-gbq handles well that google-cloud-bigquery doesn't handle well (or at all).

Not supported at all in google-cloud-bigquery:

Uploading a dataframe to a BigQuery table. I would consider this part of the feature request at Add functions equivalent to create_rows and create_rows_json that create a table for you using a load job google-cloud-python#4553
Built-in user-based authentication (3-legged OAuth, 3LO). Based on conversations with @jonparrott, I think it's unlikely that google-cloud-bigquery will ever support this. It would be inconsistent with other products, even though I think it would make sense for BigQuery.

Not supported well in google-cloud-bigquery:

Dataframes with multiple indexes.
Probably missing some of the bug fixes that have gone into pandas-gbq.

In google-cloud-bigquery but not in pandas-gbq:

IPython magics (just added in a recent PR, but not released).

The future? My thought is that more and more pandas-gbq becomes a thin wrapper over google-cloud-python. For example,

Make pandas-gbq use to_dataframe(). (I tried this but got many failing system tests due to problems with multiple index dataframes.)
Make pandas-gbq automatically register the IPython magic if available.

My thought is that we'd always have pandas-gbq specifically for the user authentication piece. If I can come up with a user auth proposal that would be acceptable for google-cloud-bigquery, my opinion would change, but right now that seems unlikely.

tswast · 2018-06-26T20:55:12Z

Some updates:

I filed #175 and #174 while investigating the to_dataframe() function. Both of these issues would be solved by moving to google-cloud-bigquery's to_dataframe() method.

Also, @alixhami recently added a load_table_from_dataframe method to google-cloud-bigquery. It uses the Parquet file format to do loads, so it supports struct & array types at least as well as plain Pandas does (which I understand is not great, since it falls back to Python objects). I believe that by using this method for to_gbq, some issues such as #159 would be solved.
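To illustrate the "falls back to Python objects" remark, here is a minimal sketch using only pandas. The column names and values are made up for illustration; they only imitate what ARRAY and STRUCT columns might look like once they land in a DataFrame.

```python
import pandas as pd

# Hypothetical rows resembling a table with an ARRAY-like and a STRUCT-like column.
df = pd.DataFrame(
    {
        "id": [1, 2],
        "tags": [["a", "b"], ["c"]],              # list values (ARRAY-like)
        "owner": [{"name": "x"}, {"name": "y"}],  # dict values (STRUCT-like)
    }
)

# A default DataFrame has no dedicated list or struct dtype for these values,
# so both nested columns are stored as generic Python objects.
nested_dtypes = df.dtypes[["tags", "owner"]]
print(nested_dtypes)
```

Because the nested columns are plain object columns, anything that serializes the frame has to inspect the values themselves to work out the element types.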

max-sixty · 2018-06-26T21:54:07Z

> It uses the Parquet file format to do loads

Great! And that would be faster and smaller too.

Are there any compatibility issues with using parquet? Is it OK for windows users?

I'd be up for making the change to just defer to that method.

(though, tangentially, I still think supporting nested structures is going to set bad expectations; coming from someone who uses structs & arrays in BQ a lot)

tswast · 2018-06-26T21:56:11Z

The only caveat for Windows users is that they can't use Python 2.7; they have to use Python 3, because it uses PyArrow under the covers. googleapis/google-cloud-python#5441 (comment)

max-sixty · 2018-06-27T01:14:42Z

OK cool. I guess we could leave the old implementation in there and provide a fallback option until the end of the year

tswast · 2018-06-27T01:17:47Z

Ah, TIL Pandas is dropping support for 2.7 at the end of the year. Thanks for pointing that out.

tswast · 2019-01-04T20:29:11Z

To make this task more concrete, I'd like to propose the two following sub-tasks:

- read_gbq calls google-cloud-bigquery's to_dataframe under the covers. Now that pandas-gbq uses the same logic as pandas for null handling, I don't expect any change in behavior.
  - I don't know how we'd implement a progress bar for downloading the dataframe. We may want to upstream the progress bar features (using tqdm) to the google-cloud-bigquery library or add some sort of hook so that we can show a progress bar.
- to_gbq calls google-cloud-bigquery's load_table_from_dataframe. load_table_from_dataframe uses Parquet rather than CSV but is otherwise quite similar. It may work better with struct and array columns.
  - Logic for overriding the schema will be trickier, as the schema is actually defined in the Parquet file.
  - Object columns (and thus nullable types) are not supported by to_parquet in pandas. "Non supported types [for pandas's to_parquet] include Period and actual Python object types."
  - Perhaps we want to wait on implementing the to_gbq logic until we have a better way to handle nullable columns in load_table_from_dataframe?

With the exception of schema overriding, I think it should be possible to implement these subtasks without changing the public interface of pandas-gbq.
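A rough sketch of what the two sub-tasks could look like. The function names `read_gbq_sketch` and `to_gbq_sketch` are hypothetical and are not the pandas-gbq implementation; `client` is assumed to be a `google.cloud.bigquery.Client` (or anything with the same `query` and `load_table_from_dataframe` methods). Authentication, progress bars, and schema overrides are left out.

```python
def read_gbq_sketch(query, client):
    """Hypothetical read path: run the query and delegate to to_dataframe()."""
    # client.query() starts a query job; result() waits for it to finish and
    # returns a row iterator, which provides to_dataframe().
    rows = client.query(query).result()
    return rows.to_dataframe()


def to_gbq_sketch(dataframe, destination, client, job_config=None):
    """Hypothetical write path: delegate to load_table_from_dataframe()."""
    # load_table_from_dataframe() starts a load job for the given DataFrame.
    load_job = client.load_table_from_dataframe(
        dataframe, destination, job_config=job_config
    )
    load_job.result()  # block until the load job completes
    return load_job
```

Keeping the client as a parameter means the public read_gbq and to_gbq signatures would not need to change; the wrapper would build the client from whatever credentials it already resolves.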

max-sixty · 2019-01-04T22:36:43Z

I think the to_gbq in google-cloud-bigquery is unambiguously better. I'll check, but I don't think the types issues are material. Agree on our nice progress bar though!

I had thought our implementation of read_gbq was a bit faster and solved some edge cases, but that may not be correct / may be out of date?

tswast · 2019-01-07T14:57:33Z

Performance-wise, I don't see a difference at the moment. Both create a DataFrame from an iterable of all rows in the result set.

google-cloud-bigquery

https://github.com/googleapis/google-cloud-python/blob/b70281ecdc424ebee5869253da8fbc97ec21fc03/bigquery/google/cloud/bigquery/table.py#L1318-L1323

pandas-gbq

https://github.com/pydata/pandas-gbq/blob/ecc695f29298a405c375976a5579ad8c93666785/pandas_gbq/gbq.py#L668

I think previously pandas-gbq created a DataFrame for each page and concatted them together. Maybe that was faster?
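For reference, the two strategies can be sketched with plain pandas. The `pages` data below is made up and stands in for a paged result set; this is not the code from either library.

```python
import itertools

import pandas as pd

# Hypothetical paged result set: each page is a list of row tuples.
columns = ["name", "n"]
pages = [
    [("a", 1), ("b", 2)],
    [("c", 3)],
]

# Strategy 1: build a single DataFrame from an iterable of all rows.
all_rows = itertools.chain.from_iterable(pages)
df_all = pd.DataFrame(list(all_rows), columns=columns)

# Strategy 2: build one DataFrame per page, then concatenate them.
df_paged = pd.concat(
    [pd.DataFrame(page, columns=columns) for page in pages],
    ignore_index=True,
)

print(df_all.equals(df_paged))
```

Both produce the same frame, so any difference would be in speed or peak memory, which would need to be measured on realistic result sizes.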

pandas-gbq does some mapping to dtypes based on the schema.

https://github.com/pydata/pandas-gbq/blob/ecc695f29298a405c375976a5579ad8c93666785/pandas_gbq/gbq.py#L671-L680

I'm not sure how necessary this is.
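A minimal sketch of schema-based dtype mapping, to make the idea concrete. The mapping table and the `apply_schema_dtypes` helper are hypothetical, as is the simplified schema format (a list of dicts with `name` and `type`); the linked pandas-gbq code may map different types.

```python
import datetime

import pandas as pd

# Hypothetical mapping from BigQuery column types to pandas dtypes.
DTYPE_MAP = {"FLOAT": "float64", "TIMESTAMP": "datetime64[ns]"}


def apply_schema_dtypes(df, schema_fields):
    """Return a copy of df with columns cast according to DTYPE_MAP."""
    out = df.copy()
    for field in schema_fields:
        dtype = DTYPE_MAP.get(field["type"].upper())
        if dtype is not None:
            out[field["name"]] = out[field["name"]].astype(dtype)
    return out


# Columns start out as generic objects, as they would when built from raw rows.
raw = pd.DataFrame(
    {
        "score": pd.Series([1.5, 2.0], dtype=object),
        "created": pd.Series(
            [datetime.datetime(2019, 1, 7), datetime.datetime(2019, 1, 8)],
            dtype=object,
        ),
        "label": pd.Series(["a", "b"], dtype=object),
    }
)
schema = [
    {"name": "score", "type": "FLOAT"},
    {"name": "created", "type": "TIMESTAMP"},
    {"name": "label", "type": "STRING"},
]
typed = apply_schema_dtypes(raw, schema)
print(typed.dtypes)
```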

pandas-gbq also has support for setting an index column,

https://github.com/pydata/pandas-gbq/blob/ecc695f29298a405c375976a5579ad8c93666785/pandas_gbq/gbq.py#L842

and also reordering the columns,

https://github.com/pydata/pandas-gbq/blob/ecc695f29298a405c375976a5579ad8c93666785/pandas_gbq/gbq.py#L853

We could keep this logic in pandas-gbq for now and maybe upstream it to google-cloud-python.
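The index and column-order handling is small enough to sketch as a post-processing step applied to whatever DataFrame the download returns. `finalize_frame` is a hypothetical helper; the parameter names mirror read_gbq's `index_col` and `col_order`, but the validation here is simplified.

```python
import pandas as pd


def finalize_frame(df, index_col=None, col_order=None):
    """Optionally set an index column, then reorder the remaining columns."""
    if index_col is not None:
        if index_col not in df.columns:
            raise ValueError(f"index column {index_col!r} not found in result")
        df = df.set_index(index_col)
    if col_order is not None:
        # Require col_order to name exactly the remaining columns.
        if sorted(col_order) != sorted(df.columns):
            raise ValueError("col_order does not match the result columns")
        df = df[col_order]
    return df


result = pd.DataFrame({"id": [1, 2], "b": [3, 4], "a": [5, 6]})
final = finalize_frame(result, index_col="id", col_order=["a", "b"])
print(final)
```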

tswast · 2019-02-16T00:42:13Z

googleapis/google-cloud-python#7370 points out that the BQ schema logic in pandas-gbq's to_gbq is different from what google-cloud-bigquery does. I think we'll want googleapis/google-cloud-python#7370 to be implemented (allow setting an explicit schema in load_table_from_dataframe) before moving over to use that method.

tswast · 2020-12-07T20:46:16Z

Finally added CSV support to google-cloud-bigquery in https://github.com/googleapis/python-bigquery/releases/tag/v2.6.0. This should allow us to start using load_table_from_dataframe without having to change the default serialization format.

I imagine we'll want to support older versions of google-cloud-bigquery for a while, so we should keep the existing CSV serialization logic around to fall back to in those cases.
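One way the fallback could be decided is with a simple version gate. This is only a sketch: `parse_version` and `choose_load_path` are hypothetical helpers, the returned strings are placeholder labels, and the only fact taken from above is that CSV support arrived in v2.6.0. In practice the installed version string would presumably come from the library itself (for example `google.cloud.bigquery.__version__`).

```python
import itertools

# CSV support in load_table_from_dataframe arrived in v2.6.0 (per the release above).
CSV_LOAD_MIN_VERSION = (2, 6, 0)


def parse_version(version):
    """Parse the leading 'X.Y.Z' numeric components into a tuple of ints."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def choose_load_path(installed_version):
    """Pick the upload strategy based on the installed library version."""
    if parse_version(installed_version) >= CSV_LOAD_MIN_VERSION:
        return "load_table_from_dataframe"
    return "legacy_csv"


print(choose_load_path("2.6.0"))
print(choose_load_path("1.28.0"))
```

Comparing integer tuples avoids the string-comparison trap where "2.10.0" would sort before "2.6.0".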

tswast · 2021-07-19T14:55:31Z

In the interest of not keeping issues open forever, I'm going to treat this issue as a request to document the project vision/roadmap. That should be useful for contributors and also for understanding the purpose of this project compared to using the pandas connector in google-cloud-bigquery directly.

tswast mentioned this issue Jun 26, 2018

Add anchor links to versions in the changelog #191

Merged

tswast mentioned this issue Jan 4, 2019

Discussion: how to handle the new Int64 (nullable integer) dtype with pandas 0.24.0 #242

Closed

tswast mentioned this issue Jan 25, 2019

CLN: Use to_dataframe to download query results. #247

Merged

tswast added the type: process A process-related concern. May include testing, release, or the like. label Nov 6, 2020

tswast mentioned this issue Nov 6, 2020

refactor to use more logic from google-cloud-bigquery #339

Closed

product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Jul 17, 2021

tswast changed the title ~~Coordination with google-cloud-bigquery?~~ document pandas-gbq vision and roadmap Jul 19, 2021

tswast self-assigned this Jul 19, 2021

tswast mentioned this issue Aug 20, 2021

migration to googleapis org #367

Closed

tswast mentioned this issue Feb 24, 2022

fix: avoid TypeError when executing DML statements with read_gbq #483

Merged

tswast mentioned this issue Mar 28, 2022

chore: add ROADMAP document describing the purpose of the package #505

Merged

parthea closed this as completed in #505 Apr 4, 2022
