document pandas-gbq vision and roadmap #149

Closed
max-sixty opened this issue Mar 13, 2018 · 12 comments · Fixed by #505
Labels
api: bigquery Issues related to the googleapis/python-bigquery-pandas API. type: process A process-related concern. May include testing, release, or the like.

Comments

@max-sixty
Contributor

Both pandas-gbq and google-cloud-bigquery are doing many of the same things, and increasingly so (e.g. .to_dataframe() in google-cloud-bigquery)

  • Are there different use cases? Can we define those?
  • Should we focus development on one and wrap the other? Even if not wholly, for a subset of functionality?
  • Is there some direction from Google? @tswast spends a lot of time on both libraries so he is probably best placed to offer guidance
@tswast
Collaborator

tswast commented Mar 13, 2018

Yeah, there's overlap. There are some things that pandas-gbq handles well that google-cloud-bigquery doesn't handle well (or at all).

Not supported at all in google-cloud-bigquery:

Not supported well in google-cloud-bigquery:

  • Dataframes with multiple indexes.
  • Probably missing some of the bug fixes that have gone into pandas-gbq.

In google-cloud-bigquery but not in pandas-gbq:

  • IPython magics (just added in a recent PR, but not released).

The future? My thought is that more and more pandas-gbq becomes a thin wrapper over google-cloud-python. For example,

  • Make pandas-gbq use to_dataframe(). (I tried this but got many failing system tests due to problems with multiple index dataframes.)
  • Make pandas-gbq automatically register the IPython magic if available.

My thought is that we'd always have pandas-gbq specifically for the user authentication piece. If I can come up with a user auth proposal that would be acceptable for google-cloud-bigquery, my opinion would change, but right now that seems unlikely.

@tswast
Collaborator

tswast commented Jun 26, 2018

Some updates:

I filed #175 and #174 while investigating the to_dataframe() function. Both of these issues would be solved by moving to google-cloud-bigquery's to_dataframe() method.

Also, @alixhami recently added a load_table_from_dataframe method to google-cloud-bigquery. It uses the Parquet file format to do loads, so it supports struct & array types at least as well as plain Pandas does (which I understand is not great, since it falls back to Python objects). I believe that by using this method for to_gbq, some issues such as #159 would be solved.

@max-sixty
Contributor Author

It uses the Parquet file format to do loads

Great! and that would be faster and smaller too.

Are there any compatibility issues with using parquet? Is it OK for windows users?

I'd be up for making the change to just defer to that method.

(though, tangentially, I still think supporting nested structures is going to set bad expectations; coming from someone who uses structs & arrays in BQ a lot)

@tswast
Collaborator

tswast commented Jun 26, 2018

The only caveat for Windows users is that they can't use Python 2.7; they have to use Python 3, because it uses PyArrow under the covers. googleapis/google-cloud-python#5441 (comment)

@max-sixty
Contributor Author

OK cool. I guess we could leave the old implementation in there and provide a fallback option until the end of the year

@tswast
Collaborator

tswast commented Jun 27, 2018

Ah, TIL Pandas is dropping support for 2.7 at the end of the year. Thanks for pointing that out.

@tswast
Collaborator

tswast commented Jan 4, 2019

To make this task more concrete, I'd like to propose the two following sub-tasks:

  • read_gbq calls google-cloud-bigquery's to_dataframe under the covers. Now that pandas-gbq uses the same logic as pandas for null handling, I don't expect any change in behavior.
    • I don't know how we'd implement a progress bar for downloading the dataframe. We may want to upstream the progress bar features (using tqdm) to google-cloud-bigquery library or add some sort of hook so that we can show progress bar.
  • to_gbq calls google-cloud-bigquery's load_table_from_dataframe. load_table_from_dataframe uses Parquet rather than CSV but is otherwise quite similar. It may work better with struct and array columns.

With the exception of schema overriding, I think it should be possible to implement these subtasks without changing the public interface of pandas-gbq.
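A minimal sketch of what the two sub-tasks could look like, assuming a client object that behaves like google.cloud.bigquery.Client. Client.query, QueryJob.to_dataframe, and Client.load_table_from_dataframe are google-cloud-bigquery methods; the wrapper names read_gbq_sketch and to_gbq_sketch are hypothetical, and this is not the actual pandas-gbq implementation (it ignores authentication, progress bars, and schema overriding).

```python
def read_gbq_sketch(client, query):
    """Run `query` and return the result as a DataFrame.

    `client` is assumed to behave like google.cloud.bigquery.Client.
    """
    query_job = client.query(query)   # start the query job
    return query_job.to_dataframe()   # wait for the job and download the rows


def to_gbq_sketch(client, dataframe, destination_table):
    """Load `dataframe` into `destination_table` and wait for completion."""
    load_job = client.load_table_from_dataframe(dataframe, destination_table)
    load_job.result()                 # block until the load job finishes
    return load_job
```

Because the client is passed in, the public read_gbq / to_gbq signatures would not need to change; only the body behind them would.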

@max-sixty
Contributor Author

max-sixty commented Jan 4, 2019

I think the to_gbq in google-cloud-bigquery is unambiguously better. I'll check, but I don't think the types issues are material. Agree on our nice progress bar though!

I had thought our implementation of read_gbq was a bit faster and solved some edge cases, but that may not be correct / may be out of date?

@tswast
Collaborator

tswast commented Jan 7, 2019

Performance-wise, I don't see a difference at the moment. Both create a DataFrame from an iterable of all rows in the result set.

google-cloud-bigquery

https://github.com/googleapis/google-cloud-python/blob/b70281ecdc424ebee5869253da8fbc97ec21fc03/bigquery/google/cloud/bigquery/table.py#L1318-L1323

pandas-gbq

https://github.com/pydata/pandas-gbq/blob/ecc695f29298a405c375976a5579ad8c93666785/pandas_gbq/gbq.py#L668

I think previously pandas-gbq created a DataFrame for each page and concatted them together. Maybe that was faster?
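To make the two construction strategies concrete, here is a small sketch with made-up data. The pages list and column names are hypothetical stand-ins for paged query results, not code from either library. Both strategies produce the same DataFrame; which one is faster would need to be measured.

```python
import pandas as pd

# Hypothetical result pages: each page is a list of row tuples.
pages = [
    [("a", 1), ("b", 2)],
    [("c", 3)],
]
columns = ["name", "value"]

# Strategy 1: build one DataFrame from all rows across all pages.
all_rows = [row for page in pages for row in page]
df_all = pd.DataFrame(all_rows, columns=columns)

# Strategy 2: build one DataFrame per page, then concatenate them.
page_frames = [pd.DataFrame(page, columns=columns) for page in pages]
df_paged = pd.concat(page_frames, ignore_index=True)
```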

pandas-gbq does some mapping to dtypes based on the schema.

https://github.com/pydata/pandas-gbq/blob/ecc695f29298a405c375976a5579ad8c93666785/pandas_gbq/gbq.py#L671-L680

I'm not sure how necessary this is.

pandas-gbq also has support for setting an index column,

https://github.com/pydata/pandas-gbq/blob/ecc695f29298a405c375976a5579ad8c93666785/pandas_gbq/gbq.py#L842

and also reordering the columns,

https://github.com/pydata/pandas-gbq/blob/ecc695f29298a405c375976a5579ad8c93666785/pandas_gbq/gbq.py#L853

We could keep this logic in pandas-gbq for now and maybe upstream it to google-cloud-python.
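As a rough illustration of the index and column-order handling that could stay in pandas-gbq, here is a sketch of a post-processing step applied to whatever DataFrame to_dataframe() returns. The function name apply_index_and_order is hypothetical, and ValueError stands in for whatever exception types the library raises.

```python
import pandas as pd


def apply_index_and_order(df, index_col=None, col_order=None):
    """Set an index column and reorder columns on a query result."""
    if index_col is not None:
        if index_col not in df.columns:
            raise ValueError(f"Index column {index_col!r} not found in result")
        df = df.set_index(index_col)
    if col_order is not None:
        # The requested order must name exactly the remaining columns.
        if sorted(col_order) != sorted(df.columns):
            raise ValueError("Column order does not match the result columns")
        df = df[col_order]
    return df
```

Since this only touches the finished DataFrame, it does not depend on how the rows were downloaded.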

@tswast
Collaborator

tswast commented Feb 16, 2019

googleapis/google-cloud-python#7370 points out that the BQ schema logic in pandas-gbq's to_gbq is different from what google-cloud-bigquery does. I think we'll want googleapis/google-cloud-python#7370 to be implemented (allow setting an explicit schema in load_table_from_dataframe) before moving over to use that method.

@tswast tswast added the type: process A process-related concern. May include testing, release, or the like. label Nov 6, 2020
@tswast
Collaborator

tswast commented Dec 7, 2020

Finally added CSV support to google-cloud-bigquery in v2.6.0 (https://github.com/googleapis/python-bigquery/releases/tag/v2.6.0). This should allow us to start using load_table_from_dataframe without having to change the default serialization format.

I imagine we'll want to support older versions of google-cloud-bigquery for a while, so we should keep the existing CSV serialization logic around to fall back to in those cases.
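One way the fallback could be decided is a version check, sketched below. The 2.6.0 threshold comes from the release mentioned above; the function names and the returned strategy labels are hypothetical. In practice the version string would presumably come from google.cloud.bigquery.__version__.

```python
CSV_LOAD_MIN_VERSION = (2, 6, 0)


def parse_version(version):
    """Turn "2.6.0" into (2, 6, 0), ignoring suffixes such as "rc1"."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:  # pad "2.6" to (2, 6, 0)
        parts.append(0)
    return tuple(parts)


def choose_load_strategy(bigquery_version):
    """Pick the upload path based on the installed library version."""
    if parse_version(bigquery_version) >= CSV_LOAD_MIN_VERSION:
        return "load_table_from_dataframe"
    return "legacy_csv"  # keep the existing CSV serialization logic
```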

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Jul 17, 2021
@tswast tswast changed the title Coordination with google-cloud-bigquery? document pandas-gbq vision and roadmap Jul 19, 2021
@tswast tswast self-assigned this Jul 19, 2021
@tswast
Collaborator

tswast commented Jul 19, 2021

In the interest of not keeping issues open forever, I'm going to treat this issue as a request to document the project vision/roadmap. That should be useful for contributors and also for understanding the purpose of this project compared to using the pandas connector in google-cloud-bigquery directly.
