BigQuery: Load to table from dataframe without index #5572
Details for how to do this are here: googleapis/python-bigquery-pandas#133 (comment). More than happy for this to be implemented here rather than in pandas-gbq.
Any updates on this being implemented for the BigQuery API client, as opposed to the pandas-gbq module?
@sungchun12 I just tried the solution @max-sixty posted above with the BigQuery client API and it worked fine. Load the job configuration and override the schema as suggested.
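A minimal sketch of that workaround, assuming a recent google-cloud-bigquery version and placeholder table/column names; the explicit schema lists only the DataFrame columns, so the index is left out:

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

# Example DataFrame; its default RangeIndex is what we want to keep out of the table.
df = pd.DataFrame({"name": ["alpha", "beta"], "value": [1, 2]})

# Override the schema so only the real columns are listed.
job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("value", "INTEGER"),
]

job = client.load_table_from_dataframe(
    df, "my-project.my_dataset.my_table", job_config=job_config
)
job.result()  # Wait for the load job to finish.
```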
@mikeymezher, thanks for getting back to me! I'll try it out and let you know. Do you know if one is more performant than the other, in your hands-on experience?
I haven't tested, but anecdotally I've noticed pandas-gbq to be faster than the client library. There are cases where the client library is needed, though, such as writing to partitioned tables.
They are implementation details, but pandas-gbq uses CSV whereas google-cloud-bigquery uses parquet as the serialization format. The reason for this is to support STRUCT / ARRAY BigQuery columns (though these aren't supported in pandas, anyway). Implementation-wise, I just noticed pandas provides a way to override the parquet engine's default behavior with an `index` argument to `DataFrame.to_parquet()`.
From https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-parquet
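A short sketch of that pandas behavior (file names are placeholders; a parquet engine such as pyarrow must be installed):

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=pd.Index(["a", "b", "c"], name="id"),
)

# Default (index=None): the engine decides; named indexes are written as columns.
df.to_parquet("with_index.parquet")

# index=False: the index is dropped from the serialized file entirely,
# which is the behavior this issue asks load_table_from_dataframe to expose.
df.to_parquet("without_index.parquet", index=False)
```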
@tswast, I would LOVE `index=False` functionality in the google-cloud-bigquery package. It would let me remove pandas-gbq imports and have a consistent API for working with BigQuery, at least for my use cases, since I would no longer have to build override schema configurations.
@tswast Has this feature been implemented yet?
@cavindsouza Not yet. Right now you can avoid writing indexes by passing in an explicit schema (on the job config) that omits the index columns.
FYI: #9064 and #9049 are changing the index behavior, as a schema will be automatically populated in more cases now. We might actually have a need to explicitly add indexes to the table. Currently, it's inconsistent when an index will be added and when not; it depends on whether the schema is populated and which parquet engine is used to serialize the DataFrame.

Preferred option: Check if the index name (or names, for a multi-index) is present in the supplied schema, and only serialize the index when it is. Edge case: what if an index name matches that of a column name? Prefer serializing the column; don't add the index in this case.

Alternative: Add an explicit argument for including indexes. This makes it explicit when to include indexes, and would allow the index dtype to be used to automatically determine the schema in some cases. When the index dtype is object, we'll need to add the index to the schema explicitly.
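A hypothetical helper, purely to illustrate the preferred option above; the function name and signature are invented for this sketch and are not part of the library:

```python
from typing import List

import pandas as pd
from google.cloud import bigquery


def indexes_to_serialize(
    dataframe: pd.DataFrame, schema: List[bigquery.SchemaField]
) -> List[str]:
    """Return the index names to write, per the 'preferred option' above.

    An index is serialized only when it is named, its name appears in the
    supplied schema, and it does not collide with a real column name
    (columns win in the edge case).
    """
    schema_names = {field.name for field in schema}
    column_names = set(dataframe.columns)
    return [
        name
        for name in dataframe.index.names
        if name is not None
        and name in schema_names
        and name not in column_names
    ]
```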
Follow-up from #9064. When this feature is added:
Once this feature is released, do the following to omit indexes:
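The snippet that followed is not preserved in this thread; a minimal sketch under the preferred-option behavior described above (placeholder table ID), where a default, unnamed index is simply not written:

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder table ID

df = pd.DataFrame({"name": ["alpha", "beta"], "value": [1, 2]})

# With a default, unnamed RangeIndex (or after df.reset_index(drop=True)),
# no index column is written to the destination table.
job = client.load_table_from_dataframe(df, table_id)
job.result()  # Wait for the load job to finish.
```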
To include indexes:
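Again, the original snippet is missing; a hedged sketch in which the index is named and listed in the explicit schema so that it is serialized alongside the regular columns:

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder table ID

# Give the index a name and include it in the schema.
df = pd.DataFrame(
    {"name": ["alpha", "beta"], "value": [1, 2]},
    index=pd.Index(["r1", "r2"], name="row_id"),
)

job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField("row_id", "STRING"),
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("value", "INTEGER"),
]

job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()  # Wait for the load job to finish.
```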
Code sample: google-cloud-python/bigquery/samples/load_table_dataframe.py, lines 18 to 71 at a6ed945.
client.load_table_from_dataframe() results in the DataFrame index being loaded into the BigQuery table.
Can the capability to load data to a table from a DataFrame without also loading the index be implemented?