fix(bigquery): use pyarrow fallback for improved schema detection #9321
Conversation
@@ -110,8 +110,13 @@ def pyarrow_timestamp():
    "TIME": pyarrow_time,
    "TIMESTAMP": pyarrow_timestamp,
}
ARROW_SCALARS_TO_BQ = {
    arrow_type(): bq_type  # TODO: explain why we call arrow_type()
I don't think we're supposed to use Arrow type objects as dictionary keys. The factory functions all describe instances of data types (https://arrow.apache.org/docs/python/api/datatypes.html#factory-functions), so I don't think we're guaranteed on equality / consistent hash between two Arrow data types.
That said, I'm not sure what alternatives we have. We could have a function with a whole bunch of if statements calling the type checking functions: https://arrow.apache.org/docs/python/api/datatypes.html#type-checking but that sounds dirty and slow (though more guaranteed to be correct). For pandas, I use the string name of the dtype, but I'm not seeing an equivalent for Arrow.
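For comparison, the type-checking route might look something like the sketch below. The predicates come from `pyarrow.types`; the BigQuery type strings chosen here are assumptions, not necessarily what this PR settles on.

```python
# A sketch of the "bunch of if statements" alternative using pyarrow's
# type-checking predicates instead of type objects as dictionary keys.
import pyarrow.types


def bq_type_for_arrow(arrow_type):
    """Return a BigQuery type name for an Arrow data type, or None."""
    if pyarrow.types.is_boolean(arrow_type):
        return "BOOL"
    if pyarrow.types.is_integer(arrow_type):
        return "INT64"
    if pyarrow.types.is_floating(arrow_type):
        return "FLOAT64"
    if pyarrow.types.is_string(arrow_type):
        return "STRING"
    if pyarrow.types.is_date(arrow_type):
        return "DATE"
    if pyarrow.types.is_timestamp(arrow_type):
        return "TIMESTAMP"
    return None
```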
Each pyarrow type has a numeric ID that we can use instead. It's true that we don't know if these IDs are going to stay the same forever, but the map is built dynamically, and the IDs are consistent within a particular pyarrow version.
What do you think?
Edit: Well, for integers, for example, we assume pyarrow.int64, but an integer in Arrow could also be of another type such as pyarrow.int32, and we need to map that to BigQuery's INT64, too.
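A minimal sketch of that ID-keyed map; the set of types mapped here is illustrative and may differ from what the PR ends up with.

```python
# Sketch of the numeric-ID approach. The map is built at import time from
# factory calls, so the IDs are consistent within the installed pyarrow
# version. The exact set of mapped types is an assumption.
import pyarrow

ARROW_SCALARS_TO_BQ = {
    arrow_type().id: bq_type
    for arrow_type, bq_type in (
        # Every Arrow integer width collapses to BigQuery's single INT64.
        (pyarrow.int8, "INT64"),
        (pyarrow.int16, "INT64"),
        (pyarrow.int32, "INT64"),
        (pyarrow.int64, "INT64"),
        (pyarrow.bool_, "BOOL"),
        (pyarrow.float64, "FLOAT64"),
        (pyarrow.string, "STRING"),
        (pyarrow.date32, "DATE"),
    )
}

# Lookup goes through the type's .id rather than the type object itself.
detected = ARROW_SCALARS_TO_BQ.get(pyarrow.int32().id)  # -> "INT64"
```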
I extended the types map in a provisional new commit, feel free to scrutinize. :)
curated_schema.append(schema_field)
continue

detected_type = ARROW_SCALARS_TO_BQ.get(
Probably best if we make a function for Arrow type + field name -> BQ field now, since we know we'll want to handle repeated / nested fields at some point.
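Something along these lines, perhaps (a sketch only: the helper name and the REPEATED handling are hypothetical, building on the ID-keyed map sketched above):

```python
# Hypothetical helper: Arrow type + field name in, BigQuery SchemaField out,
# with a first stab at repeated (list) columns. Not the PR's actual code.
import pyarrow.types
from google.cloud.bigquery import SchemaField


def arrow_to_bq_field(field_name, arrow_type):
    if pyarrow.types.is_list(arrow_type):
        # A list column maps to a REPEATED field of its value type.
        inner = ARROW_SCALARS_TO_BQ.get(arrow_type.value_type.id)
        if inner is not None:
            return SchemaField(field_name, inner, mode="REPEATED")
        return None
    detected = ARROW_SCALARS_TO_BQ.get(arrow_type.id)
    if detected is not None:
        return SchemaField(field_name, detected, mode="NULLABLE")
    return None  # caller decides how to handle undetectable columns
```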
Add more pyarrow types, convert to pyarrow only the columns the schema could not be detected for, etc.
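Roughly, that fallback loop could look like the sketch below; the function and variable names are illustrative rather than the PR's exact code.

```python
# Sketch of the fallback: only columns whose type could not be detected from
# pandas dtypes are converted to pyarrow for a second detection attempt.
import pyarrow
from google.cloud import bigquery


def augment_schema(dataframe, current_bq_schema):
    curated_schema = []
    for schema_field in current_bq_schema:
        if schema_field.field_type is not None:
            curated_schema.append(schema_field)
            continue
        # Let pyarrow infer the type from the column's actual values.
        arrow_type = pyarrow.array(dataframe[schema_field.name]).type
        detected_type = ARROW_SCALARS_TO_BQ.get(arrow_type.id)
        if detected_type is None:
            return None  # at least one column is still undetectable
        curated_schema.append(
            bigquery.SchemaField(schema_field.name, detected_type)
        )
    return tuple(curated_schema)
```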
@tswast Updated the code and extended the tests.
Looking at the code, that would require changing the signatures of the following Pandas helpers:
Unless there are serious performance problems reported, we are probably better off promoting explicit schema usage IMO, and discouraging implicit schemas.
Closes #9206.
This PR adds additional logic to schema autodetection when pyarrow is available and the types could not be detected for all columns.

How to test
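A possible repro along these lines, assuming the scenario from #9206 (the project, dataset, and table IDs are placeholders):

```python
# Hypothetical repro: load a DataFrame with object-dtype columns (strings,
# dates) and no explicit schema, forcing the client to autodetect types.
import datetime

import pandas
from google.cloud import bigquery

client = bigquery.Client()
dataframe = pandas.DataFrame(
    {
        "name": ["foo", "bar"],  # object dtype -> STRING
        "created": [datetime.date(2019, 9, 1), datetime.date(2019, 9, 2)],
    }
)
# No job_config.schema is set, so the column types must be autodetected.
load_job = client.load_table_from_dataframe(
    dataframe, "my_project.my_dataset.my_table"
)
load_job.result()  # before the fix: fails with an incompatible-types error
```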
Actual result (before the fix):
The backend responds with an error (incompatible types).
Expected result (after the fix):
The load job completes successfully, and column types are correctly detected (strings, dates...).
Misc
The code might not work for all scalar types, depending on how good pyarrow's schema detection logic is. Which types do we care about the most?
We should update the test case to include all of them (we currently test strings and dates).