-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update to handle complex avro types for Avro Schema Definition copying to Big Query Schema Definitions #2224
Conversation
…g to Big Query Schema descriptions
luigi/contrib/bigquery_avro.py
Outdated
bq_map_value_field[u'description'] = avro_map_value.items.doc | ||
bq_map_value_field[u'fields'] = get_fields_with_description(bq_map_value_field[u'fields'], avro_map_value.items.fields_dict) | ||
|
||
#Unions nested in maps are not supported |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment repeats what the code says and doesn't explain "why". Is it because we leave implementation for later? Then we could explain exactly that, or say "TODO" instead of "not supported". Is it for another reason? It's hard to tell for a new reader of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the code to have a description above about what is currently supported and changed the comments to reflect that it simply is not yet implemented
luigi/contrib/bigquery_avro.py
Outdated
|
||
#Enum Support | ||
if field_type is avro.schema.EnumSchema: | ||
field[u'description'] = avro_field.type.doc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this and the following assignments overwrite the field[u'description'] = avro_field.doc
above intentionally? I wonder if the type documentation should really take precedence like that. The field documentation is presumably more specific to the concrete usage and the dataset we're loading, thus more interesting to the user than the type documentation, isn't it? Conceivably both could be useful... Concatenate? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cleaned up the confusion considerably here. Realistically grabbing the data from "avro_field.doc" only works in 3 scenarios: PrimitiveSchema, MapSchema, and PrimitiveSchema nested in a UnionSchema. To make the code more explicit I moved this call directly to where it was needed. And just for clarity all other times the call returns 'None' which is automatically dropped so the field would populate no data to the Big Query description
luigi/contrib/bigquery_avro.py
Outdated
|
||
#Map Support | ||
if field_type is avro.schema.MapSchema: | ||
#directly bypass the key and value attributes in BigQuery Schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't understand what "bypass" means in a context like this. We are doing things with ...value...
on the following lines, so what was bypassed? Is this trying to explain the -1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Effectively what this does is Big Query generates a "key" attribute that doesn't have any direct interaction possible with the avro schema. In this use case the code "skips" the key and directly starts work with the value attribute. Cleaned up the comments to try and express that better.
luigi/contrib/bigquery_avro.py
Outdated
|
||
#Unions nested in maps are not supported | ||
if value_field_type is avro.schema.UnionSchema: | ||
logger.info("Unions inside of maps are not supported for documentation.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably no need to make info logs noisy with this. Just explain they're not supported in the docstring of this task, https://github.com/spotify/luigi/pull/2224/files#diff-ce7b46ced67552a1fc5e2bdc4ccb11faL22
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added the details above and removed all excess logging
Some inconsistent style issues. In the surrounding code there's usually a space after #. Needs tests. See https://github.com/spotify/luigi/blob/master/test/contrib/bigquery_gcloud_test.py#L272 |
…d primitives inside of unions support. Updated Job description comments and removed unnecessary logging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated all of my code to reflect your comments and I am going to start looking into updating the integration testing. The framework has a few pieces I don't understand so I will see if I can first set it up to run locally and then I will work on updating the test to reflect the new types that are supported
luigi/contrib/bigquery_avro.py
Outdated
|
||
#Enum Support | ||
if field_type is avro.schema.EnumSchema: | ||
field[u'description'] = avro_field.type.doc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cleaned up the confusion considerably here. Realistically grabbing the data from "avro_field.doc" only works in 3 scenarios: PrimitiveSchema, MapSchema, and PrimitiveSchema nested in a UnionSchema. To make the code more explicit I moved this call directly to where it was needed. And just for clarity all other times the call returns 'None' which is automatically dropped so the field would populate no data to the Big Query description
luigi/contrib/bigquery_avro.py
Outdated
|
||
#Map Support | ||
if field_type is avro.schema.MapSchema: | ||
#directly bypass the key and value attributes in BigQuery Schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Effectively what this does is Big Query generates a "key" attribute that doesn't have any direct interaction possible with the avro schema. In this use case the code "skips" the key and directly starts work with the value attribute. Cleaned up the comments to try and express that better.
luigi/contrib/bigquery_avro.py
Outdated
bq_map_value_field[u'description'] = avro_map_value.items.doc | ||
bq_map_value_field[u'fields'] = get_fields_with_description(bq_map_value_field[u'fields'], avro_map_value.items.fields_dict) | ||
|
||
#Unions nested in maps are not supported |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the code to have a description above about what is currently supported and changed the comments to reflect that it simply is not yet implemented
luigi/contrib/bigquery_avro.py
Outdated
|
||
#Unions nested in maps are not supported | ||
if value_field_type is avro.schema.UnionSchema: | ||
logger.info("Unions inside of maps are not supported for documentation.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added the details above and removed all excess logging
@ulzha Testing suite has been updated and has been successfully run locally 👍 Let me know if you see anything else :) |
@ulzha Boom Flake8 issues all resolved 👍 |
@ulzha any chance on you can take a look at this today? |
Update to handle complex avro types for Avro Schema Definition copying to Big Query Schema Definitions
Description
These Changes handle Complex Avro Schema documentation being copied into Big Query Schema descriptions
Motivation and Context
Allow most Avro documentation fields to be copied into Big Query Schema Descriptions
Have you tested this? If so, how?
I ran many different test avro schema scenarios to manage complex nested schema documentation