Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to handle complex avro types for Avro Schema Definition copying to Big Query Schema Definitions #2224

Merged
merged 6 commits into from
Sep 5, 2017

Conversation

Mandalorian007
Copy link

Update to handle complex avro types for Avro Schema Definition copying to Big Query Schema Definitions

Description

These Changes handle Complex Avro Schema documentation being copied into Big Query Schema descriptions

Motivation and Context

Allow most Avro documentation fields to be copied into Big Query Schema Descriptions

Have you tested this? If so, how?

I ran many different test avro schema scenarios to manage complex nested schema documentation

bq_map_value_field[u'description'] = avro_map_value.items.doc
bq_map_value_field[u'fields'] = get_fields_with_description(bq_map_value_field[u'fields'], avro_map_value.items.fields_dict)

#Unions nested in maps are not supported
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment repeats what the code says and doesn't explain "why". Is it because we leave implementation for later? Then we could explain exactly that, or say "TODO" instead of "not supported". Is it for another reason? It's hard to tell for a new reader of the code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the code to have a description above about what is currently supported and changed the comments to reflect that it simply is not yet implemented


#Enum Support
if field_type is avro.schema.EnumSchema:
field[u'description'] = avro_field.type.doc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this and the following assignments overwrite the field[u'description'] = avro_field.doc above intentionally? I wonder if the type documentation should really take precedence like that. The field documentation is presumably more specific to the concrete usage and the dataset we're loading, thus more interesting to the user than the type documentation, isn't it? Conceivably both could be useful... Concatenate? :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned up the confusion considerably here. Realistically grabbing the data from "avro_field.doc" only works in 3 scenarios: PrimitiveSchema, MapSchema, and PrimitiveSchema nested in a UnionSchema. To make the code more explicit I moved this call directly to where it was needed. And just for clarity all other times the call returns 'None' which is automatically dropped so the field would populate no data to the Big Query description


#Map Support
if field_type is avro.schema.MapSchema:
#directly bypass the key and value attributes in BigQuery Schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't understand what "bypass" means in a context like this. We are doing things with ...value... on the following lines, so what was bypassed? Is this trying to explain the -1?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Effectively what this does is Big Query generates a "key" attribute that doesn't have any direct interaction possible with the avro schema. In this use case the code "skips" the key and directly starts work with the value attribute. Cleaned up the comments to try and express that better.


#Unions nested in maps are not supported
if value_field_type is avro.schema.UnionSchema:
logger.info("Unions inside of maps are not supported for documentation.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably no need to make info logs noisy with this. Just explain they're not supported in the docstring of this task, https://github.com/spotify/luigi/pull/2224/files#diff-ce7b46ced67552a1fc5e2bdc4ccb11faL22

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the details above and removed all excess logging

@ulzha
Copy link
Contributor

ulzha commented Aug 30, 2017

Some inconsistent style issues. In the surrounding code there's usually a space after #.

Needs tests. See https://github.com/spotify/luigi/blob/master/test/contrib/bigquery_gcloud_test.py#L272

…d primitives inside of unions support. Updated Job description comments and removed unnecessary logging.
Copy link
Author

@Mandalorian007 Mandalorian007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated all of my code to reflect your comments and I am going to start looking into updating the integration testing. The framework has a few pieces I don't understand so I will see if I can first set it up to run locally and then I will work on updating the test to reflect the new types that are supported


#Enum Support
if field_type is avro.schema.EnumSchema:
field[u'description'] = avro_field.type.doc
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned up the confusion considerably here. Realistically grabbing the data from "avro_field.doc" only works in 3 scenarios: PrimitiveSchema, MapSchema, and PrimitiveSchema nested in a UnionSchema. To make the code more explicit I moved this call directly to where it was needed. And just for clarity all other times the call returns 'None' which is automatically dropped so the field would populate no data to the Big Query description


#Map Support
if field_type is avro.schema.MapSchema:
#directly bypass the key and value attributes in BigQuery Schema
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Effectively what this does is Big Query generates a "key" attribute that doesn't have any direct interaction possible with the avro schema. In this use case the code "skips" the key and directly starts work with the value attribute. Cleaned up the comments to try and express that better.

bq_map_value_field[u'description'] = avro_map_value.items.doc
bq_map_value_field[u'fields'] = get_fields_with_description(bq_map_value_field[u'fields'], avro_map_value.items.fields_dict)

#Unions nested in maps are not supported
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the code to have a description above about what is currently supported and changed the comments to reflect that it simply is not yet implemented


#Unions nested in maps are not supported
if value_field_type is avro.schema.UnionSchema:
logger.info("Unions inside of maps are not supported for documentation.")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the details above and removed all excess logging

@Mandalorian007
Copy link
Author

@ulzha Testing suite has been updated and has been successfully run locally 👍 Let me know if you see anything else :)

@Mandalorian007
Copy link
Author

@ulzha Boom Flake8 issues all resolved 👍

@Mandalorian007
Copy link
Author

@ulzha any chance on you can take a look at this today?

@ulzha ulzha merged commit 50838d0 into spotify:master Sep 5, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants