[BEAM-6291] Generic BigQuery schema load tests metrics #7614
Force-pushed from 08bb6e5 to de395da.
+1 for simplification and more consistency!
Please see my comments.
-  def _prepare_schema(self, schemas):
-    return [_get_schema_field(schema) for schema in schemas]
+  def _prepare_schema(self):
+    return [get_schema_field(row) for row in SCHEMA]
I think this could be simplified if you rename the key 'type' to 'field_type' in SCHEMA:
SCHEMA = [
{'name': ID_LABEL,
'field_type': 'STRING',
'mode': 'REQUIRED'
},
....
then this line could be simplified to:
return [SchemaField(**row) for row in SCHEMA]
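A minimal sketch of why the rename works, assuming the dictionary keys match the constructor's parameter names exactly. To keep it runnable without the client library, SchemaField here is a stand-in dataclass that mirrors only the first three arguments of google.cloud.bigquery.SchemaField (name, field_type, mode); the label values and prepare_schema are hypothetical and not taken from the PR.

```python
from dataclasses import dataclass


# Stand-in for google.cloud.bigquery.SchemaField so the sketch runs without
# the client library. Only name, field_type and mode are modelled here.
@dataclass(frozen=True)
class SchemaField:
    name: str
    field_type: str
    mode: str = 'NULLABLE'


# Hypothetical label values; the PR defines its own constants.
ID_LABEL = 'test_id'
SUBMIT_TIMESTAMP_LABEL = 'timestamp'

# Keys are named after the constructor parameters, so each row can be
# unpacked directly with ** instead of going through a helper function.
SCHEMA = [
    {'name': ID_LABEL, 'field_type': 'STRING', 'mode': 'REQUIRED'},
    {'name': SUBMIT_TIMESTAMP_LABEL, 'field_type': 'TIMESTAMP',
     'mode': 'REQUIRED'},
]


def prepare_schema(schema=SCHEMA):
    """Builds one SchemaField per row by keyword-unpacking the row dict."""
    return [SchemaField(**row) for row in schema]


fields = prepare_schema()
```

With the old key name ('type'), the same unpacking would raise a TypeError, because the constructor has no parameter called type.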
Thanks a lot! That's a really clever solution. I'll change it this way.
-  def match_and_save(self, result_list):
-    rows_tuple = tuple(self._match_inserts_by_schema(result_list))
-    self._insert_data(rows_tuple)
+  def match_and_save(self, results_lists):
Could you document what type results_list is?
It seems that each item in results_list is a list of dictionaries, and each dict looks like:
{'label': SUBMIT_TIMESTAMP_LABEL, 'value': time.time()}
but I'm not 100% sure.
I think this module would be easier to understand if each item in results_list was a single dict:
{
ID_LABEL: uuid,
SUBMIT_TIMESTAMP_LABEL: time.time(),
METRICS_TYPE_LABEL: RUNTIME_METRIC,
VALUE_LABEL: value,
}
Note that _bq_client.insert_rows() also accepts a list of dicts, so there would be no need to convert the above to tuple form.
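A rough sketch of the dict-per-row shape, under stated assumptions: the label values, make_runtime_row, and this standalone match_and_save are hypothetical, and FakeBigQueryClient is a stand-in that only imitates the part of google.cloud.bigquery.Client.insert_rows used here (it takes a table and a sequence of rows and returns a list of per-row errors, empty on success).

```python
import time
import uuid

# Hypothetical label values; the PR defines its own constants.
ID_LABEL = 'test_id'
SUBMIT_TIMESTAMP_LABEL = 'timestamp'
METRICS_TYPE_LABEL = 'metric'
VALUE_LABEL = 'value'
RUNTIME_METRIC = 'runtime'


def make_runtime_row(test_id, value):
    """Returns one result as a single dict keyed by column name."""
    return {
        ID_LABEL: test_id,
        SUBMIT_TIMESTAMP_LABEL: time.time(),
        METRICS_TYPE_LABEL: RUNTIME_METRIC,
        VALUE_LABEL: value,
    }


class FakeBigQueryClient(object):
    """Stand-in client that records rows instead of calling BigQuery."""

    def __init__(self):
        self.inserted = []

    def insert_rows(self, table, rows):
        self.inserted.extend(rows)
        return []  # An empty error list signals success.


def match_and_save(client, table, rows):
    """Inserts a list of row dicts as-is, with no tuple conversion."""
    errors = client.insert_rows(table, rows)
    if errors:
        raise ValueError('Unable to insert rows: {}'.format(errors))


client = FakeBigQueryClient()
test_id = uuid.uuid4().hex
match_and_save(client, 'metrics_table', [make_runtime_row(test_id, 12.5)])
```

Because each row is keyed by column name, the rows do not depend on the order of the schema fields, which is what makes the matching step unnecessary.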
I agree with you 100%. I had the same impression that the naming was not clear. I will refactor it according to your suggestions. Hopefully that will simplify it.
… documentation and pipelineoption.
Force-pushed from 8223ab0 to ff8d527.
Thank you @udim for the review. It was really helpful. I applied your comments. Do you think it looks OK now?
Run Python PreCommit
LGTM!
Sorry for the delay
No problem, thank you @udim!
It was decided to change the tables to have a metrics column that holds the name of the collected metric. Also added two minor changes: