[druid] Updating refresh logic #4655
Conversation
Force-pushed from 7c69592 to 36bccde.
Codecov Report
@@            Coverage Diff             @@
##           master    #4655      +/-   ##
==========================================
+ Coverage   71.37%   71.49%    +0.12%
==========================================
  Files         190      190
  Lines       14918    14911        -7
  Branches     1102     1102
==========================================
+ Hits        10648    10661       +13
+ Misses       4267     4247       -20
  Partials        3        3
Continue to review full report at Codecov.
Looks good to me. I'll check what the constraints for the table you mentioned look like.
tests/druid_tests.py
Outdated
'metric1': {
    'type': 'FLOAT', 'hasMultipleValues': False,
    'size': 100000, 'cardinality': None, 'errorMessage': None},
},
'aggregators': {
Are we not testing aggregators anymore?
@fabianmenges I'm not the original author of this code so I don't have complete context, but from what I observed while debugging, these aggregators were not being tested. I've re-added this in case I was wrong.
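For context, Druid's segment-metadata response reports aggregators alongside the columns, so a fixture that exercises that code path might look roughly like the sketch below. The metric name and aggregator type are illustrative only, not copied from the actual test file.

# Hypothetical segment-metadata payload for a test fixture (shape only).
SEGMENT_METADATA = [{
    'id': 'some_id',
    'columns': {
        'metric1': {
            'type': 'FLOAT', 'hasMultipleValues': False,
            'size': 100000, 'cardinality': None, 'errorMessage': None},
    },
    'aggregators': {
        'metric1': {
            'type': 'doubleSum', 'name': 'metric1', 'fieldName': 'metric1'},
    },
}]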
superset/connectors/druid/models.py
Outdated
"""Generate metrics based on the column metadata""" | ||
metrics = self.get_metrics() | ||
dbmetrics = ( | ||
db.session.query(DruidMetric) |
Running a loop to issue a query for each metric/column means that many queries have to be made, as opposed to just one to grab them all at once. If you've got tons of metrics and such, this adds up and can be reasonably slow.
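As a rough sketch of the difference (not the code in this PR; it assumes it runs inside the models module where db and DruidMetric are in scope, that get_metrics() returns DruidMetric objects, and that datasource_id is the relevant foreign key):

# N+1 pattern: one database round trip for every metric (slow with many metrics).
for metric in metrics:
    db.session.query(DruidMetric).filter(
        DruidMetric.metric_name == metric.metric_name,
        DruidMetric.datasource_id == self.datasource_id,
    ).first()

# Batched pattern: one round trip fetching every existing metric, keyed by name.
dbmetrics = (
    db.session.query(DruidMetric)
    .filter(DruidMetric.datasource_id == self.datasource_id)
    .filter(DruidMetric.metric_name.in_([m.metric_name for m in metrics]))
)
dbmetrics = {metric.metric_name: metric for metric in dbmetrics}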
@Mogball that was one of my concerns and I agree with your comment. I've refactored the logic to do one query per DruidColumn.
Force-pushed from 51564ee to 75e886e.
superset/connectors/druid/models.py
Outdated
@@ -220,13 +220,13 @@ def refresh(self, datasource_names, merge_flag, refreshAll):
     if datatype == 'hyperUnique' or datatype == 'thetaSketch':
         col_obj.count_distinct = True
     # Allow sum/min/max for long or double
-    if datatype == 'LONG' or datatype == 'DOUBLE':
+    if datatype == 'LONG' or datatype in ('FLOAT', 'DOUBLE'):
could go col_obj.is_num(), which comes from the base column class and includes all these types
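A sketch of what that suggestion might look like in the refresh loop. Whether is_num is a property or a method depends on the base column class, and the sum/min/max attribute names are taken from the comment in the diff above, so treat this as illustrative:

# Allow sum/min/max for any numeric datatype instead of enumerating type names;
# is_num comes from the shared base column class and covers LONG/FLOAT/DOUBLE etc.
if col_obj.is_num():
    col_obj.sum = True
    col_obj.min = True
    col_obj.max = True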
Force-pushed from acd8e5a to ca91e67.
with db.session.no_autoflush:
    db.session.add(metric)

def refresh_metrics(self):
    for col in self.columns:
I mean the same could apply here as well, where it's possible to combine all of the columns' metrics into one query.
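A sketch of that suggestion, assuming get_metrics() on a column returns DruidMetric objects keyed by name and that the datasource model exposes id and columns; all names here are illustrative rather than the final code:

def refresh_metrics(self):
    """Refresh metrics for all columns with a single existence query."""
    # Collect candidate metrics from every column up front.
    metrics_by_name = {}
    for col in self.columns:
        metrics_by_name.update(col.get_metrics())

    # One round trip to discover which of those metrics already exist.
    existing = {
        m.metric_name
        for m in db.session.query(DruidMetric)
        .filter(DruidMetric.datasource_id == self.id)
        .filter(DruidMetric.metric_name.in_(list(metrics_by_name)))
    }

    # Only add the metrics that are missing from the database.
    for name, metric in metrics_by_name.items():
        if name not in existing:
            metric.datasource_id = self.id
            with db.session.no_autoflush:
                db.session.add(metric)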
Force-pushed from ca91e67 to a334220.
(cherry picked from commit f9d85bd)
# Add the missing uniqueness constraints.
for table, column in names.items():
    with op.batch_alter_table(table, naming_convention=conv) as batch_op:
        batch_op.create_unique_constraint(
Hit an issue on this line while upgrading our staging. I wrapped the statement in a try block locally so that I could move forward.
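For reference, a local workaround along those lines could look something like the sketch below; the constraint and column names are placeholders, not the ones in the actual migration, and op, conv and names come from the migration script context:

import logging

# Add the missing uniqueness constraints, skipping any that already exist
# (e.g. from a partially applied earlier upgrade).
for table, column in names.items():
    try:
        with op.batch_alter_table(table, naming_convention=conv) as batch_op:
            batch_op.create_unique_constraint(
                'uq_{}_{}'.format(table, column),  # placeholder constraint name
                [column, 'datasource_id'],         # placeholder column list
            )
    except Exception:
        logging.exception('Could not create unique constraint on %s', table)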
For the record, it was something to the effect of the constraint existing already.
@mistercrunch do you recall if the constraint which existed previously was added manually? As far as I can tell this constraint never existed and thus the upgrade/downgrade logic should be sound.
Can't recall creating it. Could also be that it failed halfway through before or timed out and got this error the next time around... Dunno.
Though the term "refresh" is somewhat vague from a Druid metadata perspective, I sense this translates to create or update. Previously we were creating or updating Druid columns but only creating Druid metrics when the Druid metadata was synced/refreshed.

This PR ensures that refreshing is consistent for both Druid columns and metrics and specifically addresses the following:

- Refresh logic for the DruidMetric class. Previously there was somewhat duplicated logic for both the DruidColumn and DruidMetric classes.
- Replacing generate_metrics with refresh_metrics to imply that we're both creating and updating.
- Using IN rather than a series of ORs.
- The missing uniqueness constraints for the columns and metrics tables, which were added in #3978 (Adding YAML Import-Export for Datasources to CLI). Note that to avoid this MySQL issue the metric_name column was reduced from 512 to 255 characters.
- The --merge option for the refresh_druid command, which should be a flag (true/false) rather than an option.
- get_metrics, in terms of checking whether said metric already exists, to ensure consistency with SQLA, hence why the merging logic is handled in refresh_metrics.

@fabianmenges I only added the missing Druid migrations, however I believe there are additional migrations from your PR (#3978) which are missing for the following tables: table_columns, tables.

to: @mistercrunch @Mogball
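On the --merge point, the distinction is between an option that expects a value and a boolean flag. In argparse terms the flag form looks roughly like the sketch below; the real refresh_druid command in superset/cli.py may declare it differently, so this is only an illustration of the flag behaviour:

import argparse

parser = argparse.ArgumentParser(prog='refresh_druid')

# As a plain option, --merge would require an explicit value, e.g. --merge=True.
# As a flag, its mere presence turns merging on:
parser.add_argument('--merge', action='store_true',
                    help='Use the "merge" property when refreshing Druid metadata')

args = parser.parse_args(['--merge'])
print(args.merge)  # True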