[druid] Updating refresh logic #4655
Conversation
Force-pushed from 7c69592 to 36bccde.
Codecov Report
@@            Coverage Diff             @@
##           master    #4655      +/-   ##
==========================================
+ Coverage   71.37%   71.49%    +0.12%
==========================================
  Files         190      190
  Lines       14918    14911        -7
  Branches     1102     1102
==========================================
+ Hits        10648    10661       +13
+ Misses       4267     4247       -20
  Partials        3        3
Continue to review full report at Codecov.
Looks good to me. I'll check what the constraints for the table you mentioned look like.
tests/druid_tests.py
Outdated
'metric1': {
    'type': 'FLOAT', 'hasMultipleValues': False,
    'size': 100000, 'cardinality': None, 'errorMessage': None},
},
'aggregators': {
Are we not testing aggregators anymore?
@fabianmenges I'm not the original author of this code so I don't have complete context, but from what I observed while debugging, these aggregators were not being tested. I've re-added this in case I was wrong.
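For context, Druid's segment-metadata response reports aggregators alongside the columns, so a fixture that exercises that code path might look roughly like the sketch below. The metric name and aggregator type are illustrative only, not copied from the actual test file.

# Hypothetical segment-metadata payload for a test fixture (shape only).
SEGMENT_METADATA = [{
    'id': 'some_id',
    'columns': {
        'metric1': {
            'type': 'FLOAT', 'hasMultipleValues': False,
            'size': 100000, 'cardinality': None, 'errorMessage': None},
    },
    'aggregators': {
        'metric1': {
            'type': 'doubleSum', 'name': 'metric1', 'fieldName': 'metric1'},
    },
}]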
superset/connectors/druid/models.py
Outdated
"""Generate metrics based on the column metadata""" | ||
metrics = self.get_metrics() | ||
dbmetrics = ( | ||
db.session.query(DruidMetric) |
Running a loop to issue a query for each metric/column means that many queries have to be made, as opposed to just one to grab them all at once. If you've got tons of metrics and such, this adds up and can be reasonably slow.
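As a rough sketch of the difference (not the code in this PR; it assumes it runs inside the models module where db and DruidMetric are in scope, that get_metrics() returns DruidMetric objects, and that datasource_id is the relevant foreign key):

# N+1 pattern: one database round trip for every metric (slow with many metrics).
for metric in metrics:
    db.session.query(DruidMetric).filter(
        DruidMetric.metric_name == metric.metric_name,
        DruidMetric.datasource_id == self.datasource_id,
    ).first()

# Batched pattern: one round trip fetching every existing metric, keyed by name.
dbmetrics = (
    db.session.query(DruidMetric)
    .filter(DruidMetric.datasource_id == self.datasource_id)
    .filter(DruidMetric.metric_name.in_([m.metric_name for m in metrics]))
)
dbmetrics = {metric.metric_name: metric for metric in dbmetrics}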
@Mogball that was one of my concerns and I agree with your comment. I've refactored the logic to do one query per DruidColumn.
Force-pushed from 51564ee to 75e886e.
superset/connectors/druid/models.py
Outdated
@@ -220,13 +220,13 @@ def refresh(self, datasource_names, merge_flag, refreshAll):
     if datatype == 'hyperUnique' or datatype == 'thetaSketch':
         col_obj.count_distinct = True
     # Allow sum/min/max for long or double
-    if datatype == 'LONG' or datatype == 'DOUBLE':
+    if datatype == 'LONG' or datatype in ('FLOAT', 'DOUBLE'):
could go col_obj.is_num(), which comes from the base column class and includes all these types
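A sketch of what that suggestion might look like in the refresh loop. Whether is_num is a property or a method depends on the base column class, and the sum/min/max attribute names are taken from the comment in the diff above, so treat this as illustrative:

# Allow sum/min/max for any numeric datatype instead of enumerating type names;
# is_num comes from the shared base column class and covers LONG/FLOAT/DOUBLE etc.
if col_obj.is_num():
    col_obj.sum = True
    col_obj.min = True
    col_obj.max = True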
Force-pushed from acd8e5a to ca91e67.
with db.session.no_autoflush:
    db.session.add(metric)

def refresh_metrics(self):
    for col in self.columns:
I mean the same could apply here as well, where it's possible to combine all of the columns' metrics into one query.
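A sketch of that suggestion, assuming get_metrics() on a column returns DruidMetric objects keyed by name and that the datasource model exposes id and columns; all names here are illustrative rather than the final code:

def refresh_metrics(self):
    """Refresh metrics for all columns with a single existence query."""
    # Collect candidate metrics from every column up front.
    metrics_by_name = {}
    for col in self.columns:
        metrics_by_name.update(col.get_metrics())

    # One round trip to discover which of those metrics already exist.
    existing = {
        m.metric_name
        for m in db.session.query(DruidMetric)
        .filter(DruidMetric.datasource_id == self.id)
        .filter(DruidMetric.metric_name.in_(list(metrics_by_name)))
    }

    # Only add the metrics that are missing from the database.
    for name, metric in metrics_by_name.items():
        if name not in existing:
            metric.datasource_id = self.id
            with db.session.no_autoflush:
                db.session.add(metric)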
Force-pushed from ca91e67 to a334220.
(cherry picked from commit f9d85bd)
# Add the missing uniqueness constraints.
for table, column in names.items():
    with op.batch_alter_table(table, naming_convention=conv) as batch_op:
        batch_op.create_unique_constraint(
Hit an issue on this line while upgrading our staging. I wrapped the statement in a try block locally so that I could move forward.
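For reference, a local workaround along those lines could look something like the sketch below; the constraint and column names are placeholders, not the ones in the actual migration, and op, conv and names come from the migration script context:

import logging

# Add the missing uniqueness constraints, skipping any that already exist
# (e.g. from a partially applied earlier upgrade).
for table, column in names.items():
    try:
        with op.batch_alter_table(table, naming_convention=conv) as batch_op:
            batch_op.create_unique_constraint(
                'uq_{}_{}'.format(table, column),  # placeholder constraint name
                [column, 'datasource_id'],         # placeholder column list
            )
    except Exception:
        logging.exception('Could not create unique constraint on %s', table)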
For the record, it was something to the effect of the constraint existing already.
@mistercrunch do you recall if the constraint which existed previously was added manually? As far as I can tell this constraint never existed and thus the upgrade/downgrade logic should be sound.
Can't recall creating it. Could also be that it failed halfway through before or timed out and got this error the next time around... Dunno.
Though the term "refresh" is somewhat vague from a Druid metadata perspective, I sense this translates to create or update. Previously we were creating or updating Druid columns but only creating Druid metrics when the Druid metadata was synced/refreshed.

This PR ensures that refreshing is consistent for both Druid columns and metrics and specifically addresses the following:

- Refresh logic for the DruidMetric class. Previously there was somewhat duplicated logic for both the DruidColumn and DruidMetric classes.
- Replacing generate_metrics with refresh_metrics to imply that we're both creating and updating.
- Using IN rather than a series of ORs.
- The missing uniqueness constraints for the columns and metrics tables, which were added in #3978 (Adding YAML Import-Export for Datasources to CLI). Note that to avoid this MySQL issue the metric_name column was reduced from 512 to 255 characters.
- The --merge option for the refresh_druid command, which should be a flag (true/false) rather than an option.
- get_metrics, in terms of checking whether said metric already exists, to ensure consistency with SQLA, hence why the merging logic is handled in refresh_metrics.

@fabianmenges I only added the missing Druid migrations, however I believe there are additional migrations from your PR (#3978) which are missing for the following tables: table_columns, tables.

to: @mistercrunch @Mogball
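On the --merge point, the distinction is between an option that expects a value and a boolean flag. In argparse terms the flag form looks roughly like the sketch below; the real refresh_druid command in superset/cli.py may declare it differently, so this is only an illustration of the flag behaviour:

import argparse

parser = argparse.ArgumentParser(prog='refresh_druid')

# As a plain option, --merge would require an explicit value, e.g. --merge=True.
# As a flag, its mere presence turns merging on:
parser.add_argument('--merge', action='store_true',
                    help='Use the "merge" property when refreshing Druid metadata')

args = parser.parse_args(['--merge'])
print(args.merge)  # True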