
Load async sql lab results early for Presto #4834

Closed

Conversation

timifasubaa (Contributor) commented Apr 17, 2018

This PR enables SQL Lab to return query results early, as soon as the first 100 rows are available. Since this relies on fetchmany, it should be possible for all our databases, but this PR implements it for Presto only.

This has the benefit of making users wait less before seeing query results, enabling them to iterate on their queries faster and to visualize results much earlier in the process.

[screenshot]

Closes #4588

At a high level, the celery worker loads the first 100 rows into an S3 bucket and serves them. When the rest of the data is ready, it combines the old and new results and writes them to the same S3 bucket.
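A minimal sketch of this two-phase flow (purely illustrative, not the PR's actual code; `FakeResultsBackend` stands in for the S3-backed results backend):

```python
import json

class FakeResultsBackend(object):
    """Stand-in for the S3-backed results backend."""
    def __init__(self):
        self.store = {}

    def set(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

def store_prefetch(backend, key, rows, prefetch_rows=100):
    # Phase 1: only the first `prefetch_rows` rows are available early.
    payload = {'status': 'prefetched', 'data': rows[:prefetch_rows]}
    backend.set(key, json.dumps(payload))

def store_full(backend, key, rows):
    # Phase 2: the complete result set overwrites the same key,
    # replacing the prefetch payload.
    backend.set(key, json.dumps({'status': 'success', 'data': rows}))
```

Because both phases write to the same key, the consumer only ever has to know one results key per query.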

@graceguo-supercat @john-bodley @mistercrunch @williaster

@timifasubaa timifasubaa changed the title [WIP] Load async sql lab early for Presto [WIP] Load async sql lab results early for Presto Apr 17, 2018
@timifasubaa timifasubaa force-pushed the load_async_sql_lab_early branch 8 times, most recently from ee27e77 to dcbeeb2 Compare April 17, 2018 01:10
john-bodley (Member) commented Apr 18, 2018

@timifasubaa would you mind including screenshots and/or an animated gif (you can use LICEcap to record one) to provide more context on how the UX has changed?

john-bodley (Member) commented

@timifasubaa if you update your description the following keywords can be used to automatically close issues.

@timifasubaa timifasubaa force-pushed the load_async_sql_lab_early branch 2 times, most recently from 11386cd to e629a3e Compare April 18, 2018 20:44
timifasubaa (Contributor, PR author) commented

@john-bodley Updated both

@timifasubaa timifasubaa force-pushed the load_async_sql_lab_early branch 6 times, most recently from 628ea56 to 6970d1c Compare April 20, 2018 23:38
@timifasubaa timifasubaa changed the title [WIP] Load async sql lab results early for Presto Load async sql lab results early for Presto Apr 20, 2018
@timifasubaa timifasubaa force-pushed the load_async_sql_lab_early branch 9 times, most recently from 2606e22 to de04dea Compare April 24, 2018 04:40
codecov-io commented Apr 24, 2018

Codecov Report

Merging #4834 into master will decrease coverage by 0.29%.
The diff coverage is 50.6%.


@@            Coverage Diff            @@
##           master    #4834     +/-   ##
=========================================
- Coverage   77.51%   77.22%   -0.3%     
=========================================
  Files          44       44             
  Lines        8735     8780     +45     
=========================================
+ Hits         6771     6780      +9     
- Misses       1964     2000     +36
Impacted Files Coverage Δ
superset/utils.py 89.08% <100%> (+0.55%) ⬆️
superset/config.py 92.42% <100%> (+0.11%) ⬆️
superset/db_engine_specs.py 51.8% <21.05%> (-2.19%) ⬇️
superset/sql_lab.py 69.36% <31.25%> (-6.05%) ⬇️
superset/connectors/druid/models.py 80.53% <0%> (ø) ⬆️
superset/viz.py 81.42% <0%> (+0.04%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 556ef44...8d54892.

john-bodley (Member) left a comment

A few general questions:

  1. Currently this is defined for Presto but I was wondering whether it should be enabled for other databases as well. I presume any engine which supports fetchmany could in theory work.
  2. What happens when the result set is less than 1,000 rows?
  3. Can you also add some unit tests?

payload = dict(query_id=query.id)
payload.update({
'status': utils.QueryStatus.PREFETCHED,
'data': cdf.data if cdf.data else [],
I don't think you need the if/else as cdf.data returns an empty array if the data frame is empty per this. Also x or [] is preferred over x if x else [].

})

json_payload = json.dumps(payload, default=utils.json_iso_dttm_ser)
key = '{}'.format(uuid.uuid4())
No need to format this as a string as it's already a string.
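Both of the inline notes above can be checked with a short, purely illustrative snippet. One nuance: `uuid.uuid4()` actually returns a `UUID` object rather than a `str`, so where a string key is genuinely needed, `str()` is the most direct conversion:

```python
import uuid

# `x or []` covers both None and the empty list, so the conditional
# expression `x if x else []` is redundant.
for data in (None, [], [1, 2]):
    assert (data if data else []) == (data or [])

# uuid4() returns a UUID object; str() is the idiomatic way to get a
# string from it (equivalent to '{}'.format(...), but clearer).
key = uuid.uuid4()
assert isinstance(key, uuid.UUID)
assert str(key) == '{}'.format(key)
```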


query = session.query(type(query)).filter_by(id=query.id).one()
if query.status == QueryStatus.STOPPED:
cursor.cancel()
break

if (
config.get('PREFETCH_PRESTO') and
processed_rows > 1000 and
Why is this a hard coded value? Shouldn't this be tied to the limit defined in the prefetch_results method?
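A sketch of the reviewer's suggestion: drive both the threshold check and the row count handed to `prefetch_results` from one shared setting instead of a hard-coded `1000` (names here are hypothetical):

```python
# Hypothetical shared setting; in the PR this would come from
# config.get('PREFETCH_ROWS') rather than a module constant.
PREFETCH_ROWS = 100

def should_prefetch(processed_rows, already_prefetched):
    # Prefetch exactly once, and only after the configured number of
    # rows has been processed.
    return processed_rows > PREFETCH_ROWS and not already_prefetched
```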

'columns': cdf.columns if cdf.columns else [],
'query': query.to_dict(),
})
if store_results:
key = '{}'.format(uuid.uuid4())
if not query.results_key:
query.results_key = '{}'.format(uuid.uuid4())
See note above.

@@ -158,6 +136,9 @@ def handle_error(msg):

if store_results and not results_backend:
return handle_error("Results backend isn't configured.")
cache_timeout = database.cache_timeout
This will never execute and is duplicate logic from line 256.



def get_original_key(prefetch_key):
return prefetch_key[:-9]
Seems a tad brittle.
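One way to make this less brittle: derive the slice from a named suffix rather than a magic `-9` (the suffix name below is hypothetical, inferred only from the 9-character slice):

```python
PREFETCH_SUFFIX = '_prefetch'  # hypothetical name; its length matches the [:-9] slice

def get_original_key(prefetch_key):
    # Deriving the slice length from the suffix documents the intent,
    # and returning the key unchanged fails safe when it is not
    # actually a prefetch key.
    if prefetch_key.endswith(PREFETCH_SUFFIX):
        return prefetch_key[:-len(PREFETCH_SUFFIX)]
    return prefetch_key
```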

blob = results_backend.get(key)
if not blob:
return json_error_response(
'Data could not be retrieved. '
'You may want to re-run the query.',
status=410,
)
if utils.is_prefetch_key(key): # hack to not break when requesting prefetch
Having the term "hack" in a comment is asking for trouble. Maybe we could rethink this.

@timifasubaa timifasubaa force-pushed the load_async_sql_lab_early branch 8 times, most recently from 6f72db0 to 5391cd2 Compare May 16, 2018 23:07
timifasubaa (Contributor, PR author) commented May 19, 2018

The current approach incorporates the various feedback I have received on this PR.

  1. It abandons the Presto-specific logic, uses fetchmany to get the prefetched results early, and extends the feature to any database that supports fetchmany.
  2. It writes the full data to the same bucket where the preloaded data was written. This simplifies the logic.
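The engine-agnostic idea in point 1 can be sketched as follows (a simplification, not the PR's code; `FakeCursor` stands in for any DB-API 2.0 cursor):

```python
class FakeCursor(object):
    """Minimal stand-in for a DB-API 2.0 cursor."""
    def __init__(self, rows):
        self.rows = rows
        self.pos = 0

    def fetchmany(self, size):
        batch = self.rows[self.pos:self.pos + size]
        self.pos += len(batch)
        return batch

    def fetchall(self):
        batch = self.rows[self.pos:]
        self.pos = len(self.rows)
        return batch

def prefetch_then_finish(cursor, prefetch_count):
    # Works for any cursor implementing fetchmany(), not just Presto's.
    first = cursor.fetchmany(prefetch_count)  # served to the UI early
    rest = cursor.fetchall()                  # fetched while the user iterates
    return first, first + rest                # prefetch and combined full result
```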

@timifasubaa timifasubaa force-pushed the load_async_sql_lab_early branch 8 times, most recently from 403058c to c7a3534 Compare June 1, 2018 05:27
mistercrunch (Member) left a comment

While this is a sweet feature, I think it adds a somewhat complex layer on top of something that's already fairly complex and hard to reason about.

I'm not saying we shouldn't do this, just saying that we should make it as straightforward/manageable as possible.

<div>
{progressBar}
</div>
<VisualizeModal
This whole section is copy-pasted from earlier in this very method. It should be refactored into its own component, or at least a renderDataSection method or something like that, which would receive different props/params as needed.

@@ -726,7 +728,29 @@ def extra_table_metadata(cls, database, table_name, schema_name):
}

@classmethod
def handle_cursor(cls, cursor, query, session):
def prefetch_results(cls, cursor, query, cache_timeout, session, limit):
It appears this method is not covered by tests

data = cursor.fetchmany(limit)
column_names = cls.get_normalized_column_names(cursor.description)
cdf = utils.convert_results_to_df(column_names, data)
payload = dict(query_id=query.id)
Much of the logic here is not specific to Presto and should probably live in the base class or outside this module. Maybe something like cache_prefetched_data(data).

@@ -294,6 +294,12 @@ class CeleryConfig(object):
# Timeout duration for SQL Lab synchronous queries
SQLLAB_TIMEOUT = 30

# When set to true, results from asynchronous sql lab are prefetched
PREFETCH_ASYNC = True
Should this be a db-level param?

(not query.has_loaded_early)
):
query.has_loaded_early = True
limit = config.get('PREFETCH_ROWS')
prefetch_count would be a better name than limit as limit has different meaning around limiting the query itself.

@@ -72,8 +74,8 @@ class BaseEngineSpec(object):
inner_joins = True

@classmethod
def fetch_data(cls, cursor, limit):
if cls.limit_method == LimitMethod.FETCH_MANY:
def fetch_data(cls, cursor, limit, prefetch=False):
Does fetch_data need a new arg or does it just need to be called with a limit?

):
query.has_loaded_early = True
limit = config.get('PREFETCH_ROWS')
PrestoEngineSpec.prefetch_results(
Is it possible that, since fetchmany is synchronous, it will prevent publishing any query % progress until some rows are fetched? I think we may be losing or obfuscating progress here. Say a query with a large groupby returning a small result set that takes 5 minutes to scan will show 0% until it can return the small result set. I also wonder how the UI behaves if/when the prefetch and final result arrive at around the same moment (large scan query with small result set): does it flicker, does it look ok?
