
Load async sql lab results early for Presto #4834

Closed

Conversation

timifasubaa (Contributor) commented Apr 17, 2018

This PR enables SQL Lab to return query results early, as soon as the first 100 rows are available. Since this relies on fetchmany, it should be possible for all our databases, but this PR implements it for Presto only.

This has the benefit of making users wait less before seeing query results, enabling them to iterate on their queries faster and to visualize results much earlier in the process.

[screenshot]

Closes #4588

At a high level, the celery worker loads the first 100 rows into an S3 bucket and serves them. When the rest of the data is ready, it combines the old and new results and writes them to the same S3 bucket.
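A minimal sketch of this two-phase flow (purely illustrative, not the PR's actual code; `FakeResultsBackend` stands in for the S3-backed results backend):

```python
import json

class FakeResultsBackend(object):
    """Stand-in for the S3-backed results backend."""
    def __init__(self):
        self.store = {}

    def set(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

def store_prefetch(backend, key, rows, prefetch_rows=100):
    # Phase 1: only the first `prefetch_rows` rows are available early.
    payload = {'status': 'prefetched', 'data': rows[:prefetch_rows]}
    backend.set(key, json.dumps(payload))

def store_full(backend, key, rows):
    # Phase 2: the complete result set overwrites the same key,
    # replacing the prefetch payload.
    backend.set(key, json.dumps({'status': 'success', 'data': rows}))
```

Because both phases write to the same key, the consumer only ever has to know one results key per query.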

@graceguo-supercat @john-bodley @mistercrunch @williaster

@timifasubaa timifasubaa changed the title [WIP] Load async sql lab early for Presto [WIP] Load async sql lab results early for Presto Apr 17, 2018
@timifasubaa timifasubaa force-pushed the load_async_sql_lab_early branch 8 times, most recently from ee27e77 to dcbeeb2 Compare April 17, 2018 01:10
john-bodley (Member) commented Apr 18, 2018

@timifasubaa would you mind including screenshots and/or an animated gif (you can use LICEcap to record one) to provide more context on how the UX has changed?

john-bodley (Member) commented

@timifasubaa if you update your description the following keywords can be used to automatically close issues.

@timifasubaa timifasubaa force-pushed the load_async_sql_lab_early branch 2 times, most recently from 11386cd to e629a3e Compare April 18, 2018 20:44
timifasubaa (Contributor, PR author) commented

@john-bodley Updated both

@timifasubaa timifasubaa force-pushed the load_async_sql_lab_early branch 6 times, most recently from 628ea56 to 6970d1c Compare April 20, 2018 23:38
@timifasubaa timifasubaa changed the title [WIP] Load async sql lab results early for Presto Load async sql lab results early for Presto Apr 20, 2018
@timifasubaa timifasubaa force-pushed the load_async_sql_lab_early branch 9 times, most recently from 2606e22 to de04dea Compare April 24, 2018 04:40
codecov-io commented Apr 24, 2018

Codecov Report

Merging #4834 into master will decrease coverage by 0.29%.
The diff coverage is 50.6%.


@@            Coverage Diff            @@
##           master    #4834     +/-   ##
=========================================
- Coverage   77.51%   77.22%   -0.3%     
=========================================
  Files          44       44             
  Lines        8735     8780     +45     
=========================================
+ Hits         6771     6780      +9     
- Misses       1964     2000     +36
Impacted Files Coverage Δ
superset/utils.py 89.08% <100%> (+0.55%) ⬆️
superset/config.py 92.42% <100%> (+0.11%) ⬆️
superset/db_engine_specs.py 51.8% <21.05%> (-2.19%) ⬇️
superset/sql_lab.py 69.36% <31.25%> (-6.05%) ⬇️
superset/connectors/druid/models.py 80.53% <0%> (ø) ⬆️
superset/viz.py 81.42% <0%> (+0.04%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 556ef44...8d54892.

john-bodley (Member) left a comment

A few general questions:

  1. Currently this is defined for Presto but I was wondering whether it should be enabled for other databases as well. I presume any engine which supports fetchmany could in theory work.
  2. What happens when the result set is less than 1,000 rows?
  3. Can you also add some unit tests?

payload = dict(query_id=query.id)
payload.update({
'status': utils.QueryStatus.PREFETCHED,
'data': cdf.data if cdf.data else [],
I don't think you need the if/else as cdf.data returns an empty array if the data frame is empty per this. Also x or [] is preferred over x if x else [].

})

json_payload = json.dumps(payload, default=utils.json_iso_dttm_ser)
key = '{}'.format(uuid.uuid4())
No need to format this as a string as it's already a string.
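Both of the inline notes above can be checked with a short, purely illustrative snippet. One nuance: `uuid.uuid4()` actually returns a `UUID` object rather than a `str`, so where a string key is genuinely needed, `str()` is the most direct conversion:

```python
import uuid

# `x or []` covers both None and the empty list, so the conditional
# expression `x if x else []` is redundant.
for data in (None, [], [1, 2]):
    assert (data if data else []) == (data or [])

# uuid4() returns a UUID object; str() is the idiomatic way to get a
# string from it (equivalent to '{}'.format(...), but clearer).
key = uuid.uuid4()
assert isinstance(key, uuid.UUID)
assert str(key) == '{}'.format(key)
```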


query = session.query(type(query)).filter_by(id=query.id).one()
if query.status == QueryStatus.STOPPED:
cursor.cancel()
break

if (
config.get('PREFETCH_PRESTO') and
processed_rows > 1000 and
Why is this a hard coded value? Shouldn't this be tied to the limit defined in the prefetch_results method?
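A sketch of the reviewer's suggestion: drive both the threshold check and the row count handed to `prefetch_results` from one shared setting instead of a hard-coded `1000` (names here are hypothetical):

```python
# Hypothetical shared setting; in the PR this would come from
# config.get('PREFETCH_ROWS') rather than a module constant.
PREFETCH_ROWS = 100

def should_prefetch(processed_rows, already_prefetched):
    # Prefetch exactly once, and only after the configured number of
    # rows has been processed.
    return processed_rows > PREFETCH_ROWS and not already_prefetched
```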

'columns': cdf.columns if cdf.columns else [],
'query': query.to_dict(),
})
if store_results:
key = '{}'.format(uuid.uuid4())
if not query.results_key:
query.results_key = '{}'.format(uuid.uuid4())
See note above.

@@ -158,6 +136,9 @@ def handle_error(msg):

if store_results and not results_backend:
return handle_error("Results backend isn't configured.")
cache_timeout = database.cache_timeout
This will never execute and is duplicate logic from line 256.



def get_original_key(prefetch_key):
return prefetch_key[:-9]
Seems a tad brittle.
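One way to make this less brittle: derive the slice from a named suffix rather than a magic `-9` (the suffix name below is hypothetical, inferred only from the 9-character slice):

```python
PREFETCH_SUFFIX = '_prefetch'  # hypothetical name; its length matches the [:-9] slice

def get_original_key(prefetch_key):
    # Deriving the slice length from the suffix documents the intent,
    # and returning the key unchanged fails safe when it is not
    # actually a prefetch key.
    if prefetch_key.endswith(PREFETCH_SUFFIX):
        return prefetch_key[:-len(PREFETCH_SUFFIX)]
    return prefetch_key
```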

blob = results_backend.get(key)
if not blob:
return json_error_response(
'Data could not be retrieved. '
'You may want to re-run the query.',
status=410,
)
if utils.is_prefetch_key(key): # hack to not break when requesting prefetch
Having the term "hack" in a comment is asking for trouble. Maybe we could rethink this.

@timifasubaa timifasubaa force-pushed the load_async_sql_lab_early branch 8 times, most recently from 6f72db0 to 5391cd2 Compare May 16, 2018 23:07
timifasubaa (Contributor, PR author) commented May 19, 2018

The current approach incorporates the various feedback I have received on this PR.

  1. It abandons the Presto-specific logic, uses fetchmany to get the prefetched results early, and extends the feature to any database that supports fetchmany.
  2. It writes the full data to the same bucket where the preloaded data was written. This simplifies the logic.
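The engine-agnostic idea in point 1 can be sketched as follows (a simplification, not the PR's code; `FakeCursor` stands in for any DB-API 2.0 cursor):

```python
class FakeCursor(object):
    """Minimal stand-in for a DB-API 2.0 cursor."""
    def __init__(self, rows):
        self.rows = rows
        self.pos = 0

    def fetchmany(self, size):
        batch = self.rows[self.pos:self.pos + size]
        self.pos += len(batch)
        return batch

    def fetchall(self):
        batch = self.rows[self.pos:]
        self.pos = len(self.rows)
        return batch

def prefetch_then_finish(cursor, prefetch_count):
    # Works for any cursor implementing fetchmany(), not just Presto's.
    first = cursor.fetchmany(prefetch_count)  # served to the UI early
    rest = cursor.fetchall()                  # fetched while the user iterates
    return first, first + rest                # prefetch and combined full result
```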

@timifasubaa timifasubaa force-pushed the load_async_sql_lab_early branch 8 times, most recently from 403058c to c7a3534 Compare June 1, 2018 05:27
mistercrunch (Member) left a comment

While this is a sweet feature, I think it adds a somewhat complex layer on top of something that's already fairly complex and hard to reason about.

I'm not saying we shouldn't do this, just saying that we should make it as straightforward/manageable as possible.

<div>
{progressBar}
</div>
<VisualizeModal
This whole section is copy-pasted from earlier in this very method. It should be refactored into its own component, or at least a renderDataSection method or something like that, which would receive different props/params as needed.

@@ -726,7 +728,29 @@ def extra_table_metadata(cls, database, table_name, schema_name):
}

@classmethod
def handle_cursor(cls, cursor, query, session):
def prefetch_results(cls, cursor, query, cache_timeout, session, limit):
It appears this method is not covered by tests

data = cursor.fetchmany(limit)
column_names = cls.get_normalized_column_names(cursor.description)
cdf = utils.convert_results_to_df(column_names, data)
payload = dict(query_id=query.id)
Much of the logic here is not specific to Presto and should probably live in the base class or outside this module. Maybe something like cache_prefetched_data(data).

@@ -294,6 +294,12 @@ class CeleryConfig(object):
# Timeout duration for SQL Lab synchronous queries
SQLLAB_TIMEOUT = 30

# When set to true, results from asynchronous sql lab are prefetched
PREFETCH_ASYNC = True
Should this be a db-level param?

(not query.has_loaded_early)
):
query.has_loaded_early = True
limit = config.get('PREFETCH_ROWS')
prefetch_count would be a better name than limit as limit has different meaning around limiting the query itself.

@@ -72,8 +74,8 @@ class BaseEngineSpec(object):
inner_joins = True

@classmethod
def fetch_data(cls, cursor, limit):
if cls.limit_method == LimitMethod.FETCH_MANY:
def fetch_data(cls, cursor, limit, prefetch=False):
Does fetch_data need a new arg or does it just need to be called with a limit?

):
query.has_loaded_early = True
limit = config.get('PREFETCH_ROWS')
PrestoEngineSpec.prefetch_results(
Is it possible that, since fetchmany is synchronous, it will prevent publishing any query % progress until some rows are fetched? I think we may be losing or obfuscating progress here. Say a query with a large groupby returning a small result set that takes 5 minutes to scan will show 0% until it can return the small result set. I also wonder how the UI behaves if/when the prefetch and final result arrive at around the same moment (large scan query with small result set): does it flicker, does it look ok?
