Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: report with timeout chart #16674

Conversation

graceguo-supercat
Copy link

@graceguo-supercat graceguo-supercat commented Sep 11, 2021

SUMMARY

A user schedule may an email report for a dashboard in the future time. When webdriver takes screeshot of the dashboard, one or more of its charts may not well cached, and may take pretty long time to query. Current webdriver wait timeout is 60 seconds. If one of charts didn't get rendered after timeout, currently Superset will report an email with following error message but no screenshot of dashboard.

Error: Report Schedule execution failed when generating a screenshot.

To most of dashboard report users, a report with a couple of timeout chart (showing spinner) is still acceptable, much better than an error message without report. Could we still send report with timeout chart? @dpgaspar @betodealmeida @eschutho

TESTING INSTRUCTIONS

CI and manual test

@codecov
Copy link

codecov bot commented Sep 11, 2021

Codecov Report

Merging #16674 (45c94ee) into master (4dc859f) will decrease coverage by 0.00%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16674      +/-   ##
==========================================
- Coverage   76.90%   76.90%   -0.01%     
==========================================
  Files        1005     1005              
  Lines       54007    54008       +1     
  Branches     7337     7337              
==========================================
  Hits        41536    41536              
- Misses      12231    12232       +1     
  Partials      240      240              
Flag Coverage Δ
hive 81.19% <0.00%> (-0.01%) ⬇️
mysql 81.65% <0.00%> (-0.01%) ⬇️
postgres 81.68% <0.00%> (-0.01%) ⬇️
presto 81.47% <0.00%> (-0.05%) ⬇️
python 82.18% <0.00%> (-0.01%) ⬇️
sqlite 81.33% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/utils/webdriver.py 80.89% <0.00%> (-0.92%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4dc859f...45c94ee. Read the comment docs.

@eschutho
Copy link
Member

@graceguo-supercat we're actually looking into getting more up-to-date data on the charts by always refreshing the data and not using cached results. I wonder, given this information if we could set a cache timeout on reports for something like 30 minutes and instead of erroring after 60 seconds either lengthening the timeout or retrying the dashboard after an error. It seems like we should deliver all of the charts to the users, and if we have to keep trying until the charts are all loaded, that should be ok, since it's an async operation anyway. WDYT?

@graceguo-supercat
Copy link
Author

graceguo-supercat commented Sep 13, 2021

  • i like the idea about invalidate cache for report.
  • report cache and dashboard cache should be able to set separately.
  • I am not sure retry can solve this problem. Let say i have a slow query always taking ~100 seconds to run. During report time, webdriver triggers the query, after 60 seconds it got timeout exception, then the dashboard report will be hold for retry? And after webdriver triggered the 2nd try, it could not get the query results from 1st try at 100 seconds, right? So when is the best time to trigger next try?
  • currently we have 60 second timeout, can we have extra config allow wait longer? As a report, in my opinion, the queries should be verified and optimized, so that each chart/query should no longer than a limit. 60 second seems a little tough, maybe 120 seconds is good enough (without retry)?

@john-bodley
Copy link
Member

@eschutho regardless of the timeout strategy are you supportive of at least capturing the screenshot per the PR?

@betodealmeida
Copy link
Member

I think this PR is an improvement, regardless of the strategy we choose.

@@ -141,6 +141,7 @@ def get_screenshot(
img = element.screenshot_as_png
except TimeoutException:
logger.warning("Selenium timed out requesting url %s", url, exc_info=True)
img = element.screenshot_as_png
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it always possible to call element.screenshot_as_png without raising when a TimeoutException occurs?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess it's webdriver raise TimeoutException right? i am not sure how to not raise it...

@dpgaspar
Copy link
Member

  • i like the idea about invalidate cache for report.
  • report cache and dashboard cache should be able to set separately.

+1

  • I am not sure retry can solve this problem. Let say i have a slow query always taking ~100 seconds to run. During report time, webdriver triggers the query, after 60 seconds it got timeout exception, then the dashboard report will be hold for retry? And after webdriver triggered the 2nd try, it could not get the query results from 1st try at 100 seconds, right? So when is the best time to trigger next try?
  • currently we have 60 second timeout, can we have extra config allow wait longer? As a report, in my opinion, the queries should be verified and optimized, so that each chart/query should no longer than a limit. 60 second seems a little tough, maybe 120 seconds is good enough (without retry)?

you mean besides SCREENSHOT_LOCATE_WAIT and SCREENSHOT_LOAD_WAIT?

@eschutho
Copy link
Member

eschutho commented Sep 14, 2021

@eschutho regardless of the timeout strategy are you supportive of at least capturing the screenshot per the PR?

yes, definitely. This is an good win for now.

@graceguo-supercat
Copy link
Author

you mean besides SCREENSHOT_LOCATE_WAIT and SCREENSHOT_LOAD_WAIT?

that's what i need! I didn't realize these wait time already configurable. Thanks!

@eschutho
Copy link
Member

  • I am not sure retry can solve this problem. Let say i have a slow query always taking ~100 seconds to run. During report time, webdriver triggers the query, after 60 seconds it got timeout exception, then the dashboard report will be hold for retry? And after webdriver triggered the 2nd try, it could not get the query results from 1st try at 100 seconds, right? So when is the best time to trigger next try?

Is 100 seconds with the data cached or uncached? If we set a configurable report cache buffer for example, and your longest query is at 30 seconds cached and 100 seconds uncached, and your total waits were 1 min, you could set the buffer for half an hour or so and the first time it loads it will fail and retry and be mostly cached. You should keep getting lower times until it it is all cached and you hit below the 60 sec mark.

Basically I think cache-busting the reports is good but within configurable limits, depending on the needs of the organization.

@graceguo-supercat
Copy link
Author

you could set the buffer for half an hour or so and the first time it loads it will fail and retry and be mostly cached.

When did re-try happen? right after 1st fail or after cache expired? If i set schedule to send report at 7:00, and 1st try failed, did i get an email at 7:00? I think we probably should discuss cache/retry in its own PR or thread.

This PR is to make sure sending out report even with chart timeout.

@eschutho
Copy link
Member

you could set the buffer for half an hour or so and the first time it loads it will fail and retry and be mostly cached.

When did re-try happen? right after 1st fail or after cache expired? If i set schedule to send report at 7:00, and 1st try failed, did i get an email at 7:00? I think we probably should discuss cache/retry in its own PR or thread.

This PR is to make sure sending out report even with chart timeout.

Yes, that discussion isn't related to your pr.. we can continue in a different thread! 👍

@graceguo-supercat graceguo-supercat merged commit 00ca21e into apache:master Sep 15, 2021
opus-42 pushed a commit to opus-42/incubator-superset that referenced this pull request Nov 14, 2021
QAlexBall pushed a commit to QAlexBall/superset that referenced this pull request Dec 28, 2021
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.4.0 labels Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/XS 🚢 1.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants