Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add retry to SQL-based alerting celery task #10542

Merged
merged 8 commits into from
Aug 10, 2020

Conversation

JasonD28
Copy link
Contributor

@JasonD28 JasonD28 commented Aug 6, 2020

SUMMARY

Multiple errors such as sqlalchemy.exc.NoSuchColumnError, sqlalchemy.exc.ResourceClosedError and MySQLdb._exceptions.OperationalError were being seen when multiple alert celery tasks were run concurrently. These errors occurred during the first few iterations of a celery beat schedule after initialization, however they would stop after after a few iterations of the beat schedule. This PR introduces retrying along with the celery task as a temporary fix for #10530. The use of db.session is also standardized in the alerting workflow.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TEST PLAN

  • local test
  • dropbox staging
  • dropbox production

ADDITIONAL INFORMATION

Copy link
Member

@dpgaspar dpgaspar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a couple of comments/doubts

else:
raise RuntimeError("Unknown report type")
except NoSuchColumnError as column_error:
stats_logger.incr("run_alert_task.failure.NoSuchColumnError")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the API follows a fixed pattern on statsd metrics: https://github.com/apache/incubator-superset/blob/master/superset/views/base_api.py#L173

So we use ..error|success|init. My take is that it's desirable to just incr an error counter (at the method level is granular enough) and also log the error with a more detailed message or exception

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case we want to differentiate between NoSuchColumnError & ResourceClosedError - still researching the error
s/failure/error - is a good idea
we should probably use snake case for the metric name

@@ -569,7 +583,7 @@ class AlertState:


def deliver_alert(alert_id: int, recipients: Optional[str] = None) -> None:
alert = db.session.query(Alert).get(alert_id)
alert = db.create_scoped_session().query(Alert).get(alert_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting, @john-bodley submitted this: #10427

is this necessary because it's called by a celery task, and we want to make sure the session gets removed on task completion? What's the effect if this is called by a web request with proper session scope?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I'm actually going to revert this change and just use regular session. Both types of sessions cause errors with the underlying cause remaining unknown for now.


logger.info("Processing alert ID: %i", alert.id)
database = alert.database
dbsession = db.create_scoped_session()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we call it db_session

else:
raise RuntimeError("Unknown report type")
except NoSuchColumnError as column_error:
stats_logger.incr("run_alert_task.failure.NoSuchColumnError")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case we want to differentiate between NoSuchColumnError & ResourceClosedError - still researching the error
s/failure/error - is a good idea
we should probably use snake case for the metric name

@@ -533,6 +535,9 @@ def schedule_email_report( # pylint: disable=unused-argument
name="alerts.run_query",
bind=True,
soft_time_limit=config["EMAIL_ASYNC_TIME_LIMIT_SEC"],
autoretry_for=(NoSuchColumnError, ResourceClosedError,),
retry_kwargs={"max_retries": 5},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a todo and the link to the issue - to get rid of retry once the underlying issue is resolved

@bkyryliuk
Copy link
Member

left a couple of comments/doubts
@dpgaspar we cleaned up session changes & renamed metric to be error, however will keep the error type for some time to have better visibility into the root causes.

@bkyryliuk bkyryliuk merged commit 8b9292e into apache:master Aug 10, 2020
Ofeknielsen pushed a commit to ofekisr/incubator-superset that referenced this pull request Oct 5, 2020
* added retry and minimized sqlalchemy object lives

* pylint

* added try catch

* adjusted naming

* added scoped session

* update tests for dbsession

* added requested changes

* nit todo

Co-authored-by: Jason Davis <@dropbox.com>
auxten pushed a commit to auxten/incubator-superset that referenced this pull request Nov 20, 2020
* added retry and minimized sqlalchemy object lives

* pylint

* added try catch

* adjusted naming

* added scoped session

* update tests for dbsession

* added requested changes

* nit todo

Co-authored-by: Jason Davis <@dropbox.com>
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.38.0 labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L 🚢 0.38.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants