Retry on transient errors #3303
Comments
@jweibel22 Thanks for opening! I see two ways in which dbt could "retry" on transient errors:
I think the first is preferable, if possible. Ultimately, both are worth doing. The mileage will vary significantly by adapter; as you say, the first and biggest challenge is reliably identifying which errors are transient and worth retrying.
@jtcohen6 what if, instead of trying to identify all the transient errors for all the different connectors, we start by allowing a list of exceptions to be defined in profiles.yml that you want to retry on?
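For example, something along these lines, where `retry_on_exceptions` is a hypothetical key (no such option exists in dbt today) and the entries name driver exception classes to treat as retryable:

```yaml
# profiles.yml -- illustrative only; `retry_on_exceptions` is a hypothetical
# key, not something dbt actually supports.
my_profile:
  target: prod
  outputs:
    prod:
      type: redshift
      # ... connection details ...
      retry_on_exceptions:
        - psycopg2.OperationalError            # dropped connections, DNS hiccups
        - psycopg2.errors.SerializationFailure # retryable serialization conflicts
```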
I'm going to leave this in the current repo, but since the required changes are adapter-specific, it's likely that we'll want to open issues in the adapter repos as well. (In the case of Redshift, too, the change would probably involve …)
Would love this functionality as well. Here's a specific use case. We are using RDS Aurora Serverless, which has auto-scaling built in. When it starts to scale, it kicks off existing connections, and it starts to scale every time there is load, which is often the case when refreshing lots of dbt models at once. It then kicks dbt out, and the entire run fails because of maybe one or two models that were disconnected. Our runs happen every few hours; by then RDS has already scaled down, and when dbt runs the next time, it fails again. In the end, pretty much every run fails in some way. The only time it does not fail is when the cluster was already scaled out because some other job triggered the scaling event. To work around that we'll implement forced scaling before dbt runs, but I wish there was just a way to retry certain errors, like disconnects.
I stumbled across this issue while debugging some sporadic timeouts when running dbt on our Redshift cluster. I'm commenting to reinforce the need for this feature with my current situation: from the Redshift connection logs I can see when dbt attempts to connect, and these attempts are successful. However, after about 10 seconds, dbt reports a timeout error (consistent with the default connection timeout). Having the ability to immediately retry when connecting would be very useful for us. For the time being, we are relying on … Thanks for everything!
👍 on this issue, we are hitting this with Databricks too
Same issue here, using sources that are secured views provided through a Snowflake datashare. Once every 24 hours, the views are dropped and re-created. I don't know why this doesn't happen in a single transaction, but the net result is that the views are unavailable for a few seconds. If a job relying on that source happens to be running at that time, it fails and returns an error. Either retry logic or any other way to handle this situation gracefully would be great!
Some sort of retry at the model level would be helpful. I have a model in Amazon Aurora PostgreSQL that calls a SageMaker endpoint for ML inference. Sometimes there are invocation timeouts when the endpoint is too busy, resulting in a failed dbt model. The endpoint autoscales, so simply re-running the model usually works. I would really like some sort of option that I can apply to specific models to retry a model N times before considering it failed.
Hi, how did you all get through this? We are facing a similar issue on Redshift now when random reboots happen. We would love to have a retry option in dbt Cloud jobs.
There is the …
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue, or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Our use case is similar to @kinghuang's: we have some specific models that we would like to retry upon failure (any failure). We have UDFs calling external services, so all kinds of transient issues can happen (service issues, network issues, etc.). Ideally, we'd be looking for something like this:
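A minimal sketch of what such a per-model config could look like, assuming hypothetical `retries` and `retry_delay` keys that dbt does not currently offer:

```yaml
# dbt_project.yml -- illustrative only; `retries` and `retry_delay`
# are hypothetical, not real dbt configs.
models:
  my_project:
    my_udf_model:
      +retries: 2        # on failure, try again up to two more times
      +retry_delay: 30   # wait 30 seconds between attempts
```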
The above would mean: if the model fails, wait 30 seconds and then try again, up to a maximum of two times. It's a fairly simple feature, and it would benefit everyone.
Describe the feature
This has come up here before. Sometimes transient errors occur, and it would be nice if dbt could automatically retry on those occasions. Right now, those transient errors cause our nightly dbt pipeline to fail, which blocks downstream pipelines.
Specifically, we're dealing with some hard-to-track DNS resolution problems in our network setup, which cause some flakiness.
Describe alternatives you've considered
Our dbt pipeline is run from an Airflow DAG, and the only way to retry is to re-run the DAG, which runs the entire dbt pipeline. We could implement better support for running only specific models in our production environment so we can fix the problem faster; however, this would still require manual work and cause a significant delay, since it wouldn't happen until someone notices the problem in the morning.
Additional context
We're using Redshift, but the problem is not specific to any one database.
Who will this benefit?
Everyone who deals with networking and other operational issues.
Are you interested in contributing this feature?
We don't mind contributing to this.
The first problem is identifying which errors are transient (worth retrying) and which are not: https://www.psycopg.org/docs/errors.html
It might be a good idea to leave this decision to the user and let them configure a retry strategy (while providing a sensible default), as sketched below.
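For instance, a user-facing retry strategy could be sketched like this (every key under `retry` is hypothetical; dbt does not implement any of them), with the default set of "transient" error codes drawn from the psycopg error list above:

```yaml
# profiles.yml -- hypothetical `retry` block, shown only to illustrate the idea
my_profile:
  outputs:
    prod:
      type: redshift
      # ... connection details ...
      retry:
        max_attempts: 3       # sensible default if the user configures nothing
        backoff_seconds: 10   # wait between attempts
        retry_on:             # SQLSTATE codes treated as transient by default
          - "08006"           # connection_failure
          - "40001"           # serialization_failure
          - "40P01"           # deadlock_detected
          - "57P01"           # admin_shutdown
```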