
Retry on transient errors #3303

Closed
jweibel22 opened this issue Apr 28, 2021 · 15 comments
Labels
enhancement (New feature or request) · paper_cut (A small change that impacts lots of users in their day-to-day) · postgres · stale (Issues that have gone stale) · Team:Adapters (Issues designated for the adapter area of the code)

Comments

@jweibel22
Contributor

Describe the feature

This has come up here before. Sometimes transient errors occur, and it would be nice if dbt could automatically retry on those occasions. Right now, those transient errors cause our nightly dbt pipeline to fail, which blocks downstream pipelines.

Specifically, we're dealing with some hard-to-track DNS resolution problems in our network setup, which cause intermittent flakiness.

Describe alternatives you've considered

Our dbt pipeline is run from an Airflow DAG, and the only way to retry is to re-run the DAG, which runs the entire dbt pipeline. We could implement better support for running only specific models in our production environment so we could fix the problem faster, but that would still require manual work and cause a significant delay, since it wouldn't happen until someone notices the problem in the morning.

Additional context

We're using Redshift, but the problem is not specific to any one database.

Who will this benefit?

Everyone who deals with networking and other operational issues.

Are you interested in contributing this feature?

We don't mind contributing to this.
The first problem is identifying which errors are transient (worth retrying) and which are not: https://www.psycopg.org/docs/errors.html
It might be a good idea to leave this decision to the user and let them configure a retry strategy (while providing a sensible default).
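
As a starting point, here is a rough sketch of what transient-error classification for psycopg2 might look like; the SQLSTATE classes chosen here (08 connection exceptions, 53 insufficient resources, 57P01 admin_shutdown) are illustrative assumptions, not an exhaustive or agreed-upon list:

import psycopg2

# SQLSTATE prefixes / codes treated as transient in this sketch (assumption).
RETRYABLE_SQLSTATE_PREFIXES = ("08", "53")   # connection exception, insufficient resources
RETRYABLE_SQLSTATES = {"57P01"}              # admin_shutdown (e.g. node restart)

def is_transient(exc: Exception) -> bool:
    """Best-effort guess at whether a psycopg2 error is worth retrying."""
    code = getattr(exc, "pgcode", None)
    if isinstance(exc, psycopg2.OperationalError) and code is None:
        # Connection-level failures (DNS resolution, dropped sockets) often carry no SQLSTATE.
        return True
    if code is None:
        return False
    return code in RETRYABLE_SQLSTATES or code.startswith(RETRYABLE_SQLSTATE_PREFIXES)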

@jweibel22 jweibel22 added enhancement New feature or request triage labels Apr 28, 2021
@jtcohen6 jtcohen6 added redshift and removed triage labels May 10, 2021
@jtcohen6
Contributor

jtcohen6 commented May 10, 2021

@jweibel22 Thanks for opening!

I see two ways in which dbt could "retry" on transient errors:

  1. Query-level: If a query returns a transient error from the database, dbt's connection logic should handle that error and retry the query.
  2. Invocation-level: If a single dbt model failed, rather than re-running the entire DAG, it should be possible to select and re-run just the models with status:error or status:skipped in run_results.json (see #2465, "Add pseudo selectors that select models based on artifact states").

I think the first is preferable, if possible. Ultimately, both are worth doing.
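
For the invocation-level approach (option 2), a minimal sketch that reads run_results.json and collects the models that errored or were skipped, so an orchestrator could re-select just those; the artifact path and status values are assumptions based on current run_results.json output:

import json

def models_to_retry(path: str = "target/run_results.json") -> list[str]:
    """Return unique_ids of models that errored or were skipped in the last run."""
    with open(path) as f:
        results = json.load(f)["results"]
    return [r["unique_id"] for r in results if r.get("status") in ("error", "skipped")]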

The mileage will vary significantly by adapter; as you say, the first and biggest challenge is reliably identifying which errors are transient and worth retrying. In fact, dbt-bigquery already has logic for this today, since google.api_core offers a retry helper (docs).

https://github.com/fishtown-analytics/dbt/blob/8d39ef16b67fcc3c30e8a18751b4e64831ec9aaf/plugins/bigquery/dbt/adapters/bigquery/connections.py#L44-L49

https://github.com/fishtown-analytics/dbt/blob/8d39ef16b67fcc3c30e8a18751b4e64831ec9aaf/plugins/bigquery/dbt/adapters/bigquery/connections.py#L337

If we can identify similar error codes for psycopg2, we could implement this for dbt-postgres + dbt-redshift as well. (I found a thread here that may hold promise.) Or, if Amazon's new python driver offers more reliable + descriptive error codes, it's one more good reason to consider switching.

In any case, I think the execute method, or a method called by it, is probably the right place to implement this logic.
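
As a sketch only (not dbt's actual connection-manager API), query-level retries around an execute call might look something like this, reusing the is_transient helper sketched above:

import time

def execute_with_retry(cursor, sql: str, retries: int = 3, delay: float = 5.0):
    """Run a query, retrying on errors classified as transient by is_transient."""
    attempt = 0
    while True:
        try:
            cursor.execute(sql)
            return cursor
        except Exception as exc:
            attempt += 1
            if attempt > retries or not is_transient(exc):
                raise
            time.sleep(delay)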

@alecrubin

alecrubin commented Jun 11, 2021

@jtcohen6 what if instead of trying to identify all the transient errors for all the different connectors, we start by allowing a list of exceptions to be defined in the profiles.yml that you want to retry on?
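
A rough sketch of that idea: assuming a hypothetical retry_on list of exception class names in profiles.yml, the adapter could resolve those names against psycopg2.errors and retry only on the configured types; the retry_on key and this resolution logic are purely illustrative.

import psycopg2.errors

def resolve_retryable(names: list[str]) -> tuple[type, ...]:
    """Turn configured exception class names into exception types, skipping unknown names."""
    classes = []
    for name in names:
        cls = getattr(psycopg2.errors, name, None)
        if isinstance(cls, type) and issubclass(cls, Exception):
            classes.append(cls)
    return tuple(classes)

# e.g. a hypothetical retry_on: [ConnectionException, AdminShutdown] in profiles.yml
retryable = resolve_retryable(["ConnectionException", "AdminShutdown"])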

@jtcohen6
Contributor

I'm going to leave this in the current repo, but since the required changes are adapter-specific, it's likely that we'll want to open issues in the adapter repos as well.

(In the case of Redshift, too, the change would probably involve connections.py in dbt-postgres, which is still in this repo)

@moltar

moltar commented Nov 5, 2021

Would love this functionality as well.

Here's a specific use case.

We are using RDS Aurora Serverless, which has auto-scaling built-in.

When it starts to scale, it drops existing connections.

And it starts to scale whenever there is load, which is often the case when refreshing lots of dbt models at once.

So dbt gets kicked out, and the entire run fails because of maybe one or two models whose connections were dropped.

Our runs happen every few hours. By then RDS has already scaled back down, so the next time dbt runs, it fails again.

In the end, pretty much every run fails in some way. The only time it doesn't fail is when the cluster was already scaled out because some other job triggered the scaling event.

To work around this we'll implement forced scaling before dbt runs. But I wish there were just a way to retry on certain errors, like disconnects.

@tomasfarias
Contributor

tomasfarias commented May 10, 2022

I stumbled across this issue while debugging some sporadic timeouts when running dbt on our Redshift cluster. I'm commenting to reinforce the need for this feature with my current situation:

From the Redshift connection logs I can see when dbt attempts to connect, and these attempts are successful. However, after about 10 seconds, dbt reports a timeout error (consistent with the default connection_timeout parameter from the postgres adapter). Moreover, the Redshift connection logs also show that the connections dbt started stay up for longer than 10 seconds, which indicates this could be due to transient network problems.

Having the ability to immediately retry when connecting would be very useful for us. For the time being, we are relying on the status:error workaround/feature, but this carries the overhead of restarting our dbt tasks.

Thanks for everything!
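
For what it's worth, here is a sketch of retrying the initial connection itself, which is what these timeouts would call for; the DSN handling is simplified, and the 10-second connect_timeout mirrors the default mentioned above rather than dbt's real connection code:

import time
import psycopg2

def connect_with_retry(dsn: str, retries: int = 3, delay: float = 5.0):
    """Attempt to connect, retrying a few times on operational errors."""
    last_exc = None
    for _ in range(retries + 1):
        try:
            return psycopg2.connect(dsn, connect_timeout=10)
        except psycopg2.OperationalError as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc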

@francescomucio

👍 on this issue, we are hitting this with Databricks too

@xg1990

xg1990 commented Aug 10, 2022

The same problem happened to us (with Databricks).


@jtcohen6 jtcohen6 added the paper_cut A small change that impacts lots of users in their day-to-day label Nov 28, 2022
@raphaelvarieras

Same issue here using sources that are secured views provided through a Snowflake datashare. Once every 24 hours, the views are dropped and re-created. I don't know why it doesn't happen in a single transaction but the net result is that the views are unavailable for a few seconds. If a job relying on that source happens to be running at that time, it will fail and return an error.

Either a retry logic or any other way to handle this situation gracefully would be great!

@kinghuang

Some sort of retry at the model level would be helpful. I have a model in Amazon Aurora PostgreSQL that calls a SageMaker endpoint for ML inferences. Sometimes there are invocation timeouts from the endpoint being too busy, resulting in a failed dbt model. The endpoint autoscales, so simply re-running the model usually works.

I would really like some sort of option that I can apply to specific models, to retry a model N times before considering it failed.

@dharitsura248

Hi, how did you all get through this? We are also receiving a similar error, and it is intermittent as well: the same model works fine the very next time it is triggered.

@yuna-tang

Hi, how did you all get through this? We are facing a similar issue on Redshift when random reboots happen. We would love to have a retry option in dbt Cloud jobs.

Database Error in model data_inspection (models/exposure/tableau/product/data_inspection.sql)
16:28:24 SSL SYSCALL error: EOF detected
16:28:24 compiled Code at target/run/analytics/models/exposure/tableau/product/data_inspection.sql
could not connect to server: Connection refused

@kinghuang

The dbt retry command has been available since dbt 1.6.0. See #7299 and the "About dbt retry command" docs page.

Contributor

github-actions bot commented Apr 2, 2024

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Apr 2, 2024
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned (Won't fix, can't repro, duplicate, stale) Apr 10, 2024
@mroy-seedbox

Our use case is similar to @kinghuang's: we have some specific models that we would like to retry upon failure (any failure). We have UDFs calling external services, so many different transient issues can happen (service issues, network issues, etc.).

The dbt retry command is not ideal, as it involves some overhead to orchestrate the retries.

Ideally, we'd be looking for something like this:

{{ config(
  retries=2,
  retry_delay=30
) }}

The above would mean: if the model fails, wait 30 seconds and then try again, up to a maximum of two times.

It's a fairly simple feature, and it would benefit everyone.
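
To make the proposed semantics concrete, here is a plain-Python sketch of what the hypothetical retries / retry_delay configs would do (these configs are not implemented in dbt; the wrapper is illustration only):

import time

def run_with_retries(run_model, retries: int = 2, retry_delay: float = 30.0):
    """Call run_model(), waiting retry_delay seconds between attempts, up to `retries` retries."""
    for attempt in range(retries + 1):
        try:
            return run_model()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(retry_delay)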
