
Retry on transient errors #3303

Closed
jweibel22 opened this issue Apr 28, 2021 · 15 comments
Labels
enhancement (New feature or request) · paper_cut (A small change that impacts lots of users in their day-to-day) · postgres · stale (Issues that have gone stale) · Team:Adapters (Issues designated for the adapter area of the code)

Comments

@jweibel22
Contributor

Describe the feature

This has come up here before. Sometimes transient errors occur, and it would be nice if dbt could automatically retry on those occasions. Right now, those transient errors cause our nightly dbt pipeline to fail, which blocks downstream pipelines.

Specifically, we're dealing with some hard-to-track DNS resolution problems in our network setup, which cause intermittent flakiness.

Describe alternatives you've considered

Our dbt pipeline is run from an Airflow DAG, and the only way to retry is to re-run the DAG, which runs the entire dbt pipeline. We could implement better support for running only specific models in our production environment so we could fix the problem faster, but that would still require manual work and cause a significant delay, since it wouldn't happen until someone notices the problem in the morning.

Additional context

We're using Redshift, but the problem is not specific to any one database.

Who will this benefit?

Everyone who deals with networking and other operational issues.

Are you interested in contributing this feature?

We don't mind contributing to this.
The first problem is identifying which errors are transient (worth retrying) and which are not: https://www.psycopg.org/docs/errors.html
It might be a good idea to leave this decision to the user and let them configure a retry strategy (while providing a sensible default).
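
As a starting point, here is a rough sketch of what transient-error classification for psycopg2 might look like; the SQLSTATE classes chosen here (08 connection exceptions, 53 insufficient resources, 57P01 admin_shutdown) are illustrative assumptions, not an exhaustive or agreed-upon list:

import psycopg2

# SQLSTATE prefixes / codes treated as transient in this sketch (assumption).
RETRYABLE_SQLSTATE_PREFIXES = ("08", "53")   # connection exception, insufficient resources
RETRYABLE_SQLSTATES = {"57P01"}              # admin_shutdown (e.g. node restart)

def is_transient(exc: Exception) -> bool:
    """Best-effort guess at whether a psycopg2 error is worth retrying."""
    code = getattr(exc, "pgcode", None)
    if isinstance(exc, psycopg2.OperationalError) and code is None:
        # Connection-level failures (DNS resolution, dropped sockets) often carry no SQLSTATE.
        return True
    if code is None:
        return False
    return code in RETRYABLE_SQLSTATES or code.startswith(RETRYABLE_SQLSTATE_PREFIXES)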

@jweibel22 jweibel22 added enhancement New feature or request triage labels Apr 28, 2021
@jtcohen6 jtcohen6 added redshift and removed triage labels May 10, 2021
@jtcohen6
Contributor

jtcohen6 commented May 10, 2021

@jweibel22 Thanks for opening!

I see two ways in which dbt could "retry" on transient errors:

  1. Query-level: If a query returns a transient error from the database, dbt's connection logic should handle that error and retry the query.
  2. Invocation-level: If a single dbt model failed, rather than re-running the entire DAG, it should be possible to select and re-run just the models with status:error or status:skipped in run_results.json (see #2465, "Add pseudo selectors that select models based on artifact states").

I think the first is preferable, if possible. Ultimately, both are worth doing.
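
For the invocation-level approach (option 2), a minimal sketch that reads run_results.json and collects the models that errored or were skipped, so an orchestrator could re-select just those; the artifact path and status values are assumptions based on current run_results.json output:

import json

def models_to_retry(path: str = "target/run_results.json") -> list[str]:
    """Return unique_ids of models that errored or were skipped in the last run."""
    with open(path) as f:
        results = json.load(f)["results"]
    return [r["unique_id"] for r in results if r.get("status") in ("error", "skipped")]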

The mileage will vary significantly by adapter; as you say, the first and biggest challenge is reliably identifying which errors are transient and worth retrying. In fact, dbt-bigquery already has logic for this today, since google.api_core offers a retry helper (docs).

https://github.com/fishtown-analytics/dbt/blob/8d39ef16b67fcc3c30e8a18751b4e64831ec9aaf/plugins/bigquery/dbt/adapters/bigquery/connections.py#L44-L49

https://github.com/fishtown-analytics/dbt/blob/8d39ef16b67fcc3c30e8a18751b4e64831ec9aaf/plugins/bigquery/dbt/adapters/bigquery/connections.py#L337

If we can identify similar error codes for psycopg2, we could implement this for dbt-postgres + dbt-redshift as well. (I found a thread here that may hold promise.) Or, if Amazon's new python driver offers more reliable + descriptive error codes, it's one more good reason to consider switching.

In any case, I think the execute method, or a method called by it, is probably the right place to implement this logic.
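
As a sketch only (not dbt's actual connection-manager API), query-level retries around an execute call might look something like this, reusing the is_transient helper sketched above:

import time

def execute_with_retry(cursor, sql: str, retries: int = 3, delay: float = 5.0):
    """Run a query, retrying on errors classified as transient by is_transient."""
    attempt = 0
    while True:
        try:
            cursor.execute(sql)
            return cursor
        except Exception as exc:
            attempt += 1
            if attempt > retries or not is_transient(exc):
                raise
            time.sleep(delay)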

@alecrubin

alecrubin commented Jun 11, 2021

@jtcohen6 what if instead of trying to identify all the transient errors for all the different connectors, we start by allowing a list of exceptions to be defined in the profiles.yml that you want to retry on?
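
A rough sketch of that idea: assuming a hypothetical retry_on list of exception class names in profiles.yml, the adapter could resolve those names against psycopg2.errors and retry only on the configured types; the retry_on key and this resolution logic are purely illustrative.

import psycopg2.errors

def resolve_retryable(names: list[str]) -> tuple[type, ...]:
    """Turn configured exception class names into exception types, skipping unknown names."""
    classes = []
    for name in names:
        cls = getattr(psycopg2.errors, name, None)
        if isinstance(cls, type) and issubclass(cls, Exception):
            classes.append(cls)
    return tuple(classes)

# e.g. a hypothetical retry_on: [ConnectionException, AdminShutdown] in profiles.yml
retryable = resolve_retryable(["ConnectionException", "AdminShutdown"])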

@jtcohen6
Contributor

I'm going to leave this in the current repo, but since the required changes are adapter-specific, it's likely that we'll want to open issues in the adapter repos as well.

(In the case of Redshift, too, the change would probably involve connections.py in dbt-postgres, which is still in this repo)

@moltar

moltar commented Nov 5, 2021

Would love this functionality as well.

Here's a specific use case.

We are using RDS Aurora Serverless, which has auto-scaling built-in.

When it starts to scale, it drops existing connections.

And it starts to scale whenever there is load, which is often the case when refreshing lots of dbt models at once.

So dbt gets kicked out, and the entire run fails because of maybe one or two models whose connections were dropped.

Our runs happen every few hours. By then RDS has already scaled back down, so the next time dbt runs, it fails again.

In the end, pretty much every run fails in some way. The only time it doesn't fail is when the cluster was already scaled out because some other job triggered the scaling event.

To work around this we'll implement forced scaling before dbt runs. But I wish there were just a way to retry on certain errors, like disconnects.

@tomasfarias
Contributor

tomasfarias commented May 10, 2022

I stumbled across this issue while debugging some sporadic timeouts when running dbt on our Redshift cluster. I'm commenting to reinforce the need for this feature with my current situation:

From the Redshift connection logs I can see when dbt attempts to connect, and these attempts are successful. However, after about 10 seconds, dbt reports a timeout error (consistent with the default connection_timeout parameter from the postgres adapter). Moreover, the Redshift connection logs also show that the connections dbt started stay up for longer than 10 seconds, which indicates this could be due to transient network problems.

Having the ability to immediately retry when connecting would be very useful for us. For the time being, we are relying on the status:error workaround/feature, but this carries the overhead of restarting our dbt tasks.

Thanks for everything!
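
For what it's worth, here is a sketch of retrying the initial connection itself, which is what these timeouts would call for; the DSN handling is simplified, and the 10-second connect_timeout mirrors the default mentioned above rather than dbt's real connection code:

import time
import psycopg2

def connect_with_retry(dsn: str, retries: int = 3, delay: float = 5.0):
    """Attempt to connect, retrying a few times on operational errors."""
    last_exc = None
    for _ in range(retries + 1):
        try:
            return psycopg2.connect(dsn, connect_timeout=10)
        except psycopg2.OperationalError as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc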

@francescomucio

👍 on this issue, we are hitting this with Databricks too

@xg1990

xg1990 commented Aug 10, 2022

The same problem happened to us (with Databricks).


@jtcohen6 jtcohen6 added the paper_cut A small change that impacts lots of users in their day-to-day label Nov 28, 2022
@raphaelvarieras

Same issue here using sources that are secured views provided through a Snowflake datashare. Once every 24 hours, the views are dropped and re-created. I don't know why it doesn't happen in a single transaction but the net result is that the views are unavailable for a few seconds. If a job relying on that source happens to be running at that time, it will fail and return an error.

Either a retry logic or any other way to handle this situation gracefully would be great!

@kinghuang

Some sort of retry at the model level would be helpful. I have a model in Amazon Aurora PostgreSQL that calls a SageMaker endpoint for ML inferences. Sometimes there are invocation timeouts from the endpoint being too busy, resulting in a failed dbt model. The endpoint autoscales, so simply re-running the model usually works.

I would really like some sort of option that I can apply to specific models, to retry a model N times before considering it failed.

@dharitsura248

Hi, how did you all get through this? We are also receiving a similar error, and it is intermittent as well: the same model works fine the very next time it is triggered.

@yuna-tang

Hi, how did you all get through this? We are facing a similar issue on Redshift when random reboots happen. We would love to have a retry option in dbt Cloud jobs.

Database Error in model data_inspection (models/exposure/tableau/product/data_inspection.sql)
16:28:24 SSL SYSCALL error: EOF detected
16:28:24 compiled Code at target/run/analytics/models/exposure/tableau/product/data_inspection.sql
could not connect to server: Connection refused

@kinghuang

The dbt retry command has been available since dbt 1.6.0. See #7299 and the "About dbt retry command" docs page.

Contributor

github-actions bot commented Apr 2, 2024

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Apr 2, 2024
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned (Won't fix, can't repro, duplicate, stale) Apr 10, 2024
@mroy-seedbox

Our use case is similar to @kinghuang's: we have some specific models that we would like to retry upon failure (any failure). We have UDFs calling external services, so many different transient issues can happen (service issues, network issues, etc.).

The dbt retry command is not ideal, as it involves some overhead to orchestrate the retries.

Ideally, we'd be looking for something like this:

{{ config(
  retries=2,
  retry_delay=30
) }}

The above would mean: if the model fails, wait 30 seconds and then try again, up to a maximum of two times.

It's a fairly simple feature, and it would benefit everyone.
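
To make the proposed semantics concrete, here is a plain-Python sketch of what the hypothetical retries / retry_delay configs would do (these configs are not implemented in dbt; the wrapper is illustration only):

import time

def run_with_retries(run_model, retries: int = 2, retry_delay: float = 30.0):
    """Call run_model(), waiting retry_delay seconds between attempts, up to `retries` retries."""
    for attempt in range(retries + 1):
        try:
            return run_model()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(retry_delay)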
