[Bug] Exit codes do not match documentation #4479

moltar · 2021-12-14T04:06:50Z

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

SSL connection has been closed unexpectedly

exit status 1

Expected Behavior

As documented: https://docs.getdbt.com/reference/exit-codes

Exit code 2: The dbt invocation completed with an unhandled error (eg. ctrl-c, network interruption, etc).

Steps To Reproduce

No response

Relevant log output

Environment

- OS: public.ecr.aws/bitnami/python:3.8-prod (Docker image)
- Python: 3.8
- dbt: 0.21.0

What database are you using dbt with?

postgres

Additional Context

Running inside AWS CodeBuild
Database is RDS Aurora Serverless
Database closes the connection due to auto-scaling

iknox-fa · 2022-01-03T16:59:40Z

Hi @moltar, thanks for reaching out with your question. I believe this sort of error is considered "handled" in that the entire run command completed, even though a portion (in this case a single node) did have a network error.

We can certainly make the documentation clearer. To better understand how our users use exit codes, can you explain how the exit codes affect your use case? It seems that all potential outcomes are programmatically available.

moltar · 2022-01-04T09:44:26Z

> We can certainly make the documentation more clear-- to better understand how our users utilize exit codes, can you explain how the exit codes affect your use case as it seems that all potential outcomes are programmatically available?

We are orchestrating dbt via step functions, and I wanted to instruct a step function to retry the operation if an exit code matched a pattern.
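A minimal sketch of that idea, for illustration only: a wrapper that runs a command and retries while its exit code is in a caller-supplied set. The function name `run_with_retry` and its parameters are hypothetical, and which exit codes should count as retryable is an assumption left to the caller.

```python
import subprocess
import time


def run_with_retry(cmd, retryable_codes, max_attempts=3, delay_seconds=0.0):
    """Run cmd, retrying while its exit code is in retryable_codes.

    Returns the exit code of the last attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    code = 0
    for attempt in range(1, max_attempts + 1):
        # subprocess.run waits for the command and exposes its exit code.
        code = subprocess.run(cmd).returncode
        if code not in retryable_codes:
            return code
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return code
```

It could be called as `run_with_retry(["dbt", "run"], retryable_codes={2})`, with the returned code passed on to the orchestrator so that a non-retryable failure still surfaces.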

jtcohen6 · 2022-01-04T14:52:26Z

@moltar That's useful context! A few quick thoughts from me:

If dbt encounters a handled error (exit code 1) affecting one or more nodes, in which the overall invocation still completes, dbt will write a results artifact (run_results.json, docs) that includes much more detailed information about every node that ran, whether it succeeded, and its specific error message. You could parse that artifact to determine whether the error message warrants a retry—in fact, dbt can do it for you, as of v1, using the stateful result: node selector (docs). Fun fact: If you use --fail-fast, this will "interrupt" the invocation as soon as a node fails, so dbt won't write run_results.json and will return exit code 2.
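Parsing the artifact might look roughly like the sketch below. It assumes `run_results.json` has a top-level `results` list whose entries carry `unique_id`, `status`, and `message`, and that failed nodes have status `error` or `fail`; check the artifact schema for your dbt version. The helper names and the list of transient markers are hypothetical.

```python
import json
from pathlib import Path

# Assumption: substrings that mark an error message as transient.
TRANSIENT_MARKERS = ("SSL connection has been closed unexpectedly",)


def load_run_results(path="target/run_results.json"):
    """Load the results artifact from dbt's target directory (path is an assumption)."""
    return json.loads(Path(path).read_text())


def failed_nodes(run_results):
    """Return (unique_id, message) for every result with status "error" or "fail"."""
    return [
        (result.get("unique_id"), result.get("message") or "")
        for result in run_results.get("results", [])
        if result.get("status") in ("error", "fail")
    ]


def should_retry(run_results, markers=TRANSIENT_MARKERS):
    """True if at least one node failed and every failure looks transient."""
    failures = failed_nodes(run_results)
    return bool(failures) and all(
        any(marker in message for marker in markers) for _, message in failures
    )
```

Requiring every failure to look transient is a conservative choice: a single genuine model error then stops the retry instead of being re-run repeatedly.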

That's all at the level of the invocation. We've also been discussing (#3303) better handling at the node/query level for transient/intermittent errors, such as SSL connection has been closed unexpectedly, that may succeed if retried. In this case, dbt would catch that error from the database cursor, identify it as retryable, and run the same query again. Only if it failed on each of X retries would dbt return the handled error and exit code 1.
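To make the query-level idea concrete, here is a generic sketch of retrying a callable when its error looks transient. This is not dbt's implementation; `execute_with_retry`, `is_retryable`, and the marker list are hypothetical names, and matching on the error text is an assumption made for illustration.

```python
import time

# Assumption: error text that marks a failure as transient.
RETRYABLE_MARKERS = ("SSL connection has been closed unexpectedly",)


def is_retryable(exc):
    """Classify an exception as retryable by looking at its message."""
    return any(marker in str(exc) for marker in RETRYABLE_MARKERS)


def execute_with_retry(run_query, max_attempts=3, delay_seconds=0.0):
    """Call run_query, retrying on retryable errors up to max_attempts times.

    Non-retryable errors, and the error from the final attempt, are re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except Exception as exc:
            if not is_retryable(exc) or attempt == max_attempts:
                raise
            time.sleep(delay_seconds)
```

Only when the last attempt also fails does the error propagate, which corresponds to the point at which the handled error and exit code 1 would be reported.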

moltar · 2022-01-06T04:14:26Z

@jtcohen6 thank you for providing this excellent summary!

> dbt will write a results artifact (run_results.json, docs)

Our problem is that we trigger the dbt job via a Step Function (SFN), monitor the SFN execution, and retry inside SFN, which does not have access to the results file.

There are workarounds we can do, ofc, since we are storing artifacts, so we can just read it in another step and try to figure out what caused the error.

But I thought going by the exit code would be the easiest as this info is already exposed to the SFN execution context and can be used in the step definitions.

> in fact, dbt can do it for you, as of v1, using the stateful result: node selector (docs). Fun fact: If you use --fail-fast, this will "interrupt" the invocation as soon as a node fails, so dbt won't write run_results.json and will return exit code 2.

This might actually be what we need!!

We can look for 2 and then retry more, if this is the case, or fail if it's something else.

github-actions · 2022-10-09T02:14:49Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

github-actions · 2022-10-16T02:15:13Z

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.

moltar added the bug and triage labels Dec 14, 2021

jtcohen6 added the Team: Execution label Dec 14, 2021

iknox-fa removed the triage label Jan 3, 2022

iknox-fa self-assigned this Jan 12, 2022

iknox-fa added the awaiting_response label Apr 11, 2022

iknox-fa removed their assignment Apr 11, 2022

github-actions bot added the stale label Oct 9, 2022

github-actions bot closed this as completed Oct 16, 2022
