[Bug] Exit codes do not match documentation #4479

Closed
1 task done
moltar opened this issue Dec 14, 2021 · 6 comments
Labels: awaiting_response, bug (Something isn't working), stale (Issues that have gone stale)

Comments

@moltar

moltar commented Dec 14, 2021

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

SSL connection has been closed unexpectedly

exit status 1

Expected Behavior

As documented: https://docs.getdbt.com/reference/exit-codes

2 The dbt invocation completed with an unhandled error (eg. ctrl-c, network interruption, etc).

Steps To Reproduce

No response

Relevant log output

screenshot-20211214T110247-ef5zmmMi

Environment

- OS: public.ecr.aws/bitnami/python:3.8-prod (Docker image)
- Python: 3.8
- dbt: 0.21.0

What database are you using dbt with?

postgres

Additional Context

  • Running inside AWS CodeBuild
  • Database is RDS Aurora Serverless
  • Database closes the connection due to auto-scaling
@moltar added the bug and triage labels Dec 14, 2021
@iknox-fa
Contributor

iknox-fa commented Jan 3, 2022

Hi @moltar, thanks for reaching out with your question. I believe this sort of error is considered "handled", in that the entire run command completed even though a portion (in this case a single node) did have a network error.

We can certainly make the documentation clearer. To better understand how our users utilize exit codes, can you explain how the exit codes affect your use case? It seems that all potential outcomes are programmatically available.

@iknox-fa iknox-fa removed the triage label Jan 3, 2022
@moltar
Author

moltar commented Jan 4, 2022

We can certainly make the documentation clearer. To better understand how our users utilize exit codes, can you explain how the exit codes affect your use case? It seems that all potential outcomes are programmatically available.

We are orchestrating dbt via step functions, and I wanted to instruct a step function to retry the operation if an exit code matched a pattern.

@jtcohen6
Contributor

jtcohen6 commented Jan 4, 2022

@moltar That's useful context! A few quick thoughts from me:

If dbt encounters a handled error (exit code 1) affecting one or more nodes, in which the overall invocation still completes, dbt will write a results artifact (run_results.json, docs) that includes much more detailed information about every node that ran, whether it succeeded, and its specific error message. You could parse that artifact to determine whether the error message warrants a retry—in fact, dbt can do it for you, as of v1, using the stateful result: node selector (docs). Fun fact: If you use --fail-fast, this will "interrupt" the invocation as soon as a node fails, so dbt won't write run_results.json and will return exit code 2.
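As a rough illustration of the artifact-parsing idea, here is a minimal Python sketch. It assumes `run_results.json` has a top-level `results` list whose entries carry `unique_id`, `status`, and `message` fields; check this against the artifact schema for your dbt version. The helper names and the marker list are hypothetical, not part of dbt.

```python
import json
from pathlib import Path

# Assumption: substrings that mark an error message as transient.
# Tune this list for your own database and driver.
RETRYABLE_MARKERS = ("SSL connection has been closed unexpectedly",)


def load_run_results(path="target/run_results.json"):
    """Read the results artifact from disk (path is an assumption)."""
    return json.loads(Path(path).read_text())


def retryable_failures(run_results, markers=RETRYABLE_MARKERS):
    """Return unique_ids of errored nodes whose message looks transient."""
    retryable = []
    for result in run_results.get("results", []):
        if result.get("status") != "error":
            continue
        message = result.get("message") or ""
        if any(marker in message for marker in markers):
            retryable.append(result.get("unique_id"))
    return retryable
```

An orchestrator step could call `retryable_failures(load_run_results())` and retry only when the returned list is non-empty.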

That's all at the level of the invocation. We've also been discussing (#3303) better handling at the node/query level for transient/intermittent errors, such as SSL connection has been closed unexpectedly, that may succeed if retried. In this case, dbt would catch that error from the database cursor, identify it as retryable, and run the same query again. Only if it failed on each of X retries would dbt return the handled error and exit code 1.
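The query-level retry described above could look roughly like the following. This is a generic Python sketch of the pattern, not dbt's implementation; `run_with_retries`, its parameters, and the marker list are all hypothetical.

```python
import time

# Assumption: error-message substrings treated as retryable.
RETRYABLE_MARKERS = ("SSL connection has been closed unexpectedly",)


def run_with_retries(execute, max_attempts=3, delay_seconds=0.0,
                     markers=RETRYABLE_MARKERS):
    """Call execute(); retry only when the error message looks transient."""
    for attempt in range(1, max_attempts + 1):
        try:
            return execute()
        except Exception as exc:
            transient = any(marker in str(exc) for marker in markers)
            # Re-raise non-transient errors and the final failed attempt.
            if not transient or attempt == max_attempts:
                raise
            time.sleep(delay_seconds)
```

Only when every attempt fails does the error propagate, which matches the "failed on each of X retries" behaviour sketched in the comment.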

@moltar
Author

moltar commented Jan 6, 2022

@jtcohen6 thank you for providing this excellent summary!

dbt will write a results artifact (run_results.json, docs)

Our problem is that we trigger the dbt job via a Step Function, then monitor and retry inside SFN, which does not have access to the results file.

There are workarounds, of course: since we store the artifacts, we can read them in another step and try to figure out what caused the error.

But I thought going by the exit code would be the easiest as this info is already exposed to the SFN execution context and can be used in the step definitions.

in fact, dbt can do it for you, as of v1, using the stateful result: node selector (docs). Fun fact: If you use --fail-fast, this will "interrupt" the invocation as soon as a node fails, so dbt won't write run_results.json and will return exit code 2.

This might actually be what we need!

We can look for exit code 2 and retry if that is the case, or fail if it's something else.
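A minimal Python sketch of that exit-code rule, assuming the codes described in this thread (0 for success, 1 for a handled error, 2 for an interrupted or unhandled run). The function names and the injectable `runner` are hypothetical. Note that, per the description above, `--fail-fast` returns 2 for any node failure, so this retries non-transient failures too; the attempt cap bounds that.

```python
import subprocess


def classify_exit_code(code):
    """Map a dbt exit code to an action, per the rule in this thread."""
    if code == 0:
        return "success"
    if code == 2:
        return "retry"
    return "fail"


def run_dbt_with_retry(args, max_attempts=3, runner=None):
    """Run dbt up to max_attempts times; return the last exit code."""
    if runner is None:
        # Default runner shells out to the dbt CLI.
        runner = lambda cmd: subprocess.run(cmd).returncode
    code = None
    for _ in range(max_attempts):
        code = runner(["dbt", *args])
        if classify_exit_code(code) != "retry":
            break
    return code
```

For example, `run_dbt_with_retry(["run", "--fail-fast"])` would re-run the invocation while it keeps exiting with 2, up to three times.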

@iknox-fa iknox-fa self-assigned this Jan 12, 2022
@iknox-fa iknox-fa removed their assignment Apr 11, 2022
@github-actions
Contributor

github-actions bot commented Oct 9, 2022

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions bot added the stale label Oct 9, 2022
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.


3 participants