Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log WorkflowID, RunID, domainName when a workflow times out or gets terminated #4548

Merged
merged 2 commits into from
Oct 6, 2021

Conversation

WToma
Copy link
Contributor

@WToma WToma commented Oct 6, 2021

What changed?
A logging call was added that's going to log a message whenever a workflow is timed out or gets forcefully terminated (but not when it's cancelled or if it finishes in any other way, including normal completion or failure).

Why?
Currently only a metric is logged. Metrics can be used to alert on this condition, but it doesn't make it apparent which workflows are affected. If manual cleanup activities need to be performed (e.g. investigating why the workflow timed out) the operator needs the WorkflowID and/or the RunID. Similar considerations apply for terminations. Although terminations are usually operator-initiated, so it could be expected that the operators already know which workflows are being terminated, I think it's useful to have these in the logs.

How did you test it?
I tested the change by building and running the Cadence server locally. The started a workflow with a 1 second timeout with no workers registered:

./cadence --domain samples-domain workflow start --workflow_id sample-workflow-id-1234 --workflow_type sample-workflow-type --tasklist samples-tasklist --execution_timeout 1

This generated the below message:

{"level":"info","ts":"2021-10-06T13:48:59.793-0700","msg":"workflow execution timed out","service":"cadence-history","shard-id":2,"address":"127.0.0.1:7934","component":"history-cache","wf-id":"sample-workflow-id-1234","wf-run-id":"d536c198-23ea-4ade-aadb-446180746f0c","wf-domain-name":"samples-domain","logging-call-at":"context_util.go:147"}

Potential risks
If, for some reason, workflows time out or get terminated a lot, the high log volume could have adverse effects.

Release notes
N/A

Documentation Changes
N/A

@CLAassistant
Copy link

CLAassistant commented Oct 6, 2021

CLA assistant check
All committers have signed the CLA.

Copy link
Collaborator

@longquanzheng longquanzheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for the contribution!

@longquanzheng longquanzheng merged commit 3cd5166 into uber:master Oct 6, 2021
@coveralls
Copy link

Pull Request Test Coverage Report for Build f3bbe37e-801c-4d65-827a-a2459d8f8430

  • 16 of 16 (100.0%) changed or added relevant lines in 2 files are covered.
  • 8 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.06%) to 56.443%

Files with Coverage Reduction New Missed Lines %
common/task/weightedRoundRobinTaskScheduler.go 2 89.64%
common/types/mapper/thrift/shared.go 2 64.52%
service/history/queue/timer_queue_processor.go 2 59.33%
service/matching/taskListManager.go 2 74.09%
Totals Coverage Status
Change from base Build f530e2da-ff9c-40ec-b55e-5ed7a97a72b1: 0.06%
Covered Lines: 80640
Relevant Lines: 142869

💛 - Coveralls

@WToma WToma deleted the log-timed-out-workflows branch October 6, 2021 22:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants