feat: report action timeout as failed with timeout message #165

Merged 1 commit on Feb 21, 2024

@ddneilson (Contributor) commented Feb 21, 2024

What was the problem/requirement? (What/Why)

Previously, if an OpenJD Action was canceled because it reached its timeout, we reported the action as simply canceled. The update to openjd-sessions 0.5.0 (#160) made timed-out actions report as FAILED, but it did not change the failure message to make clear that the failure was caused by a timeout.

What was the solution? (How)

We mutate the action status when we receive it, overriding the failure message with one that indicates that the action has reached its runtime limit.
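
As a rough illustration of the idea, here is a minimal sketch and not the actual code in this PR. The `ActionState` and `ActionStatus` types below are hypothetical stand-ins defined for the example, not the openjd-sessions classes, and the helper name `report_timeout_as_failed` is made up. It also assumes the timeout is signaled through a distinct state value, and it returns a modified copy rather than mutating in place. Only the message text is taken from the test output in this PR.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class ActionState(Enum):
    """Hypothetical stand-in for the states an action can report."""
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"
    CANCELED = "canceled"
    TIMEOUT = "timeout"


@dataclass(frozen=True)
class ActionStatus:
    """Hypothetical stand-in for the status object received from a session."""
    state: ActionState
    exit_code: Optional[int] = None
    fail_message: Optional[str] = None


# Message text as it appears in the session action shown in this PR.
TIMEOUT_MESSAGE = "TIMEOUT - Exceeded the allotted runtime limit."


def report_timeout_as_failed(status: ActionStatus) -> ActionStatus:
    """Return a status that reports a timeout as FAILED with a timeout message.

    Any status that is not a timeout is returned unchanged.
    """
    if status.state is ActionState.TIMEOUT:
        # Override the state and the failure message; keep the other fields.
        return replace(
            status,
            state=ActionState.FAILED,
            fail_message=TIMEOUT_MESSAGE,
        )
    return status


if __name__ == "__main__":
    timed_out = ActionStatus(state=ActionState.TIMEOUT, exit_code=-9)
    print(report_timeout_as_failed(timed_out))
```

Returning a copy keeps the stand-in dataclass immutable; an in-place update of the received status would serve the same purpose.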

What is the impact of this change?

Customers should find it easier to identify which tasks/actions failed due to a timeout rather than having been canceled for some other reason.

How was this change tested?

I updated a unit test, and also ran the agent against the service with a sleep job that has a runtime limit. Here's a snapshot of one of the resulting session actions:

% aws deadline get-session-action --farm-id $FARM_ID --queue-id $QUEUE_ID --job-id $JOB_ID --session-action-id $SA_ID
{
    "sessionActionId": "sessionaction-f7efc3852c574cc0b6fb92a33ff140d6-2",
    "status": "FAILED",
    "startedAt": "2024-02-21T20:11:15.816000+00:00",
    "endedAt": "2024-02-21T20:12:00.834000+00:00",
    "progressPercent": 73.3333,
    "sessionId": "session-f7efc3852c574cc0b6fb92a33ff140d6",
    "processExitCode": -9,
    "progressMessage": "TIMEOUT - Exceeded the allotted runtime limit.",
    "definition": {
        "taskRun": {
            "taskId": "task-0cee490179da401aa4c5f0327906eb6c-0",
            "stepId": "step-0cee490179da401aa4c5f0327906eb6c",
            "parameters": {}
        }
    }
}
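
For anyone who wants to pick these out programmatically, here is a small sketch of one way to check a response like the one above. The helper name `is_timeout_failure` is hypothetical, and matching on the "TIMEOUT" prefix of `progressMessage` is an assumption based only on this one sample, not a documented contract of the service.

```python
import json

# Assumption: timeout failures can be recognized by this message prefix.
TIMEOUT_PREFIX = "TIMEOUT"


def is_timeout_failure(session_action: dict) -> bool:
    """Return True if the session action failed and its message mentions a timeout."""
    message = session_action.get("progressMessage") or ""
    return (
        session_action.get("status") == "FAILED"
        and message.startswith(TIMEOUT_PREFIX)
    )


# Trimmed copy of the get-session-action output shown above.
response_text = """
{
    "sessionActionId": "sessionaction-f7efc3852c574cc0b6fb92a33ff140d6-2",
    "status": "FAILED",
    "processExitCode": -9,
    "progressMessage": "TIMEOUT - Exceeded the allotted runtime limit."
}
"""

if __name__ == "__main__":
    session_action = json.loads(response_text)
    print(is_timeout_failure(session_action))
```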

Was this change documented?

N/A

Is this a breaking change?

No

gmchale79 previously approved these changes Feb 21, 2024
jericht previously approved these changes Feb 21, 2024
mwiebe previously approved these changes Feb 21, 2024
jericht previously approved these changes Feb 21, 2024
gmchale79 previously approved these changes Feb 21, 2024
Problem:

Previously, if an OpenJD Action was canceled due to a timeout being
reached then we'd report the action as just canceled. The change to
update to openjd-sessions 0.5.0 (#160) made it so that timeout actions
would report as FAILED, but didn't change the failure message to make
it clear that the reason for the failure was a timeout.

Solution:

We mutate the action status when we receive it to override the failure
message with one that indicates that the action has reached its runtime
limit.

Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com>
@ddneilson merged commit ff36123 into mainline on Feb 21, 2024
9 checks passed
@ddneilson deleted the ddneilson/19468 branch on February 21, 2024 23:25
gmchale79 pushed a commit that referenced this pull request Mar 11, 2024
Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com>
Signed-off-by: Graeme McHale <gmchale@amazon.com>
4 participants