-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor!: improve support for ec2 instance workers #143
refactor!: improve support for ec2 instance workers #143
Conversation
@@ -434,7 +436,7 @@ def _start_worker_agent(self) -> None: | |||
[ | |||
"echo 'Running Get-Process to check if the agent is running'", | |||
'for($i=1; $i -le 30 -and "" -ne $err ; $i++){sleep $i; Get-Process pythonservice -ErrorVariable err}', | |||
"IF(Get-Process pythonservice){echo 'service is running'}ELSE{exit 1}", | |||
"IF(Get-Process pythonservice){echo '+++SERVICE IS RUNNING+++'}ELSE{echo '+++SERVICE NOT RUNNING+++'; Get-Content -Encoding utf8 C:\ProgramData\Amazon\Deadline\Logs\worker-agent-bootstrap.log,C:\ProgramData\Amazon\Deadline\Logs\worker-agent.log; exit 1}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found failures in this script to be challenging to debug if the agent doesn't start. The reason was typically something discoverable in the agent logs, so I've added outputting the logs on error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏 👏 👏 This is so handy
|
||
cmds = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All "install agent & setup agent user" stuff has been moved out of the base class and into the derived class.
@@ -492,44 +479,6 @@ def configure_worker_command(self, *, config: DeadlineWorkerConfiguration) -> st | |||
|
|||
return "; ".join(cmds) | |||
|
|||
def userdata(self, s3_files) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
userdata is exclusively in the derived class
@@ -543,14 +492,6 @@ def stop_worker_service(self): | |||
|
|||
assert cmd_result.exit_code == 0, f"Failed to stop Worker Agent service: : {cmd_result}" | |||
|
|||
def ami_ssm_param_name(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is exclusively in the derived class now.
def configure_worker_command(self, *, config: DeadlineWorkerConfiguration) -> str: | ||
"""Get the command to configure the Worker. This must be run as Administrator.""" | ||
cmds = [ | ||
"Set-PSDebug -trace 1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The debug tracing is an intentional addition. If this ssm command fails, then it's challenging to understand why without it.
|
||
return "; ".join(cmds) | ||
|
||
def userdata(self, s3_files) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No changes here. Just a copy-over.
], | ||
), | ||
# If the worker didn't start, then print out the agent logs that exist to aid in debugging. | ||
"if test $? -ne 0; then echo '+++AGENT NOT RUNNING+++'; cat /var/log/amazon/deadline/worker-agent-bootstrap.log /var/log/amazon/deadline/worker-agent.log; exit 1; fi", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added to make it much easier to diagnose issues from the logs when the agent fails to start.
Problem: The PosixInstanceWorker and WindowsInstanceWorker classes implement a CMF Worker based on an ec2 host. In both cases they start with an empty AMI image (AL2023 or Win Server 2022) then set up users, install the agent, etc. Tests that need to use an AMI that already has users and the agent installed end up having to hack around a lot of assumptions in these classes, or reimplement everything for an ec2-based worker from scratch. Solution: Refactor the PosixInstanceWorker and WindowsInstanceWorker to split them into os-specific base classes that contain all of the generic worker implementation stuff (starting the worker, getting its id, etc), and a derived class that implements the user-setup, agent install, and the like. This makes space for tests that use an AMI with the users & agent pre-setup to derive from the os-specific base classes to get the generic ec2-worker test functionality. BREAKING CHANGE: PosixInstanceWorker has been renamed to PosixInstanceBuildWorker, and WindowsInstanceWorker has been renamed to WindowsInstanceBuildWorker. Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com>
3e26834
to
e20a836
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small nit
Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com> Co-authored-by: Samuel Anderson <119458760+AWS-Samuel@users.noreply.github.com>
Quality Gate passedIssues Measures |
What was the problem/requirement? (What/Why)
The PosixInstanceWorker and WindowsInstanceWorker classes implement a CMF Worker based on an ec2 host. In both cases they start with an empty AMI image (AL2023 or Win Server 2022) then set up users, install the agent, etc. Tests that need to use an AMI that already has users and the agent installed end up having to hack around a lot of assumptions in these classes, or reimplement everything for an ec2-based worker from scratch.
What was the solution? (How)
Refactor the PosixInstanceWorker and WindowsInstanceWorker to split them into os-specific base classes that contain all of the generic worker implementation stuff (starting the worker, getting its id, etc), and a derived class that implements the user-setup, agent install, and the like. This makes space for tests that use an AMI with the users & agent pre-setup to derive from the os-specific base classes to get the generic ec2-worker test functionality.
What is the impact of this change?
Downstream consumers no longer need to resort to hacky workarounds to test worker configurations that have the agent & host setup baked into an AMI
How was this change tested?
I've run this with the
deadline-cloud-worker-agent
e2e tests; the windows & linux e2e tests, and the cross-os tests under each.I've also run this against internal testing infrastructure that is providing a pre-built AL2023 and WindowsServer2022 AMI to the testing infrastructure.
Everything passes everywhere. The
deadline-cloud-worker-agent
e2e tests need no modifications to take on these changes.Was this change documented?
No
Is this a breaking change?
BREAKING CHANGE: PosixInstanceWorker has been renamed to PosixInstanceBuildWorker, and WindowsInstanceWorker has been renamed to WindowsInstanceBuildWorker.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.