Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor!: improve support for ec2 instance workers #143

Merged
merged 2 commits into from
Aug 15, 2024

Conversation

ddneilson
Copy link
Contributor

What was the problem/requirement? (What/Why)

The PosixInstanceWorker and WindowsInstanceWorker classes implement a CMF Worker based on an ec2 host. In both cases they start with an empty AMI image (AL2023 or Win Server 2022) then set up users, install the agent, etc. Tests that need to use an AMI that already has users and the agent installed end up having to hack around a lot of assumptions in these classes, or reimplement everything for an ec2-based worker from scratch.

What was the solution? (How)

Refactor the PosixInstanceWorker and WindowsInstanceWorker to split them into os-specific base classes that contain all of the generic worker implementation stuff (starting the worker, getting its id, etc), and a derived class that implements the user-setup, agent install, and the like. This makes space for tests that use an AMI with the users & agent pre-setup to derive from the os-specific base classes to get the generic ec2-worker test functionality.

What is the impact of this change?

Downstream consumers no longer need to resort to hacky workarounds to test worker configurations that have the agent & host setup baked into an AMI

How was this change tested?

I've run this with the deadline-cloud-worker-agent e2e tests; the windows & linux e2e tests, and the cross-os tests under each.

I've also run this against internal testing infrastructure that is providing a pre-built AL2023 and WindowsServer2022 AMI to the testing infrastructure.

Everything passes everywhere. The deadline-cloud-worker-agent e2e tests need no modifications to take on these changes.

Was this change documented?

No

Is this a breaking change?

BREAKING CHANGE: PosixInstanceWorker has been renamed to PosixInstanceBuildWorker, and WindowsInstanceWorker has been renamed to WindowsInstanceBuildWorker.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@@ -434,7 +436,7 @@ def _start_worker_agent(self) -> None:
[
"echo 'Running Get-Process to check if the agent is running'",
'for($i=1; $i -le 30 -and "" -ne $err ; $i++){sleep $i; Get-Process pythonservice -ErrorVariable err}',
"IF(Get-Process pythonservice){echo 'service is running'}ELSE{exit 1}",
"IF(Get-Process pythonservice){echo '+++SERVICE IS RUNNING+++'}ELSE{echo '+++SERVICE NOT RUNNING+++'; Get-Content -Encoding utf8 C:\ProgramData\Amazon\Deadline\Logs\worker-agent-bootstrap.log,C:\ProgramData\Amazon\Deadline\Logs\worker-agent.log; exit 1}",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found failures in this script to be challenging to debug if the agent doesn't start. The reason was typically something discoverable in the agent logs, so I've added outputting the logs on error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏 👏 👏 This is so handy


cmds = [
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All "install agent & setup agent user" stuff has been moved out of the base class and into the derived class.

@@ -492,44 +479,6 @@ def configure_worker_command(self, *, config: DeadlineWorkerConfiguration) -> st

return "; ".join(cmds)

def userdata(self, s3_files) -> str:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

userdata is exclusively in the derived class

@@ -543,14 +492,6 @@ def stop_worker_service(self):

assert cmd_result.exit_code == 0, f"Failed to stop Worker Agent service: : {cmd_result}"

def ami_ssm_param_name(self) -> str:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exclusively in the derived class now.

def configure_worker_command(self, *, config: DeadlineWorkerConfiguration) -> str:
"""Get the command to configure the Worker. This must be run as Administrator."""
cmds = [
"Set-PSDebug -trace 1",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The debug tracing is an intentional addition. If this ssm command fails, then it's challenging to understand why without it.


return "; ".join(cmds)

def userdata(self, s3_files) -> str:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No changes here. Just a copy-over.

],
),
# If the worker didn't start, then print out the agent logs that exist to aid in debugging.
"if test $? -ne 0; then echo '+++AGENT NOT RUNNING+++'; cat /var/log/amazon/deadline/worker-agent-bootstrap.log /var/log/amazon/deadline/worker-agent.log; exit 1; fi",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to make it much easier to diagnose issues from the logs when the agent fails to start.

@ddneilson ddneilson marked this pull request as ready for review August 15, 2024 18:47
@ddneilson ddneilson requested a review from a team as a code owner August 15, 2024 18:47
Problem:

The PosixInstanceWorker and WindowsInstanceWorker classes implement a
CMF Worker based on an ec2 host. In both cases they start with an empty
AMI image (AL2023 or Win Server 2022) then set up users, install the
agent, etc. Tests that need to use an AMI that already has users and the
agent installed end up having to hack around a lot of assumptions in
these classes, or reimplement everything for an ec2-based worker from
scratch.

Solution:

Refactor the PosixInstanceWorker and WindowsInstanceWorker to split them
into os-specific base classes that contain all of the generic worker
implementation stuff (starting the worker, getting its id, etc), and a
derived class that implements the user-setup, agent install, and the
like. This makes space for tests that use an AMI with the users & agent
pre-setup to derive from the os-specific base classes to get the generic
ec2-worker test functionality.

BREAKING CHANGE: PosixInstanceWorker has been renamed to
PosixInstanceBuildWorker, and WindowsInstanceWorker has been renamed to
WindowsInstanceBuildWorker.

Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com>
Copy link
Contributor

@AWS-Samuel AWS-Samuel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small nit

src/deadline_test_fixtures/deadline/worker.py Outdated Show resolved Hide resolved
AWS-Samuel

This comment was marked as spam.

YutongLi291
YutongLi291 previously approved these changes Aug 15, 2024
@ddneilson ddneilson dismissed stale reviews from YutongLi291 and AWS-Samuel via 5bc0bcb August 15, 2024 20:58
Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com>

Co-authored-by: Samuel Anderson <119458760+AWS-Samuel@users.noreply.github.com>
Copy link

sonarcloud bot commented Aug 15, 2024

@YutongLi291 YutongLi291 self-requested a review August 15, 2024 21:01
@ddneilson ddneilson merged commit afbc5fb into aws-deadline:mainline Aug 15, 2024
15 checks passed
@ddneilson ddneilson deleted the refactor_for_preinstall branch August 15, 2024 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants