
feat: Switch to running deadline_vfs as os_user #223

Merged (12 commits) on Mar 22, 2024
Conversation

@baxeaz (Contributor) commented Mar 21, 2024

What was the problem/requirement? (What/Why)

We're currently running deadline_vfs as the privileged "deadline-worker" user rather than "job-user".

This change switches to launching the VFS as os_user (corresponding to job-user in our SMFs).

What was the solution? (How)

Switch to launching as os_user (job-user)

Add os_group permissions to manifest so the manifest is readable by the VFS which is now run as job-user.
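The group-permission change on the manifest could be sketched like this: a minimal stdlib sketch assuming POSIX permissions, with a hypothetical helper name (the real manifest handling lives in the job attachments code):

```python
import os
import stat
import tempfile

def make_group_readable(path: str) -> None:
    # Add the group-read bit so a process running as a user that shares the
    # file's group (e.g. job-user) can read the manifest.
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IRGRP)

# Stand-in for a written manifest file (illustrative only)
fd, manifest_path = tempfile.mkstemp(suffix=".manifest")
os.close(fd)
os.chmod(manifest_path, 0o600)       # owner-only to start
make_group_readable(manifest_path)   # owner rw, group r -> 0o640
print(oct(os.stat(manifest_path).st_mode & 0o777))  # 0o640
```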

Shut down the VFS as job-user, and simplify some of our shutdown code.
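The "launch as os_user" mechanism can be sketched roughly as below. This is a hedged illustration, not the PR's actual code: the helper name `build_vfs_command` and the `sudo -u` wrapping are assumptions about how one would run a child process as another user.

```python
import subprocess  # Popen would be used to actually launch the process
from typing import List

def build_vfs_command(executable: str, mount_point: str, os_user: str) -> List[str]:
    # Hypothetical helper: wrap the VFS invocation in `sudo -u <os_user>` so
    # the process runs as the job's user rather than the worker agent's user.
    return ["sudo", "-u", os_user, executable, mount_point]

cmd = build_vfs_command("/opt/deadline_vfs/bin/deadline_vfs",
                        "/var/tmp/openjd/assetroot", os_user="job-user")
# subprocess.Popen(cmd) would start the VFS as job-user (requires sudoers setup)
print(cmd[:3])  # ['sudo', '-u', 'job-user']
```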

What is the impact of this change?

deadline_vfs will now run as the less privileged "job-user", but virtual file system jobs should continue to work as before.

How was this change tested?

Existing unit tests were updated to match the new expected behavior and pass; new tests were added and also pass.

Manual testing on a CMF instance.

Was this change documented?

Internal documentation - should not affect customer workflows

Is this a breaking change?

No

@baxeaz baxeaz requested a review from a team as a code owner March 21, 2024 20:49
@baxeaz baxeaz changed the title Switch to running deadline_vfs as os_user feat: Switch to running deadline_vfs as os_user Mar 21, 2024
baxeaz and others added 9 commits March 21, 2024 21:01
Signed-off-by: Brian Axelson <baxelson@amazon.com>
…eError (#206)

Improve error handling for S3 requests by
- adding "retries" configuration to the S3 client
- adding BotoCoreError handling to cover S3 timeout errors (e.g., ReadTimeoutError, ConnectTimeoutError)
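The retry-on-transient-error pattern that commit describes can be sketched generically with the stdlib (this is not the actual boto3/botocore API; the helper name and exception choices are illustrative):

```python
import time

def with_retries(fn, max_attempts=5, retriable=(TimeoutError, ConnectionError), delay=0.0):
    # Generic sketch: retry transient transport errors (analogous to
    # ReadTimeoutError/ConnectTimeoutError) up to max_attempts times.
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts:
                raise
            time.sleep(delay)

calls = {"n": 0}
def flaky_get():
    # Simulates an S3 request that times out twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated S3 read timeout")
    return "ok"

print(with_retries(flaky_get))  # ok
```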

Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>
…utput files to be synced (#211)

Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>
Updates the requirements on [python-semantic-release](https://github.com/python-semantic-release/python-semantic-release) to permit the latest version.
- [Release notes](https://github.com/python-semantic-release/python-semantic-release/releases)
- [Changelog](https://github.com/python-semantic-release/python-semantic-release/blob/master/CHANGELOG.md)
- [Commits](python-semantic-release/python-semantic-release@v8.7.0...v9.2.2)

---
updated-dependencies:
- dependency-name: python-semantic-release
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>
Signed-off-by: client-software-ci <129794699+client-software-ci@users.noreply.github.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>

Updates the requirements on [coverage[toml]](https://github.com/nedbat/coveragepy) to permit the latest version.
- [Release notes](https://github.com/nedbat/coveragepy/releases)
- [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst)
- [Commits](nedbat/coveragepy@7.3.0...7.4.0)

---
updated-dependencies:
- dependency-name: coverage[toml]
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>
Signed-off-by: Caden Marofke <marofke@amazon.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>
Signed-off-by: Charles Moore <122481442+moorec-aws@users.noreply.github.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>
** BREAKING CHANGE **
* The PathFormat enum's values went from all lowercase to all uppercase
* The source_path_root in the path mapping rules return value from sync_inputs went from all lowercase to all uppercase
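The enum-value change above could look like the following sketch; the member names here are assumptions for illustration, not necessarily the real `PathFormat` definition:

```python
from enum import Enum

class PathFormat(str, Enum):
    # Sketch of the breaking change: the enum *values* are now uppercase.
    POSIX = "POSIX"      # was "posix"
    WINDOWS = "WINDOWS"  # was "windows"

print(PathFormat.POSIX.value)  # POSIX
```

Because the values are serialized into path mapping rules, anything comparing against the old lowercase strings breaks, which is why this is flagged as a breaking change.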

Signed-off-by: Morgan Epp <60796713+epmog@users.noreply.github.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>
src/deadline/job_attachments/vfs.py
@@ -892,7 +892,7 @@ def handle_existing_vfs(
Args:
manifests (BaseAssetManifest): The manifest for the new inputs to be mounted
mount_point (str): The local directory where the manifest is to be mounted

os_user: the user executing the job.
Contributor commented:

We've preferred the phrase "running the job", can you switch the few places this occurs?

Contributor Author replied:

Will do

Signed-off-by: Brian Axelson <baxelson@amazon.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>
@baxeaz baxeaz requested a review from mwiebe March 22, 2024 00:03
as the provided user
:param mount_path: path to mounted folder
"""
log.info(f"Attempting to shut down {mount_path} as {os_user}")
Contributor commented:

Where are these logs ending up? The Worker Agent intentionally isn't sending the job_attachments logger to its own logger, and this isn't using the LoggerAdaptor for the Session logs.

Contributor Author replied:

These end up in the customer's CloudWatch logs currently - should they have all been converted to some other handler/wrapper at some point? Looking around, I do see that vfs.py seems to be doing logging "differently" than everything else.

Contributor replied:

The customer's CloudWatch Logs? Are you certain? Have you seen these log statements in customer logs? I don't think that they are.

The Worker Agent forwards two different log streams to the customer's account.

  1. The Agent logs. Those are from the root logger in the Agent process. The agent intentionally configures the job attachment logger to not propagate, so the logs generated to a deadline.job_attachments.* logger won't end up in those logs.
  2. The Session logs. Those require a LoggerAdaptor that includes the session_id in the log records. e.g. https://github.com/casillas2/deadline-cloud/blob/f76bb981e6555c5d2f5eb54a42c1bd4e73d0bf6f/src/deadline/job_attachments/asset_sync.py#L80-L81
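The LoggerAdaptor pattern mentioned in point 2 can be sketched with the stdlib's `logging.LoggerAdapter` (class and session id below are hypothetical, not the actual deadline-cloud implementation):

```python
import logging

class SessionLogAdapter(logging.LoggerAdapter):
    # Sketch: stamp every record with a session_id so the worker agent can
    # route it to the Session log stream.
    def process(self, msg, kwargs):
        extra = kwargs.setdefault("extra", {})
        extra["session_id"] = self.extra["session_id"]
        return msg, kwargs

captured = []
class _Capture(logging.Handler):
    def emit(self, record):
        captured.append(record)

log = logging.getLogger("deadline.job_attachments.vfs")
log.addHandler(_Capture())
log.setLevel(logging.INFO)

session_log = SessionLogAdapter(log, {"session_id": "session-1234"})
session_log.info("Attempting to shut down %s as %s", "/mnt/assetroot", "job-user")
print(captured[0].session_id)  # session-1234
```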

Contributor Author replied:

I could be misunderstanding something about when which logs go where - In my own testing from yesterday in my account in /aws/deadline/farm-cbe0061ca28c4715a1fec99801e29d83/queue-aaaab3d24be0478ba39a2f7605f747e9/session-c518393409904689a5267416f081f0d7 I get:

2024-03-21T23:55:55.636Z Launching as user job-user
2024-03-21T23:55:55.641Z Launched VFS as pid 25598
2024-03-21T23:55:55.641Z Waiting for is_mount at /var/tmp/openjd/session-c518393409904689a5267416f081f0d7x8ku96ry/assetroot-20264314526cbfe11314 to return True..
...
2024-03-21T23:55:56.924Z Terminating deadline_vfs processes at /var/tmp/openjd/session-c518393409904689a5267416f081f0d7x8ku96ry/assetroot-20264314526cbfe11314.
2024-03-21T23:55:56.925Z Attempting to shut down /var/tmp/openjd/session-c518393409904689a5267416f081f0d7x8ku96ry/assetroot-20264314526cbfe11314 as job-user
2024-03-21T23:55:56.925Z Using saved path /opt/deadline_vfs/bin/deadline_vfs

For validating the VFS we use CMFs though - does that change things?

Contributor replied:

Okay, if you're seeing it in your logs then I'm missing something. There may be a bug in the worker agent; log records without a session_id entry shouldn't be going to the Session log.

So, fine for now, but we'll likely have to fix that bug in the agent and update the VFS logging at the same time.
