-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Switch to running deadline_vfs as os_user #223
Conversation
Signed-off-by: Brian Axelson <baxelson@amazon.com>
…eError (#206) Improve error handling for S3 requests by - adding "retries" configuration to the S3 client - adding BotoCoreError handling to cover S3 timeout errors (e.g., ReadTimeoutError, ConnectTimeoutError) Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com> Signed-off-by: Brian Axelson <baxelson@amazon.com>
…utput files to be synced (#211) Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com> Signed-off-by: Brian Axelson <baxelson@amazon.com>
Updates the requirements on [python-semantic-release](https://github.com/python-semantic-release/python-semantic-release) to permit the latest version. - [Release notes](https://github.com/python-semantic-release/python-semantic-release/releases) - [Changelog](https://github.com/python-semantic-release/python-semantic-release/blob/master/CHANGELOG.md) - [Commits](python-semantic-release/python-semantic-release@v8.7.0...v9.2.2) --- updated-dependencies: - dependency-name: python-semantic-release dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: Brian Axelson <baxelson@amazon.com>
Signed-off-by: client-software-ci <129794699+client-software-ci@users.noreply.github.com> Signed-off-by: Brian Axelson <baxelson@amazon.com>
) Updates the requirements on [coverage[toml]](https://github.com/nedbat/coveragepy) to permit the latest version. - [Release notes](https://github.com/nedbat/coveragepy/releases) - [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst) - [Commits](nedbat/coveragepy@7.3.0...7.4.0) --- updated-dependencies: - dependency-name: coverage[toml] dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: Brian Axelson <baxelson@amazon.com>
Signed-off-by: Caden Marofke <marofke@amazon.com> Signed-off-by: Brian Axelson <baxelson@amazon.com>
Signed-off-by: Charles Moore <122481442+moorec-aws@users.noreply.github.com> Signed-off-by: Brian Axelson <baxelson@amazon.com>
** BREAKING CHANGE ** * The PathFormat enum's values went from all lowercase to all uppercase * The source_path_root in the path mapping rules return value from sync_inputs went from all lowercase to all uppercase Signed-off-by: Morgan Epp <60796713+epmog@users.noreply.github.com> Signed-off-by: Brian Axelson <baxelson@amazon.com>
@@ -892,7 +892,7 @@ def handle_existing_vfs( | |||
Args: | |||
manifests (BaseAssetManifest): The manifest for the new inputs to be mounted | |||
mount_point (str): The local directory where the manifest is to be mounted | |||
|
|||
os_user: the user executing the job. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've preferred the phrase "running the job", can you switch the few places this occurs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
Signed-off-by: Brian Axelson <baxelson@amazon.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>
as the provided user | ||
:param mount_path: path to mounted folder | ||
""" | ||
log.info(f"Attempting to shut down {mount_path} as {os_user}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are these logs ending up? The Worker Agent intentionally isn't sending the job_attachments logger to its own logger, and this isn't using the LoggerAdaptor for the Session logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These end up in the customer's cloudwatch logs currently - should they have all been converted to some other handler/wrapper at some point? Looking around I do see that vfs.py seems to be doing logging "differently" than everything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The customer's CloudWatch Logs? Are you certain? Have you seen these log statements in customer logs? I don't think that they are.
The Worker Agent forwards two different log streams to the customer's account.
- The Agent logs. Those are from the root logger in the Agent process. The agent intentionally configures the job attachment logger to not propagate, so the logs generated to a
deadline.job_attachments.*
logger won't end up in those logs. - The Session logs. Those require a LoggerAdaptor that includes the session_id in the log records. e.g. https://github.com/casillas2/deadline-cloud/blob/f76bb981e6555c5d2f5eb54a42c1bd4e73d0bf6f/src/deadline/job_attachments/asset_sync.py#L80-L81
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could be misunderstanding something about when which logs go where - In my own testing from yesterday in my account in /aws/deadline/farm-cbe0061ca28c4715a1fec99801e29d83/queue-aaaab3d24be0478ba39a2f7605f747e9/session-c518393409904689a5267416f081f0d7 I get:
| 2024-03-21T23:55:55.636Z | Launching as user job-user
| 2024-03-21T23:55:55.641Z | Launched VFS as pid 25598
| 2024-03-21T23:55:55.641Z | Waiting for is_mount at /var/tmp/openjd/session-c518393409904689a5267416f081f0d7x8ku96ry/assetroot-20264314526cbfe11314 to return True..
...
2024-03-21T23:55:56.924Z | Terminating deadline_vfs processes at /var/tmp/openjd/session-c518393409904689a5267416f081f0d7x8ku96ry/assetroot-20264314526cbfe11314.
| 2024-03-21T23:55:56.925Z | Attempting to shut down /var/tmp/openjd/session-c518393409904689a5267416f081f0d7x8ku96ry/assetroot-20264314526cbfe11314 as job-user
| 2024-03-21T23:55:56.925Z | Using saved path /opt/deadline_vfs/bin/deadline_vfs
For validating the VFS we use CMFs though - does that change things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, if you're seeing it in your logs then I'm missing something. There may be a bug in the worker agent; log records without a session_id entry shouldn't be going to the Session log.
So, fine for now, but we'll likely have to fix that bug in the agent and update the VFS logging at the same time.
What was the problem/requirement? (What/Why)
We're currently running deadline_vfs as the privileged "deadline-worker" user rather than "job-user".
This change switches to launching as os_user (corresponding to job-user in our SMFs)
What was the solution? (How)
Switch to launching as os_user (job-user)
Add os_group permissions to manifest so the manifest is readable by the VFS which is now run as job-user.
Shut down vfs as job-user. Also simplifying some of our shut down code.
What is the impact of this change?
deadline_vfs will now be run as the less privileged "job-user", but functionally VIRTUAL file system jobs should continue to work as before.
How was this change tested?
Existing unit tests updated to match new expected behavior pass. New tests added which pass.
Manual testing on a CMF instance.
Was this change documented?
Internal documentation - should not affect customer workflows
Is this a breaking change?
No