feat(deadline): add option to the RenderQueue to use cachefilesd #367

Merged: 3 commits, Mar 31, 2021

Conversation

@ddneilson ddneilson commented Mar 29, 2021

Implements: #366

Testing

Started up a basic render farm using the RFDK examples, but modified the example to set the fsc mount option on the Amazon EFS and enable cachefilesd on the RenderQueue.

Used an AWS Systems Manager connection to remote in to the RenderQueue's ECS container host, and then:

  1. Verified that cachefilesd is installed and running.
  2. Verified that the /mnt/repo filesystem was mounted using the EFS mount helper, which in turn uses the NFSv4 mount driver, and that the fsc option was properly set on the NFSv4 mount point.
[root@ip-10-0-126-223 ~]# mount | grep nfs
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
127.0.0.1:/ on /mnt/repo type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20312,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,fsc,local_lock=none,addr=127.0.0.1,_netdev)
  3. Ran a series of full copies of the /mnt/repo filesystem (approx 1.3 GB of data) to the local device, both with and without cachefilesd running. The image below shows the throughput metrics on the Amazon EFS filesystem during these tests; as expected, once the local cache has been populated, the read throughput with cachefilesd running is much lower than for a full copy without it.

[Image: RepoCopy]
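
The fsc check from step 2 can also be scripted rather than read by eye. The following is a minimal sketch, not part of this PR: the function name `nfs_mounts_with_fsc` is hypothetical, and it assumes `mount` prints lines shaped like `<source> on <mount point> type <fstype> (<options>)` with no spaces in the mount point, as in the output above.

```python
# Sample `mount | grep nfs` output, copied from the test notes above.
SAMPLE = """\
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
127.0.0.1:/ on /mnt/repo type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,port=20312,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,fsc,local_lock=none,addr=127.0.0.1,_netdev)
"""


def nfs_mounts_with_fsc(mount_output: str) -> dict:
    """Map each NFS mount point to True if the fsc option is set on it.

    Hypothetical helper: assumes mount points contain no spaces.
    """
    result = {}
    for line in mount_output.splitlines():
        parts = line.split()
        # Skip anything that is not "<source> on <dir> type <fstype> (<opts>)".
        if len(parts) < 6 or parts[1] != "on" or parts[3] != "type":
            continue
        fstype = parts[4]
        # Only nfs/nfs4 mounts are relevant; rpc_pipefs and others are skipped.
        if not fstype.startswith("nfs"):
            continue
        options = parts[5].strip("()").split(",")
        result[parts[2]] = "fsc" in options
    return result


if __name__ == "__main__":
    print(nfs_mounts_with_fsc(SAMPLE))
```

On a live host the same function could be fed the output of `subprocess.run(["mount"], capture_output=True, text=True).stdout` instead of the pasted sample.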

Performance Testing

Test setup:

  1. RFDK Basic all-in application with an added bastion host set up as a Workstation (installed Deadline & DCV). So... single RCS (c5.large), EFS burst-mode filesystem with ~1GB of data (encrypted), DocDB (db.r5.large), Linux-based Worker ASG (t3.medium), and TLS enabled throughout

Summary

So, cachefilesd shaves about 30% off metered throughput when starting 200 workers (9.73MB/s -> 6.9MB/s) and 26% off metered throughput when rendering 1k 'sleep 10' jobs (1.9MB/s -> 1.6MB/s). The standard caveats apply, of course: this is a single data point, so there will be sizable error bars on those numbers.
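
For reference, the percentage reductions can be recomputed from the throughput figures quoted above. This is only a sketch of the arithmetic, and `percent_reduction` is a hypothetical helper name. Using the rounded figures as written, the worker-start case comes out at roughly 29%, in line with "about 30%", while the render case comes out at roughly 16% rather than 26%. The quoted 26% may have been derived from unrounded measurements that are not shown here.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the reduction from `before` to `after` as a percentage of `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


if __name__ == "__main__":
    # Metered EFS throughput in MB/s, as quoted in the summary (rounded values).
    cases = {
        "starting 200 workers": (9.73, 6.9),
        "rendering 1k 'sleep 10' jobs": (1.9, 1.6),
    }
    for label, (before, after) in cases.items():
        print(f"{label}: {percent_reduction(before, after):.1f}% lower")
```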


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@ddneilson ddneilson linked an issue Mar 29, 2021 that may be closed by this pull request
2 tasks
@ddneilson ddneilson requested a review from jusiskin March 29, 2021 16:55

RandomInsano commented Mar 29, 2021

Looks great! My only concern is with the namespace for this option and underlying function names being generic as it has the following two effects:

  1. Those aware of cachefilesd have to check the docs to know what the underlying tech is
  2. If we find an even better caching option or choose to change the Repo's backing protocol (say to SMB) we'll have problems.

Maybe we could swap "enable_local_file_caching" for "enable_fsc_caching", etc?

@jusiskin jusiskin added the contribution/core This is a PR that came from AWS. label Mar 30, 2021
jericht previously approved these changes Mar 31, 2021

@jericht jericht left a comment

Just one really minor comment, otherwise this LGTM

ddneilson commented Mar 31, 2021

> Looks great! My only concern is with the namespace for this option and underlying function names being generic as it has the following two effects:
>
> 1. Those aware of cachefilesd have to check the docs to know what the underlying tech is
> 2. If we find an _even better_ caching option or choose to change the Repo's backing protocol (say to SMB) we'll have problems.
>
> Maybe we could swap "enable_local_file_caching" for "enable_fsc_caching", etc?

@RandomInsano Thanks for the feedback, Edwin. Variable/property naming is hard. ;-) On one hand... generic for those that don't know the details, or don't want to know/care... On the other hand... it is nice to know what is going on under the hood. I'm trying to strike that balance by going generic with the property name and providing details in the docstring (which shows up under quick-info-type IDE functionality). My thinking is also that if there's a different caching tech for a different filesystem, then we'd want to support that under the same option but detect the filesystem type if we can; it's all the same high-level functionality, regardless of how it's accomplished.

You're thinking that I'm missing the mark with it?

@jusiskin jusiskin left a comment

Just one minor question, but otherwise looks great.

@jusiskin jusiskin changed the title feat(deadline): Adds option to the RenderQueue to use cachefilesd feat(deadline): adds option to the RenderQueue to use cachefilesd Mar 31, 2021
@jusiskin jusiskin changed the title feat(deadline): adds option to the RenderQueue to use cachefilesd feat(deadline): add option to the RenderQueue to use cachefilesd Mar 31, 2021
@jusiskin jusiskin merged commit 901b749 into aws:mainline Mar 31, 2021
@ddneilson ddneilson deleted the file-caching-rq branch April 1, 2021 13:09
Development

Successfully merging this pull request may close these issues.

Enable cachefilesd on RenderQueue
4 participants