Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate requester compute #513

Merged
merged 3 commits into from
Jul 17, 2023
Merged

Separate requester compute #513

merged 3 commits into from
Jul 17, 2023

Conversation

hevans66
Copy link
Contributor

  • Separates requester and compute nodes to separate EC2 instances. Currently one requester and one compute instance.

  • Factors out IPFS install steps into taks file.

  • This PR creates a new Requester/Compute node, essentially a new Plex instance - running at ec2-18-208-163-46.compute-1.amazonaws.com. I think if were happy with how this is working we can bounce the private ip to this node, then decommission the old compute node in a separate PR.

@vercel
Copy link

vercel bot commented Jul 17, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 17, 2023 7:41pm

@hevans66 hevans66 temporarily deployed to ci July 17, 2023 18:41 — with GitHub Actions Inactive
Comment on lines +96 to +108
- name: Install Nvidia Container Tookit
become: yes
ansible.builtin.apt:
pkg:
- nvidia-docker2
notify:
- Restart docker
when: gpu

- name: Ensure Nvidia persitence daemon is started
ansible.builtin.systemd:
name: nvidia-persistenced
when: gpu
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not blocker: the 2 when's can be combined using ansible block

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to combine them, but I had them separated to decouple the Restart docker hook as much as possible. My thinking is since most of the jobs compute is running are docker based we want to avoid restarting docker unnecessarily if possible. Probably a minor optimization in this case though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that case, we can use docker live-restore option. https://docs.docker.com/config/containers/live-restore/

Comment on lines +45 to +79
# Nvidia
- name: Get Nvidia drivers apt key
ansible.builtin.get_url:
url: https://developer.download.nvidia.com/compute/cuda/repos/{{ nvidia_distribution }}/x86_64/cuda-keyring_1.0-1_all.deb
dest: /tmp/cuda-keyring.deb
when: gpu

- name: Add Nvidia Keyring
become: yes
ansible.builtin.apt:
deb: /tmp/cuda-keyring.deb
when: gpu

- name: Get Nvidia Container Tookit GPG key
become: yes
ansible.builtin.shell:
cmd: curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --yes --dearmor -o {{ nvidia_container_toolkit_key_path }}
creates: "{{ nvidia_container_toolkit_key_path }}"
when: gpu

- name: Add Nvidia Container Tookit Repository
become: yes
ansible.builtin.apt_repository:
repo: deb [signed-by={{ nvidia_container_toolkit_key_path }}] https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
state: present
when: gpu

- name: Install required system packages for gpu build
become: yes
ansible.builtin.apt:
pkg:
- cuda-drivers
state: latest
update_cache: true
when: gpu
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not blocker: combining block and when.

@alabdao
Copy link
Collaborator

alabdao commented Jul 17, 2023

just a general note, again not a blocker. Just good to delete anything that gets downloaded under /tmp/ to avoid filling up disk space. Especially matters when its a long lived node, small disk, multiple versions get installed, can fill up quick.

@alabdao
Copy link
Collaborator

alabdao commented Jul 17, 2023

since my comments aren't blocking, gonna approve this to move it along.

@hevans66
Copy link
Contributor Author

just a general note, again not a blocker. Just good to delete anything that gets downloaded under /tmp/ to avoid filling up disk space. Especially matters when its a long lived node, small disk, multiple versions get installed, can fill up quick.

👍 💯 agree. I'm gonna merge as is. We can make /tmp/ clean up (and probably disk space monitoring in general) a separate project.

@hevans66 hevans66 merged commit c001733 into main Jul 17, 2023
@hevans66 hevans66 deleted the ops/separate-requester-compute branch July 17, 2023 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants