
[LAB-631] ansible 1.1.0 #706

Merged: 17 commits merged into main from ops/LAB-631-ansible-1.1.0 on Oct 17, 2023

Conversation

@alabdao (Collaborator) commented Oct 17, 2023

  • supporting latest prod deployment
  • avoid pulling containers since they cause slowdowns
  • setting receptor url
  • new prod setup
  • removed prod
  • Ability to deploy custom file
  • install config file at the end after bacalhau repo has been initialized
  • add instance id to labels
  • [LAB-595] Dynamically determining if GPUs are available and how many
  • casting to int
  • fixing query
  • fixing bacalhau client version determination command
  • making limit memory determination dynamic
  • gather facts true
  • accept networked jobs
  • removing quotes

@linear bot and @vercel bot commented Oct 17, 2023. Vercel preview for docs: Ready (updated Oct 17, 2023 2:58pm UTC).

@alabdao changed the title from "ops/LAB 631 ansible 1.1.0" to "[LAB-631] ansible 1.1.0" on Oct 17, 2023
--limit-job-memory 12gb \
{% if gpu %}
--limit-job-gpu 1 \
--limit-job-memory {{ (ansible_memtotal_mb | int * 0.80) | round | int }}Mb \
@alabdao (Collaborator, Author) commented Oct 17, 2023:

Dynamically determining total memory and setting the limit to 80% of it, instead of the hardcoded 12Gb, which is low for some instances.
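For illustration (a sketch, not output captured from this PR), on a hypothetical instance reporting ansible_memtotal_mb = 16384 the new template line renders as:

# 16384 * 0.80 = 13107.2 -> round -> 13107
--limit-job-memory 13107Mb \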

Comment on lines +22 to +23
{% if num_of_gpus | int > 0 %}
--limit-job-gpu {{ num_of_gpus | int }} \
@alabdao (Collaborator, Author):

Dynamically setting the number of GPUs. This supports non-GPU instances as well as instances with more than one GPU, instead of hardcoding the value to 1.
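As a sketch, assuming the fact gathering shown later in this PR set num_of_gpus to 4 on a hypothetical four-GPU instance, the block renders as:

--limit-job-gpu 4 \

On a CPU-only instance num_of_gpus is 0, the condition is false, and the flag is omitted entirely.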

@@ -0,0 +1,67 @@
---
@alabdao (Collaborator, Author):

Since CLI flags don't work anymore, setting the configuration in the file.

JobExecutionTimeoutClientIDBypassList: []
JobNegotiationTimeout: 3m0s
MinJobExecutionTimeout: 500ms
MaxJobExecutionTimeout: {{ bacalhau_compute_max_job_execution_timeout | default('24h') }}
@alabdao (Collaborator, Author):

Allowing an override for the max execution timeout; set to 24h for now.
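A minimal sketch of overriding the default, e.g. from group_vars or --extra-vars (the 48h value and file path are illustrative assumptions; only the variable name comes from the template above):

# group_vars/compute.yaml (hypothetical location)
bacalhau_compute_max_job_execution_timeout: "48h"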

JobSelection:
Locality: anywhere
RejectStatelessJobs: false
AcceptNetworkedJobs: true
@alabdao (Collaborator, Author):

Setting it to true here since CLI flags don't work at the moment.

vars:
nvidia_distribution: ubuntu2004
ipfs_version: "0.18.0"
ipfs_path: "/opt/ipfs"
gpu: true
@alabdao (Collaborator, Author):

This is now determined dynamically.

Comment on lines +30 to +42
# Get GPU info from system
- name: Get lshw display info
become: true
ansible.builtin.command: lshw -c display -json
changed_when: true
register: lshw_output

- name: set number of gpus available
vars:
query: "[?vendor=='NVIDIA Corporation']"
ansible.builtin.set_fact:
num_of_gpus: "{{ lshw_output.stdout | from_json | json_query(query) | length }}"

@alabdao (Collaborator, Author):

Get the number of GPUs from the lshw command output.
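A sketch of how the count is derived, assuming a hypothetical lshw -c display -json output with one NVIDIA controller and one non-NVIDIA controller (the json_query filter needs the jmespath Python package on the controller):

# Hypothetical lshw output (trimmed):
# [
#   { "id": "display",   "vendor": "NVIDIA Corporation",     "product": "GA100" },
#   { "id": "display:1", "vendor": "ASPEED Technology, Inc." }
# ]
# The query "[?vendor=='NVIDIA Corporation']" keeps only the first entry,
# so "| length" evaluates to 1 and num_of_gpus becomes "1".
- name: Show detected GPU count (debug sketch)
  ansible.builtin.debug:
    msg: "Detected {{ num_of_gpus }} NVIDIA GPU(s)"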

Comment on lines +62 to +63
# - name: Pull common containers
# ansible.builtin.include_tasks: tasks/pull_common_containers.yaml
@alabdao (Collaborator, Author):

Pulling containers causes serious load on the system. Disabling this for now until we figure out something better here.

Contributor:
Does this mean the compute node will pull the container it needs the first time it runs the Job?

@alabdao (Collaborator, Author):

That's correct. A potential fix here would be to use Packer to create a compute image that already has the selected Convexity containers available.

@@ -1,6 +1,6 @@
# Try running Bacalhau first, to see what version it is.
- name: Check bacalhau version
ansible.builtin.command: /usr/local/bin/bacalhau version
ansible.builtin.command: /usr/local/bin/bacalhau version --client --no-style --hide-header
@alabdao (Collaborator, Author):

Only get the currently installed bacalhau client version, without the pretty formatting.

@@ -9,7 +9,7 @@

- name: Set fact for currently installed version
ansible.builtin.set_fact:
bacalhau_installed_version: "{{ existing_bacalhau_version.stdout.split('Server Version: ')[1] }}"
bacalhau_installed_version: "{{ existing_bacalhau_version.stdout | trim }}"
@alabdao (Collaborator, Author):

The output format has changed and no longer contains the 'Server Version: ' label.
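A sketch of the difference, with hypothetical version strings: previously the command printed labelled client and server versions, so the value had to be extracted with split; the new flags are expected to print just a bare client version, so trim is enough to strip the trailing newline.

# Hypothetical old stdout:  "Client Version: v1.1.0\nServer Version: v1.1.0\n"
#   .split('Server Version: ')[1]                  -> "v1.1.0\n"
# Hypothetical new stdout:  "v1.1.0\n"
#   {{ existing_bacalhau_version.stdout | trim }}  -> "v1.1.0"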

Comment on lines -50 to +49
- name: Set fact when its non-prod node
- name: Set fact
ansible.builtin.set_fact:
requester_hostname: "requester.{{ ansible_ec2_tags_instance_Env | lower }}.labdao.xyz"
ipfs_hostname: "ipfs.{{ ansible_ec2_tags_instance_Env | lower }}.labdao.xyz"
when: ansible_ec2_tags_instance_Env is defined and ansible_ec2_tags_instance_Env | lower != "prod"
receptor_hostname: "receptor.{{ ansible_ec2_tags_instance_Env | lower }}.labdao.xyz"
when: ansible_ec2_tags_instance_Env is defined
@alabdao (Collaborator, Author) commented Oct 17, 2023:

All envs now follow the <something>.<env>.labdao.xyz approach.
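A sketch of the rendered facts for a hypothetical instance whose Env tag is "Staging":

# ansible_ec2_tags_instance_Env = "Staging" -> | lower -> "staging"
requester_hostname: requester.staging.labdao.xyz
ipfs_hostname: ipfs.staging.labdao.xyz
receptor_hostname: receptor.staging.labdao.xyz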

Comment on lines +70 to +74
- name: Set receptor url
ansible.builtin.set_fact:
receptor_url: "http://{{ receptor_hostname }}:8080/judge"
when: receptor_hostname is defined

@alabdao (Collaborator, Author):

Determining the receptor URL from the environment.
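Continuing the hypothetical "staging" example above, the fact renders as:

receptor_url: http://receptor.staging.labdao.xyz:8080/judge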

Comment on lines +75 to +83
- name: Ensure path to bacalhau dir exists
become: true
ansible.builtin.file:
path: /home/ubuntu/.bacalhau/
state: directory
mode: "0755"
owner: ubuntu
group: ubuntu

@alabdao (Collaborator, Author):

Creating the directory so we can push the config file there.

Comment on lines +100 to +112
- name: Flush handler to ensure Bacalhau is running
ansible.builtin.meta: flush_handlers

- name: Deploy config file
become: true
ansible.builtin.template:
src: "files/{{ bacalhau_node_type }}.yaml"
dest: /home/ubuntu/.bacalhau/config.yaml
owner: ubuntu
group: ubuntu
mode: "0644"
notify:
- Restart Bacalhau
@alabdao (Collaborator, Author):

Deploy the custom config file.
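The notify refers to a "Restart Bacalhau" handler defined elsewhere in the role; a minimal sketch of what such a handler could look like, assuming Bacalhau runs as a systemd unit named bacalhau (the unit name and module choice are assumptions, not taken from this PR):

handlers:
  - name: Restart Bacalhau
    become: true
    ansible.builtin.systemd:
      name: bacalhau
      state: restarted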

@thetechnocrat-dev (Contributor) left a comment:
Never mind, everything looks good.


@alabdao merged commit 18014ad into main on Oct 17, 2023
1 check passed
@alabdao deleted the ops/LAB-631-ansible-1.1.0 branch on October 17, 2023 at 16:11