[LAB-631] ansible 1.1.0 #706
Conversation
alabdao commented on Oct 17, 2023
- supporting latest prod deployment
- avoid pulling containers since they cause slowdowns
- setting receptor url
- new prod setup
- removed prod
- Ability to deploy custom file
- install config file at the end, after the bacalhau repo has been initialized
- add instance id to labels
- [LAB-595] Dynamically determining if GPUs are available and how many
- casting to int
- fixing query
- fixing the bacalhau client version determination command
- making limit memory determination dynamic
- gather facts true
- accept networked jobs
- removing quotes
--limit-job-memory 12gb \
{% if gpu %}
--limit-job-gpu 1 \
--limit-job-memory {{ (ansible_memtotal_mb | int * 0.80) | round | int }}Mb \
Dynamically determining total memory and setting the limit to 80% of it, instead of the hardcoded 12Gb, which is low for some instances.
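The Jinja expression above can be checked with a plain-Python equivalent; the 16384 MB figure below is just an example value for `ansible_memtotal_mb`, not taken from a real host:

```python
# Plain-Python equivalent of the Jinja expression:
#   {{ (ansible_memtotal_mb | int * 0.80) | round | int }}
def memory_limit_mb(ansible_memtotal_mb: int) -> int:
    """Return 80% of total memory in MB, rounded to the nearest integer."""
    return int(round(ansible_memtotal_mb * 0.80))

# Example: a 16 GiB (16384 MB) instance gets a ~13.1 GB job memory limit.
print(f"--limit-job-memory {memory_limit_mb(16384)}Mb")  # --limit-job-memory 13107Mb
```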
{% if num_of_gpus | int > 0 %}
--limit-job-gpu {{ num_of_gpus | int }} \
Dynamically setting the number of GPUs. This supports non-GPU instances as well as instances with more than one GPU, instead of hardcoding the value to always be 1.
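A small sketch of the conditional the template expresses: the flag is emitted only when `num_of_gpus` is positive, so non-GPU instances get no `--limit-job-gpu` flag at all. This is an illustrative stand-in for the Jinja `{% if %}` block, not code from the repo:

```python
# Mirrors: {% if num_of_gpus | int > 0 %} --limit-job-gpu {{ num_of_gpus | int }} {% endif %}
def gpu_flag(num_of_gpus) -> list:
    """Return the CLI flag fragment for the GPU limit, or nothing."""
    n = int(num_of_gpus)
    if n > 0:
        return ["--limit-job-gpu", str(n)]
    return []  # non-GPU instance: omit the flag entirely

print(gpu_flag(0))  # []
print(gpu_flag(2))  # ['--limit-job-gpu', '2']
```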
@@ -0,0 +1,67 @@
---
Since the CLI flags don't work anymore, setting the configuration in the file.
JobExecutionTimeoutClientIDBypassList: []
JobNegotiationTimeout: 3m0s
MinJobExecutionTimeout: 500ms
MaxJobExecutionTimeout: {{ bacalhau_compute_max_job_execution_timeout | default('24h') }}
Allowing an override for the max job execution timeout; set to 24h for now.
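The Jinja `default('24h')` filter behaves like a dictionary lookup with a fallback; a minimal sketch of that behavior (the variable name is from the template above, the dict lookup is just an analogy):

```python
# Sketch of Jinja's default() filter: render the override if the
# variable is set, otherwise fall back to '24h'.
def max_job_execution_timeout(host_vars: dict) -> str:
    return host_vars.get("bacalhau_compute_max_job_execution_timeout", "24h")

print(max_job_execution_timeout({}))  # 24h
print(max_job_execution_timeout({"bacalhau_compute_max_job_execution_timeout": "48h"}))  # 48h
```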
JobSelection:
  Locality: anywhere
  RejectStatelessJobs: false
  AcceptNetworkedJobs: true
Setting it to true here since the CLI flags don't work at the moment.
vars:
  nvidia_distribution: ubuntu2004
  ipfs_version: "0.18.0"
  ipfs_path: "/opt/ipfs"
  gpu: true
Dynamic now.
# Get GPU info from system
- name: Get lshw display info
  become: true
  ansible.builtin.command: lshw -c display -json
  changed_when: true
  register: lshw_output

- name: Set number of gpus available
  vars:
    query: "[?vendor=='NVIDIA Corporation']"
  ansible.builtin.set_fact:
    num_of_gpus: "{{ lshw_output.stdout | from_json | json_query(query) | length }}"
Get the number of GPUs from the lshw command output.
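The `json_query` counting step can be reproduced in plain Python. The sample below is a hand-written illustration of what `lshw -c display -json` might return (including a non-NVIDIA virtual display), not output captured from a real host, and the JMESPath filter is expressed as a list comprehension:

```python
import json

# Illustrative sample of `lshw -c display -json` output (hypothetical data).
sample_lshw_stdout = json.dumps([
    {"id": "display", "vendor": "NVIDIA Corporation", "product": "A10G"},
    {"id": "display", "vendor": "NVIDIA Corporation", "product": "A10G"},
    {"id": "display", "vendor": "Amazon.com, Inc.", "product": "virtual display"},
])

# Equivalent of the JMESPath query "[?vendor=='NVIDIA Corporation']" piped
# through | length: count only NVIDIA display adapters.
num_of_gpus = len([
    dev for dev in json.loads(sample_lshw_stdout)
    if dev.get("vendor") == "NVIDIA Corporation"
])
print(num_of_gpus)  # 2
```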
# - name: Pull common containers
#   ansible.builtin.include_tasks: tasks/pull_common_containers.yaml
Pulling containers causes serious load on the system; disabling this for now until we figure out something better here.
Does this mean the compute node will pull the container it needs the first time it runs the Job?
That's correct.
A potential fix here would be to use Packer to create a compute image that already has the Convexity-selected containers available.
@@ -1,6 +1,6 @@
# Try running Bacalhau first, to see what version it is.
- name: Check bacalhau version
  ansible.builtin.command: /usr/local/bin/bacalhau version
  ansible.builtin.command: /usr/local/bin/bacalhau version --client --no-style --hide-header
Only get the currently installed Bacalhau version, without the pretty formatting.
@@ -9,7 +9,7 @@

- name: Set fact for currently installed version
  ansible.builtin.set_fact:
    bacalhau_installed_version: "{{ existing_bacalhau_version.stdout.split('Server Version: ')[1] }}"
    bacalhau_installed_version: "{{ existing_bacalhau_version.stdout | trim }}"
The output format has changed and no longer includes the 'Server Version:' prefix.
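To illustrate why the parsing changed, here are the two approaches side by side. Both output strings below are hypothetical examples of the old and new `bacalhau version` output, not captured from a real binary:

```python
# Hypothetical old output: version labels with a 'Server Version:' line.
old_stdout = "Client Version: v1.0.3\nServer Version: v1.0.3"
# Hypothetical new output with --client --no-style --hide-header:
# just the bare version string, possibly with surrounding whitespace.
new_stdout = " v1.0.3\n"

# Old approach: split on the 'Server Version: ' prefix.
old_parsed = old_stdout.split("Server Version: ")[1]
# New approach: the plain output only needs trimming.
new_parsed = new_stdout.strip()

print(old_parsed)  # v1.0.3
print(new_parsed)  # v1.0.3
```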
- name: Set fact when its non-prod node
- name: Set fact
  ansible.builtin.set_fact:
    requester_hostname: "requester.{{ ansible_ec2_tags_instance_Env | lower }}.labdao.xyz"
    ipfs_hostname: "ipfs.{{ ansible_ec2_tags_instance_Env | lower }}.labdao.xyz"
  when: ansible_ec2_tags_instance_Env is defined and ansible_ec2_tags_instance_Env | lower != "prod"
    receptor_hostname: "receptor.{{ ansible_ec2_tags_instance_Env | lower }}.labdao.xyz"
  when: ansible_ec2_tags_instance_Env is defined
All envs now follow the `<service>.<env>.labdao.xyz` approach.
- name: Set receptor url
  ansible.builtin.set_fact:
    receptor_url: "http://{{ receptor_hostname }}:8080/judge"
  when: receptor_hostname is defined
Determining the receptor URL from the env.
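The hostname and URL derivation from the EC2 `Env` tag can be sketched as one plain-Python function mirroring the `set_fact` tasks above. The "Staging" tag value below is an illustrative example, not a known environment name:

```python
# Sketch of the facts derived from the EC2 instance Env tag,
# mirroring the set_fact tasks: <service>.<env>.labdao.xyz.
def derive_facts(env_tag: str) -> dict:
    env = env_tag.lower()  # the Jinja templates apply | lower
    receptor_hostname = f"receptor.{env}.labdao.xyz"
    return {
        "requester_hostname": f"requester.{env}.labdao.xyz",
        "ipfs_hostname": f"ipfs.{env}.labdao.xyz",
        "receptor_hostname": receptor_hostname,
        "receptor_url": f"http://{receptor_hostname}:8080/judge",
    }

print(derive_facts("Staging")["receptor_url"])
# http://receptor.staging.labdao.xyz:8080/judge
```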
- name: Ensure path to bacalhau dir exists
  become: true
  ansible.builtin.file:
    path: /home/ubuntu/.bacalhau/
    state: directory
    mode: "0755"
    owner: ubuntu
    group: ubuntu
Creating dir so we can push the config file there.
- name: Flush handler to ensure Bacalhau is running
  ansible.builtin.meta: flush_handlers

- name: Deploy config file
  become: true
  ansible.builtin.template:
    src: "files/{{ bacalhau_node_type }}.yaml"
    dest: /home/ubuntu/.bacalhau/config.yaml
    owner: ubuntu
    group: ubuntu
    mode: "0644"
  notify:
    - Restart Bacalhau
Deploy the custom file.
Never mind, everything looks good.