Support moving state to persistent storage #173

Merged 70 commits on Sep 13, 2022
53449cd
support moving state to persistent storage
sjpb Apr 21, 2022
51baf3b
update openhpc role to fix offboard state
sjpb Apr 21, 2022
a328906
set podman volume storage path (rather than moving podman /home/rocky)
sjpb Apr 28, 2022
596eb14
use volume for appliances_state_dir in smslabs
sjpb Apr 28, 2022
3ad5e83
supply defaults for packer build options
sjpb Apr 28, 2022
c996976
omit block device configuration in packer build
sjpb Apr 28, 2022
08251f2
Merge branch 'main' into feature/offboard-state
sjpb Apr 29, 2022
aaddf02
use cloud-init rather than block_devices to create state directory
sjpb May 3, 2022
0a6f189
Merge branch 'fix/packer-build' into feature/offboard-state
sjpb May 4, 2022
06bff56
document problem with block_devices in packer build
sjpb May 4, 2022
98846bd
bump openhpc role version after merging PRs
sjpb May 5, 2022
7181d99
Merge 'main' (with smslabs+arcus CI and common terraform) into featur…
sjpb May 18, 2022
c4b165c
Merge branch 'main' into feature/offboard-state
sjpb May 26, 2022
72d6fd0
Merge branch 'main' into feature/offboard-state
sjpb Jul 5, 2022
9893c35
Create users with fixed uids/gids to own all persistent state ...
sjpb Jul 6, 2022
73208b5
Run hpctests before reimage and check sacct data survives reimage
sjpb Jul 6, 2022
6140e26
add variable for state volume size & bump default to be larger than d…
sjpb Jul 6, 2022
38f62a5
add docs for appliances_state_dir and default TF
sjpb Jul 6, 2022
c8d6ef9
make sacct check no-change
sjpb Jul 6, 2022
4a872c1
fix prometheus user $HOME when using appliances_state_dir
sjpb Jul 14, 2022
bc65971
fix mysql auth failures after rebuild with genericcloud image
sjpb Jul 19, 2022
40da4de
add squid proxy on smslabs
sjpb Jul 19, 2022
3197f3b
Merge branch 'main' into feature/offboard-state
sjpb Aug 16, 2022
086c514
don't move podman volume roots to appliances_state_dir - should bind …
sjpb Aug 16, 2022
c87e424
move opendistro data to host directory, (optionally) defined by appli…
sjpb Aug 18, 2022
727e3f6
Merge branch 'main' into feature/offboard-state
sjpb Aug 18, 2022
f0a97f7
move MPI tests to before reimage, to check slurm/monitoring state sur…
sjpb Aug 18, 2022
3bb779b
fix appliances_state_dir being missing in packer build
sjpb Aug 24, 2022
e5f5cb7
use appliances_state_dir for opendistro
sjpb Aug 24, 2022
f308dee
make control userdata more obvious
sjpb Aug 24, 2022
0a333f3
add note on arcus /etc/hosts workaround limits/benefits
sjpb Aug 25, 2022
8624be3
add hacky workaround for state dir unit file dependencies
sjpb Aug 25, 2022
2a25f4c
restart mysql after reimaging control
sjpb Aug 30, 2022
71509c3
don't update mysql root password
sjpb Aug 30, 2022
fb88dc3
wip: use containerised mysql instead of geerlingguy.mysql role
sjpb Aug 30, 2022
2a323ba
add state/enabled control for mysql
sjpb Aug 30, 2022
a9705c8
make mysql depend on datadir mountpoint
sjpb Aug 30, 2022
50c4f44
remove drop-in fudge for mysql now handled in its own unit file
sjpb Aug 30, 2022
f571801
change arcus CI to use image without mariadb-server package now using…
sjpb Aug 30, 2022
eead15f
move arcus debugging info from post- hook to pre- hook
sjpb Aug 31, 2022
1d2b290
debug mysql startup failure
sjpb Aug 31, 2022
dc79100
Merge branch 'main' into feature/offboard-state
sjpb Aug 31, 2022
612b322
make mysql root password more secure
sjpb Aug 31, 2022
2e2ccc7
fix mysql SELinux context
sjpb Aug 31, 2022
daf518e
fix opendistro SELinux context
sjpb Aug 31, 2022
3a1295d
update arcus builder image to use image without mariadb-server
sjpb Aug 31, 2022
6f916fb
fix error starting containers due to low per-user kernel key limits
sjpb Aug 31, 2022
28f825d
remove mysql restart from CI
sjpb Aug 31, 2022
19ff7c2
remove now-uneeded mysql user
sjpb Aug 31, 2022
4457328
add volume for /etc/exports -> /home in default TF/config
sjpb Aug 31, 2022
5ec59e4
add missing mysql users default
sjpb Aug 31, 2022
ea8fb48
add mysql readme
sjpb Aug 31, 2022
c51ebdf
increase retries when waiting for mysql initialisation
sjpb Sep 1, 2022
93a28a6
improve mysql readme
sjpb Sep 5, 2022
3122250
rename mysql:mysql_enabled to mysql:mysql_systemd_service_enabled
sjpb Sep 5, 2022
82ea5f4
make opendistro systemd unitfile more readable
sjpb Sep 5, 2022
18703e3
remove partitions from default TF volumes
sjpb Sep 5, 2022
0f164cf
remove mysql_*_login_details variables
sjpb Sep 5, 2022
1768d76
update block_devices README
sjpb Sep 5, 2022
0448b2d
don't try to start mysql and create users/dbs during packer build
sjpb Sep 6, 2022
7e12add
require /exports/home mounted before starting nfs-server, in default TF
sjpb Sep 7, 2022
3f8b3b6
don't NFS-mount /home on control, in defaults
sjpb Sep 7, 2022
34a222d
ensure podman available for mysql
sjpb Sep 8, 2022
38a66c8
try to fix /exports/home mount
sjpb Sep 8, 2022
a41a59a
unhackfiy systemd unit modifications for appliances_state_dir
sjpb Sep 8, 2022
2f455b9
cleanup docs for persistent state
sjpb Sep 8, 2022
4121939
allow multiple persistent systemd services per role
sjpb Sep 8, 2022
4d51467
link hosts running systemd unit adjustments to placement of correspon…
sjpb Sep 12, 2022
5351c64
make systemd role more generic
sjpb Sep 12, 2022
7b206fc
move systemd unit modifications to start of site.yml
sjpb Sep 12, 2022
15 changes: 12 additions & 3 deletions .github/workflows/stackhpc.yml
@@ -100,7 +100,16 @@ jobs:
OS_CLOUD: openstack
ANSIBLE_FORCE_COLOR: True
TEST_USER_PASSWORD: ${{ secrets.TEST_USER_PASSWORD }}


- name: Run MPI-based tests
run: |
. venv/bin/activate
. environments/${{ matrix.cloud }}/activate
ansible-playbook -vv ansible/adhoc/hpctests.yml
env:
ANSIBLE_FORCE_COLOR: True
OS_CLOUD: openstack

- name: Confirm Open Ondemand is up (via SOCKS proxy)
run: |
. venv/bin/activate
@@ -154,11 +163,11 @@ jobs:
OS_CLOUD: openstack
ANSIBLE_FORCE_COLOR: True

- name: Run MPI-based tests
- name: Check sacct state survived reimage
run: |
. venv/bin/activate
. environments/${{ matrix.cloud }}/activate
ansible-playbook -vv ansible/adhoc/hpctests.yml
ansible-playbook -vv ansible/ci/check_sacct_hpctests.yml
env:
ANSIBLE_FORCE_COLOR: True
OS_CLOUD: openstack
4 changes: 4 additions & 0 deletions ansible/.gitignore
@@ -26,3 +26,7 @@ roles/*
!roles/slurm_exporter/**
!roles/firewalld/
!roles/firewalld/**
!roles/mysql/
!roles/mysql/**
!roles/systemd/
!roles/systemd/**
24 changes: 22 additions & 2 deletions ansible/bootstrap.yml
@@ -16,16 +16,36 @@
- hosts: cluster
gather_facts: false
tasks:
- name: Add groups
ansible.builtin.group: "{{ item.group }}"
loop: "{{ appliances_local_users }}"
when:
- item.enable | default(true) | bool
- "'group' in item"
become_method: "sudo"
# Need to change working directory otherwise we try to switch back to non-existent directory.
become_flags: '-i'
become: true
- name: Add users
ansible.builtin.user: "{{ item }}"
with_items: "{{ appliances_local_users }}"
ansible.builtin.user: "{{ item.user }}"
loop: "{{ appliances_local_users }}"
when: item.enable | default(true) | bool
become_method: "sudo"
# Need to change working directory otherwise we try to switch back to non-existent directory.
become_flags: '-i'
become: true
- name: Reset ssh connection to allow user changes to affect ansible_user
meta: reset_connection

- hosts: systemd
become: yes
gather_facts: false
tags: systemd
tasks:
- name: Make systemd unit modifications
import_role:
name: systemd

- hosts: selinux
gather_facts: false
become: yes
27 changes: 27 additions & 0 deletions ansible/ci/check_sacct_hpctests.yml
@@ -0,0 +1,27 @@
- hosts: control
  gather_facts: false
  become: true
  vars:
    sacct_stdout_expected: |- # based on CI running hpctests as the first job - NB note no trailing newline
      JobID,JobName,State
      2,pingpong.sh,COMPLETED
      3,pingmatrix.sh,COMPLETED
      4,hpl-build-linux64.sh,COMPLETED
      5_0,hpl-solo.sh,COMPLETED
      5_1,hpl-solo.sh,COMPLETED
  tasks:
    - name: Get info for ended jobs
      shell:
        cmd: sacct --format=jobid,jobname,state --allocations --parsable2 --delimiter=, --starttime=now-1days --endtime=now
        # by default start/end time is midnight/now which is not robust
      changed_when: false
      register: sacct
    - name: Check info for ended jobs
      assert:
        that: sacct.stdout == sacct_stdout_expected
        fail_msg: |
          Expected:
          --{{ sacct_stdout_expected }}--
          Got:
          --{{ sacct.stdout }}--
        success_msg: sacct shows hpctests jobs as first and only jobs
10 changes: 8 additions & 2 deletions ansible/filter_plugins/utils.py
@@ -4,6 +4,7 @@
# Apache 2 License

from ansible.errors import AnsibleError, AnsibleFilterError
from ansible.utils.display import Display
from collections import defaultdict
import jinja2
from ansible.module_utils.six import string_types
@@ -36,10 +37,15 @@ def exists(fpath):
class FilterModule(object):
''' Ansible core jinja2 filters '''

def warn(self, message, **kwargs):
Display().warning(message)
return message

def filters(self):
return {
# jinja2 overrides
'readfile': readfile,
'prometheus_node_exporter_targets': prometheus_node_exporter_targets,
'exists': exists
}
'exists': exists,
'warn': self.warn
}
8 changes: 5 additions & 3 deletions ansible/roles/block_devices/README.md
@@ -1,17 +1,19 @@
block_devices
=============

Manage filesystems on block devices, including creating partitions, creating filesystems and mounting filesystems.
Manage filesystems on block devices (such as OpenStack volumes), including creating partitions, creating filesystems and mounting filesystems.

This is a convenience wrapper around the ansible modules:
- community.general.parted
- community.general.filesystem
- ansible.builtin.file
- ansible.posix.mount

It includes logic to handle OpenStack-provided volumes appropriately both for appliance instances and the Packer build VM.
To avoid issues with device names changing after e.g. reboots, devices are identified by serial number and mounted by filesystem UUID.

To avoid issues with device names changing after e.g. reboots, devices are identified by serial number and mounted by filesystem UUID.
**NB:** This role is ignored[^1] during Packer builds as block devices will not be attached to the Packer build VMs. This role is therefore deprecated and it is suggested that `cloud-init` is used instead. See e.g. `environments/skeleton/{{cookiecutter.environment}}/terraform/control.userdata.tpl`.

[^1]: See `environments/common/inventory/group_vars/builder/defaults.yml`
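
As a rough illustration of the `cloud-init` approach suggested above, userdata along the following lines could create a filesystem on an attached volume and mount it as a state directory. This is only a sketch, not the contents of `control.userdata.tpl`: the device path `/dev/vdb`, the label `state` and the mount point `/var/lib/state` are assumptions chosen for illustration.

```yaml
#cloud-config
# Hypothetical sketch: device, label and mount point are assumed values.
fs_setup:
  - label: state            # filesystem label, referenced by the mount below
    filesystem: ext4
    device: /dev/vdb        # assumed device name of the attached volume
    partition: auto         # reuse an existing filesystem with matching label and type

mounts:
  # [device, mount point]; cloud-init fills in defaults for the remaining fstab fields
  - [LABEL=state, /var/lib/state]
```

`partition: auto` is used here because the cloud-init documentation states that `partition: none` always writes a new filesystem, which would be unsuitable for state that has to survive a reimage.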

Requirements
------------
5 changes: 5 additions & 0 deletions ansible/roles/block_devices/tasks/main.yml
@@ -1,3 +1,8 @@
- name: Warn role is deprecated
debug:
msg: "{{ 'Role block_devices is deprecated, see ansible/roles/block_devices/README.md' | warn }}"
when: block_devices_configurations | length > 0

- name: Enumerate block device paths by serial number
block_devices:
register: _block_devices
52 changes: 52 additions & 0 deletions ansible/roles/mysql/README.md
@@ -0,0 +1,52 @@
mysql
=====

Deploy containerised `mysql` server using Podman.


Requirements
------------

None.

Role Variables
--------------

- `mysql_root_password`: Required str. Password to set for `root` mysql user. **NB** This cannot be changed by this role once mysql server has initialised.
- `mysql_tag`: Optional str. Tag for version of `mysql` container image to use. Default `8.0.30`.
- `mysql_systemd_service_enabled`: Optional bool. Whether `mysql` service starts on boot. Default `yes`.
- `mysql_state`: Optional str. As per `ansible.builtin.systemd:state`. Default is `started` or `restarted` as required.
- `mysql_podman_user`: Optional str. User running `podman`. Default `{{ ansible_user }}`.
- `mysql_datadir`: Optional str. Path to data directory on the host to store databases etc. Default `/var/lib/mysql`. Note all path components will be created and user set appropriately if this does not exist.
- `mysql_host`: Optional str. Address of host. Default `{{ inventory_hostname }}`.
- `mysql_users`: Optional list of dicts defining users as per `community.mysql.mysql_user`. Default `[]`.
- `mysql_databases`: Optional list of dicts defining databases as per `community.mysql.mysql_db`. Default `[]`.
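
The variables above might be set in group vars roughly as follows. This is a minimal sketch under stated assumptions: the vault variable names, the database and user names, and the data directory path are all hypothetical. Because the role's tasks pass each list item directly to `community.mysql.mysql_db` and `community.mysql.mysql_user`, the sketch assumes connection parameters such as `login_user` and `login_password` need to be included in each item.

```yaml
# Hypothetical group_vars for hosts in the `mysql` group; all values are illustrative.
mysql_root_password: "{{ vault_mysql_root_password }}"   # assumed vault variable name
mysql_datadir: /var/lib/state/mysql                       # assumed path on persistent storage

mysql_databases:
  - name: slurm_acct_db                                   # assumed database name
    login_user: root
    login_password: "{{ mysql_root_password }}"

mysql_users:
  - name: slurm                                           # assumed user name
    host: "%"
    password: "{{ vault_mysql_slurm_password }}"          # assumed vault variable name
    priv: "slurm_acct_db.*:ALL"
    login_user: root
    login_password: "{{ mysql_root_password }}"
```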

Dependencies
------------

None.

Example Playbook
----------------

```yaml
- name: Setup DB
  hosts: mysql
  become: true
  tags:
    - mysql
  tasks:
    - include_role:
        name: mysql
```

License
-------

Apache v2

Author Information
------------------

Steve Brasier steveb@stackhpc.com
11 changes: 11 additions & 0 deletions ansible/roles/mysql/defaults/main.yml
@@ -0,0 +1,11 @@
# required:
# mysql_root_password: # TODO: make it possible to CHANGE root password

mysql_tag: 8.0.30
mysql_systemd_service_enabled: yes
#mysql_state: # default is started or restarted as required
mysql_podman_user: "{{ ansible_user }}"
mysql_datadir: /var/lib/mysql
mysql_mysqld_options: [] # list of str options to mysqld, see `podman run -it --rm mysql:tag --verbose --help`
mysql_users: [] # list of dicts for community.mysql.mysql_user
mysql_databases: [] # list of dicts for community.mysql.mysql_db
37 changes: 37 additions & 0 deletions ansible/roles/mysql/tasks/configure.yml
@@ -0,0 +1,37 @@
- name: Create environment file for mysql server root password
  # NB: This doesn't trigger a restart on changes as it will be ignored once mysql is initialised
  copy:
    dest: /etc/sysconfig/mysqld
    content: |
      MYSQL_INITIAL_ROOT_PASSWORD='{{ mysql_root_password }}'
    owner: root
    group: root
    mode: u=rw,go=

- name: Ensure mysql service state
  systemd:
    name: mysql
    state: "{{ mysql_state | default('restarted' if _mysql_unitfile.changed else 'started') }}"
    enabled: "{{ mysql_systemd_service_enabled }}"
    daemon_reload: "{{ _mysql_unitfile.changed }}"

- block:
    - name: Wait for mysql to initialise
      # NB: It is not sufficient to wait_for the port
      community.mysql.mysql_info:
        login_user: root
        login_password: "{{ mysql_root_password }}"
      # no_log: true # TODO: FIXME
      register: _mysql_info
      until: "'version' in _mysql_info"
      retries: 60
      delay: 2

    - name: Ensure mysql databases created
      community.mysql.mysql_db: "{{ item }}"
      loop: "{{ mysql_databases }}"

    - name: Ensure mysql users present
      community.mysql.mysql_user: "{{ item }}"
      loop: "{{ mysql_users }}"
  when: "mysql_state | default('unspecified') != 'stopped'"
10 changes: 10 additions & 0 deletions ansible/roles/mysql/tasks/install.yml
@@ -0,0 +1,10 @@
- name: Install python mysql client
  pip:
    name: pymysql
    state: present

- name: Create systemd mysql container unit file
  template:
    dest: /etc/systemd/system/mysql.service
    src: mysql.service.j2
  register: _mysql_unitfile
2 changes: 2 additions & 0 deletions ansible/roles/mysql/tasks/main.yml
@@ -0,0 +1,2 @@
- import_tasks: install.yml
- import_tasks: configure.yml
41 changes: 41 additions & 0 deletions ansible/roles/mysql/templates/mysql.service.j2
@@ -0,0 +1,41 @@
# mysql.service

[Unit]
Description=Podman container mysql.service
Documentation=man:podman-generate-systemd(1)
Wants=network.target
After=network-online.target
RequiresMountsFor={{ mysql_datadir }} /etc/sysconfig/mysqld

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=always
EnvironmentFile=/etc/sysconfig/mysqld
# The above EnvironmentFile must define MYSQL_INITIAL_ROOT_PASSWORD
ExecStartPre=+install -d -o {{ mysql_podman_user }} -g {{ mysql_podman_user }} -Z container_file_t {{ mysql_datadir }}
ExecStart=/usr/bin/podman run \
--network slirp4netns:cidr={{ podman_cidr }} \
--sdnotify=conmon --cgroups=no-conmon \
--detach --replace --name mysql --restart=no \
--user mysql \
--volume {{ mysql_datadir }}:/var/lib/mysql:U \
--publish 3306:3306 \
-e MYSQL_ROOT_PASSWORD=${MYSQL_INITIAL_ROOT_PASSWORD} \
mysql:{{ mysql_tag }}{%- for opt in mysql_mysqld_options %} \
--{{ opt }}{% endfor %}

ExecStop=/usr/bin/podman stop --ignore mysql -t 10
# note for some reason this returns status=143 which makes systemd show the unit as failed, not stopped
ExecStopPost=/usr/bin/podman rm --ignore -f mysql
SuccessExitStatus=143 SIGTERM
KillMode=none
Type=notify
NotifyAccess=all
LimitNOFILE=65536
LimitMEMLOCK=infinity
User={{ mysql_podman_user }}
Group={{ mysql_podman_user }}
TimeoutStartSec=180

[Install]
WantedBy=multi-user.target default.target
1 change: 1 addition & 0 deletions ansible/roles/opendistro/defaults/main.yml
@@ -3,3 +3,4 @@
#opendistro_internal_users_path:

opendistro_podman_user: "{{ ansible_user }}"
opendistro_data_path: "/usr/share/elasticsearch/data" # path to host data directory
16 changes: 15 additions & 1 deletion ansible/roles/opendistro/templates/opendistro.service.j2
@@ -9,7 +9,21 @@ After=network-online.target
[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=always
ExecStart=/usr/bin/podman run --network slirp4netns:cidr={{ podman_cidr }} --sdnotify=conmon --cgroups=no-conmon -d --replace --name opendistro --restart=no --user elasticsearch --ulimit memlock=-1:-1 --ulimit nofile=65536:65536 --volume opendistro:/usr/share/elasticsearch/data --volume /etc/elastic/internal_users.yml:/usr/share/elasticsearch/plugins/opendistro_security/securityconfig/internal_users.yml:ro --env node.name=opendistro --env discovery.type=single-node --env bootstrap.memory_lock=true --env "ES_JAVA_OPTS=-Xms512m -Xmx512m" --publish 9200:9200 amazon/opendistro-for-elasticsearch:1.12.0
ExecStartPre=+install -d -o {{ opendistro_podman_user }} -g {{ opendistro_podman_user }} -Z container_file_t {{ opendistro_data_path }}
ExecStart=/usr/bin/podman run \
--network slirp4netns:cidr={{ podman_cidr }} \
--sdnotify=conmon --cgroups=no-conmon \
--detach --replace --name opendistro --restart=no \
--user elasticsearch \
--ulimit memlock=-1:-1 --ulimit nofile=65536:65536 \
--volume {{ opendistro_data_path }}:/usr/share/elasticsearch/data:U \
--volume /etc/elastic/internal_users.yml:/usr/share/elasticsearch/plugins/opendistro_security/securityconfig/internal_users.yml:ro \
--env node.name=opendistro \
--env discovery.type=single-node \
--env bootstrap.memory_lock=true \
--env "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
--publish 9200:9200 \
amazon/opendistro-for-elasticsearch:1.12.0
ExecStop=/usr/bin/podman stop --ignore opendistro -t 10
# note for some reason this returns status=143 which makes systemd show the unit as failed, not stopped
ExecStopPost=/usr/bin/podman rm --ignore -f opendistro
9 changes: 6 additions & 3 deletions ansible/roles/podman/tasks/config.yml
@@ -13,6 +13,12 @@
dest: /etc/security/limits.d/custom.conf
become: true

- name: Up default keys permitted
ansible.posix.sysctl:
name: kernel.keys.maxkeys # /proc/sys/kernel/keys/maxkeys
value: 50000
become: true

- name: reset ssh connection to allow user changes to affect 'current login user'
meta: reset_connection

@@ -60,9 +66,6 @@
become: yes
register: podman_tmp

- debug:
var: podman_tmp

- name: Reset podman database
# otherwise old config overrides!
command:
19 changes: 19 additions & 0 deletions ansible/roles/systemd/README.md
@@ -0,0 +1,19 @@
# systemd

Create drop-in files for systemd services.

# Role Variables
- `systemd_dropins`: Required. A mapping where keys = systemd service name, values are a dict as follows:
    - `group`: Required str. Inventory group this drop-in applies to.
    - `comment`: Optional str. Comment describing reason for drop-in.
    - `content`: Required str. Content of drop-in file.
- `systemd_restart`: Optional bool. Whether to reload unit definitions and restart services. Default `false`.
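
A value for these variables might look roughly like the following. This is a sketch only: the service name `slurmctld`, the inventory group `control` and the path `/var/lib/state` are hypothetical, chosen to show a drop-in that delays a service until a state directory is mounted.

```yaml
# Hypothetical example: service name, group and path are assumptions.
systemd_dropins:
  slurmctld:                        # key: systemd service name
    group: control                  # inventory group this drop-in applies to
    comment: Wait for the state directory to be mounted before starting
    content: |
      [Unit]
      RequiresMountsFor=/var/lib/state

systemd_restart: false              # do not reload units or restart services
```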