Support moving state to persistent storage #173
AlaSKA disk usage is as follows:
Total (after accounting for $HOME problem): ~5GB
Last jobid is 500.
Note: before this commit the podman user was created on ALL nodes; it is now created only on nodes in the "podman" group.
Using this on a deployment, the recreation of a cluster fails with
Despite the CI there also appears to be a problem with 9893c35, where the firewalld install task fails with a directory error on
9893c35 also seems problematic in a deployment with this:
Problem is:
with
conflicts with
environments/skeleton/{{cookiecutter.environment}}/terraform/control.userdata.tpl
/dev/vdb:
  table_type: gpt
  layout: true
/dev/vdc:
How can we control for device names here? /dev/vda vs /dev/sda etc - these are properties of the SCSI device I think...
You can't really. It is supposed to depend on the flags set on the image, so setting --property hw_scsi_model=virtio-scsi should, I think, give you sd*, although in my test it didn't. For the use of this TF I think it's ok: in CI we control both the device name here and the image tags. If you're using this TF in your own deployment you may need to change it, but that goes for anything.
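One way to limit how many places depend on the device name, sketched here as an assumption and not necessarily what this PR does: reference the device only when the filesystem is created, and mount by filesystem label after that. The labels state and home, the ext4 filesystem, the disk_setup values for /dev/vdc, and which volume ends up as /dev/vdb or /dev/vdc are all illustrative. The behaviour of partition: auto should be checked against the cloud-init docs before relying on it.

```yaml
#cloud-config
# Sketch only: labels, filesystem type and the device-to-volume mapping are assumptions.
disk_setup:
  /dev/vdb:
    table_type: gpt
    layout: true
  /dev/vdc:
    table_type: gpt
    layout: true

fs_setup:
  # The device name is still needed here, once, to create the labelled filesystem.
  - label: state
    filesystem: ext4
    device: /dev/vdb
    partition: auto
  - label: home
    filesystem: ext4
    device: /dev/vdc
    partition: auto

mounts:
  # Mounts refer to labels, so they do not change if the device shows up as sd* instead of vd*.
  - [LABEL=state, /var/lib/state]
  - [LABEL=home, /exports/home]
```

This does not remove the device-name dependency entirely; it only confines it to disk_setup and fs_setup, so a deployment with different device names has fewer entries to change.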
TODO: do I need to delay nfs server start till after /home mounted from /etc/exports?
This might be useful: https://www.freedesktop.org/software/systemd/man/systemd.mount.html#x-systemd.wanted-by=
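On the TODO above, a minimal sketch of how the options from the linked systemd.mount page might be used in a cloud-init mounts entry for /exports/home. It assumes the NFS server unit is named nfs-server.service, that the filesystem carries the label home, and that the installed systemd is new enough to support x-systemd.required-by; none of this has been tested against this PR.

```yaml
#cloud-config
# Sketch only: unit name and label are assumptions.
mounts:
  # Fields: device, mount point, filesystem type, mount options.
  # x-systemd.required-by makes nfs-server.service require this mount unit;
  # x-systemd.before orders the mount before nfs-server.service starts.
  - [LABEL=home, /exports/home, auto, "defaults,x-systemd.required-by=nfs-server.service,x-systemd.before=nfs-server.service"]
```

The requirement and the ordering are set separately because, in systemd, a requirement dependency on its own does not imply start order.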
I think this looks okay once that UUID/LABEL docs change is complete
appliances_state_dir to put control node state onto persistent storage.
Ticket: https://stackhpc.atlassian.net/browse/DEV-834
In the skeleton TF (as used for arcus):
- home and state are created and attached to the control node
- /var/lib/state and /exports/home on the control node
- appliances_state_dir for the control group as /var/lib/state. NB this must be defined on the group, not the host, so that the Packer builds for control images also get this set.
Currently the persistent state on the slurm control node covers:
It also adds documentation for this feature, and modifies the block_devices docs to explain why this isn't the right way to use volumes.
CI (arcus environment)
This now:
- slurm.yml playbook after the reimage to regenerate partition information (this is excluded from the control image build info)
- hpctests runs is still present after the reimage:
  a) from sacct, which checks mysql state has persisted and slurmdbd restarted
  b) from opendistro proxied through grafana, which checks the opendistro state has been persisted and the datasource works
Manual checks
I have (manually) checked that after a reimage as above:
- hpctests works
Caveats
- ansible/slurm.yml must be rerun after reimaging the control node to redefine partition information
- /exports/home exists, which assumes the default TF (or similar) is used.
- mysql role can't change the mysql root password after initialisation.
Requires/TODOs: