Problems around updating Salt Minion (bundled) with Salt
Updating Salt with Salt is problematic because the update replaces
Python code and salt-minion needs to be restarted to be able to use
the updated code. One part that is contributing to this problem is the
Salt loader, which caches Python function objects. After Python
functions are changed by an update, the cache can be outdated and
produce stack traces. A similar problem is that parts of the Salt code
base are loaded into memory while other parts might not be in memory yet.
The on-disk code is getting updated while the in-memory code is not.
When the update changes the internal API (e.g. changes to function
definitions), the in-memory code might make incorrect calls, again
leading to stack traces.
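The stale-cache effect can be reproduced without Salt. The following is a minimal sketch, not the Salt loader itself: `load_functions` and the file name `example_mod.py` are hypothetical stand-ins that only mimic the "load once, keep the function objects" behaviour described above.

```python
import tempfile
from pathlib import Path


def load_functions(path):
    """Hypothetical stand-in for a loader: run a module file, return its functions."""
    namespace = {}
    exec(compile(Path(path).read_text(), str(path), "exec"), namespace)
    return {
        name: obj
        for name, obj in namespace.items()
        if callable(obj) and not name.startswith("_")
    }


with tempfile.TemporaryDirectory() as tmp:
    module = Path(tmp) / "example_mod.py"

    # "Installed" version of the code, loaded once and cached in memory.
    module.write_text("def version():\n    return 'old'\n")
    cache = load_functions(module)

    # The "update" replaces the file on disk; the in-memory cache is untouched.
    module.write_text("def version():\n    return 'new'\n")

    stale = cache["version"]()                   # still the old function object
    fresh = load_functions(module)["version"]()  # what a restarted process would load

print(stale, fresh)  # prints "old new"
```

The running process keeps answering with the old code until something reloads it, which in the case of salt-minion means a restart.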
With traditional Salt packages there is an additional cause for Salt
loader problems: they can also happen when dependencies are updated,
including Python itself. Vendoring all used dependencies in the bundle
solves this specific problem, but keeps the general problem of updating
salt-minion without restarting it.
Now that we have established that a salt-minion restart is needed, we
run into the next problem: salt-minion can lose Salt jobs when it is
getting restarted. salt-minion does not implement persistent Salt
jobs. In other words, it loses information when it is restarted. The
consequence is that the restart must be timed in a way that it happens
after the current job is done.
If we add these problems together we see that we need to restart
salt-minion as soon as possible after updating the code and after
everything in the Salt job is done. One way to work with this is to
always do the salt-minion update last. That way it's possible to
systemctl restart salt-minion right after the update without losing
other states, as those are already finished.
This solution is already implemented in different parts of Uyuni, but it has a big issue: the implementation is done in Salt states. Users might bring their own states that cause the described problem.
We're taking care of installing our Salt minion package with
Uyuni-provided states that contain order: last. Installing a Salt
minion only happens during bootstrap over Salt SSH and does not suffer
from the described problem.
Updates to salt-minion happen in a different state
(id: mgr_update_stack_patches) than regular updates, but in the same
state file (patchinstall.sls). While the installation is done before
other installations, this is normally not a source of problems. rpm
triggers a systemctl restart salt-minion in the package's %post
scriptlet, causing systemd to send SIGTERM to salt-minion. This signal
is caught and postponed; the restart happens once the state execution
is done. (I am not sure when exactly the restart happens, i.e. if it is
after the completion of this sls file or after executing other states
as well.)
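The "caught and postponed" behaviour can be illustrated with a plain signal handler. This is a minimal sketch under stated assumptions, not the code salt-minion uses: `run_states`, `fake_package_update` and the state names are hypothetical, and the example assumes a POSIX system and execution in the main thread.

```python
import os
import signal

restart_requested = False


def _defer_sigterm(signum, frame):
    """Remember the request instead of exiting in the middle of a job."""
    global restart_requested
    restart_requested = True


def run_states(states):
    """Run all states; return the finished ones and whether to exit afterwards."""
    finished = []
    for name, func in states:
        func()
        finished.append(name)
    return finished, restart_requested


def fake_package_update():
    # Stand-in for the %post scriptlet: SIGTERM arrives while a state is running.
    os.kill(os.getpid(), signal.SIGTERM)


previous = signal.signal(signal.SIGTERM, _defer_sigterm)
try:
    finished, should_exit = run_states([
        ("update_salt", fake_package_update),
        ("other_state", lambda: None),
    ])
finally:
    signal.signal(signal.SIGTERM, previous)  # restore the original handler
```

In this sketch the process only acts on the request after the loop has finished, so `other_state` still runs even though the signal arrived during `update_salt`. Where exactly salt-minion draws that line is the open question noted above.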
Product/SP Migration calls zypper dup under the hood. In this case a
salt-minion package up-/downgrade is not done separately. In practice,
the spmigration state works without a problem, but there can be a race
condition where Uyuni sends a follow-up job too quickly while
salt-minion is being restarted. This was fixed in
uyuni-project/uyuni#3937.
We don't currently do anything about this use-case. A user might visit
"System -> States -> Packages" and set salt-minion to
"Installed/Latest". Doing this includes package updates in the
system's highstate whenever they become available.
Implementation-wise the "Installed/Latest" setting is translated into a
pkg.latest state without an explicit ordering. Multiple packages, if
configured as "Installed/Latest", are added to the pkgs list of the
same pkg.latest state, but that is not the problem. The problem here
is that the state execution happens at an unknown time, likely somewhere
in the middle of all states that are part of the highstate.
This solution is implemented inside the bundle, which also works with a
state that installs the venv-salt-minion update first.
How could this work? The venv-salt-minion update does not replace any
file, it only adds a new version of the bundle in its own location. The
old salt-minion process sticks around and uses its share of files,
even after the update. Then, when salt-minion is idle, the restart
happens and the new process uses the newly installed bundle.
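A rough sketch of the side-by-side idea follows. The directory layout (one directory per bundle version under a common base) and the version numbers are assumptions made only for this illustration; the text does not specify how co-installed bundles would be laid out on disk.

```python
import tempfile
from pathlib import Path


def newest_bundle(base):
    """Return the installed bundle directory with the highest version number."""
    def version_key(path):
        # Compare versions numerically so that 3006.10 sorts after 3006.9.
        return tuple(int(part) for part in path.name.split("."))
    return max((p for p in Path(base).iterdir() if p.is_dir()), key=version_key)


with tempfile.TemporaryDirectory() as base:
    (Path(base) / "3006.0").mkdir()
    running = newest_bundle(base)        # resolved once, at process start

    (Path(base) / "3006.10").mkdir()     # the update adds a directory, replaces nothing
    still_running = running              # the old process keeps using its own files
    after_restart = newest_bundle(base)  # a restarted process picks up the new bundle
    names = (still_running.name, after_restart.name)
```

Because the path is resolved only at start-up, installing the new version has no effect on the running process; the switch happens exactly when the process is restarted.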
This solution requires the ability to install multiple versions of
venv-salt-minion at a given point in time. It is only possible in
SUSE-family distributions to do that; RHEL-family and Debian-family
distributions don't support something like
Provides: multiversion(venv-salt-bundle).
A potential workaround that I haven't researched further is to not use native packages and instead come up with another deployment strategy. Something like creating a tarball on the Uyuni server, copying it to the client and extracting it.
Triggering the restart at the correct moment needs knowledge about the
current workload. The best component to know what is going on is
salt-minion itself. There might be a way to monitor if all received
jobs are done, I haven't looked at this closely.
The jobs are executed by worker threads/subprocesses which send the
results back and the main salt-minion process might be oblivious to
the job status. But since there are utility functions that e.g. cause
salt-minion to terminate all jobs, some introspection capabilities to
hook into are probably available.
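One possible hook, offered as an unverified assumption rather than a researched answer: salt-minion keeps one file per running job in a proc directory below its cache directory, which is what functions such as saltutil.running appear to inspect. An idle check built on that could look like the sketch below; `minion_is_idle`, the path and the job id are hypothetical.

```python
import tempfile
from pathlib import Path


def minion_is_idle(proc_dir):
    """Treat the minion as idle when no job file is left in proc_dir."""
    proc_dir = Path(proc_dir)
    if not proc_dir.is_dir():
        return True
    return not any(entry.is_file() for entry in proc_dir.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    proc = Path(tmp) / "proc"
    proc.mkdir()
    idle_before = minion_is_idle(proc)

    # Hypothetical job id; a real minion would write serialized job data here.
    (proc / "20240101000000000000").write_text("job data")
    idle_during = minion_is_idle(proc)
```

A real implementation would also have to deal with stale files left behind by crashed jobs, for example by checking whether the recorded process still exists.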
salt-minion already communicates with systemd via the sd_notify
protocol. WATCHDOG=trigger can be sent from the process to systemd to
trigger the systemd watchdog. Our systemd service definition includes
Restart=on-failure, which also covers a "watchdog timeout". Using the
mentioned WATCHDOG=trigger is equivalent to a watchdog timeout.
This solution depends first and foremost on the availability of
co-installable venv-salt-minion packages. Without being able to
install two versions at once, we can't keep the old version when we
update to the new one.
Only SUSE-family distributions support this package management feature, which is not enough for Uyuni. We would need to bypass the native package management of our clients to co-install different versions of the Salt bundle.
The idea is to enable salt-minion to save its current state, restart
and resume. This would be a change to the very core of Salt, which is
not easy from either a technological or a political point of view. Such
changes must be discussed with upstream and these discussions take a lot
of time.
I haven't spent much time on this idea in the scope of researching solutions, but it might be the best technical solution.
Uyuni has implemented Action Chains using the same idea, but with different mechanisms. We could revisit and probably simplify Action Chains if persistent states work.