Overview

This covers what is likely a very rare race condition between handing hosts from the M&R phase to the VE phase of a QUADS release when an operator needs to intervene with VSC on the ports.
== Moving hosts == @ Sun Aug 21 22:30:01 UTC 2022
There was something wrong updating switch for e23-h14-b01-fc640.example.com:em3
There was something wrong configuring the switch.
NoneType: None
In the above example, a host hit problems during the M&R call of `cli.py` to apply its switchport settings. An operator intervened and fixed them with `verify_switch_conf.py --host` instead, so `cli.py` never got to set the host-level value `switch_config_applied`. That in turn prevented the cloud-level value `provisioned: true` from being set, and VE therefore refused to process and validate the environment.
Background
When hosts are processed through M&R via `cli.py`, they ultimately get the database flag `switch_config_applied` recorded at the host level once their switch configs are applied (quads/quads/cli/cli.py, line 1502 in 57dab39).
In order for an environment to fully pass the VE phase (`validate_env`) of release, each host needs its `switch_config_applied` set to `true`, along with the other per-host flags that comprise the cloud-level value `provisioned: true` (quads/quads/tools/validate_env.py, line 314 in 57dab39).
It is only the sum of all hosts having their various values set that allows the cloud as a whole to have `provisioned: true` set, which lets VE process it.
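To make that aggregation concrete, here is a minimal sketch of the logic; the flag names mirror this issue, but the `Host` model and `cloud_provisioned()` helper are hypothetical illustrations, not the actual QUADS code:

```python
# Hypothetical, simplified model of the per-host flags that feed the
# cloud-level "provisioned" value; not the actual QUADS schema.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    build: bool = False
    switch_config_applied: bool = False

def cloud_provisioned(hosts):
    """A cloud is 'provisioned' only when every host has all of its
    per-host flags set -- one host with switch_config_applied=False
    holds back the whole cloud."""
    return all(h.build and h.switch_config_applied for h in hosts)

hosts = [
    Host("e23-h14-b01-fc640", build=True, switch_config_applied=True),
    Host("e23-h15-b01-fc640", build=True, switch_config_applied=False),
]
print(cloud_provisioned(hosts))        # False: one host holds it back
hosts[1].switch_config_applied = True
print(cloud_provisioned(hosts))        # True
```

This is why a single host whose flag was set out-of-band (or not at all) silently blocks the entire cloud from promotion.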
The encompassing value for `switch_config_applied` is applied via `cli.py` by M&R (quads/quads/cli/cli.py, line 1502 in 57dab39).
In cases where an operator needs to use `verify_switch_conf.py` to correct a port misconfiguration on an affected host in a pending environment, there appears to be a small race condition: because `cli.py`, on behalf of M&R (`move_and_rebuild_hosts.py`), is not the one to set the flag, VE does not honor the port settings even though they are now correct, and refuses to process the environment.
Workaround
The workaround here is to manually set `"provisioned": "true"` at the cloud level so that VE will process the environment and ultimately release it.
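For illustration, the manual flip might look something like the following; the cloud collection, `name` filter, and `provisioned` field here are assumptions based on the flags named in this issue, not the verified QUADS schema:

```python
# Hypothetical sketch of the manual workaround. The collection,
# filter, and field names are assumptions from this issue, not the
# verified QUADS schema.
def force_provisioned(cloud_name):
    """Build the MongoDB filter/update pair that marks a cloud
    provisioned so VE will agree to process it."""
    return {"name": cloud_name}, {"$set": {"provisioned": True}}

flt, update = force_provisioned("cloud04")
# An operator would then apply it, e.g. with pymongo:
#   db.cloud.update_one(flt, update)
```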
Expected Behavior

I believe M&R has done its job here, and VE should catch this issue and warn/log just like it does with the Foreman and netcat checks.
Instead, in some rare cases like this one, VE will silently refuse to process hosts that still lack `provisioned: true` (because a sub-flag was never toggled, in this case `switch_config_applied: false`). VE should be more verbose via the logger and tell us which host(s) and sub-flags are not compliant for promotion to the VE phase of validation, as it does in other scenarios.
In the case of `switch_config_applied`, we might instead have VE make QUADS call `verify_switch_conf` to verify that no changes are pending for that host, and then toggle the flag for us after warning, via `host_obj.update(switch_config_applied=True)`.
We have precedent for VE taking corrective action, as in the case of programmatic system resets for example, and it seems the right place for this to occur. M&R is already very complex and doesn't seem like the right place for corrective action outside its preparation methods. Also, the M&R workflow is quite explicit and doesn't keep `provisioned: true` from being set unless it really is set, as it is able to see (quads/quads/cli/cli.py, line 1507 in 57dab39).
In other cases, for other sub-flags under the umbrella of `provisioned: true`, we will just want to warn, since those depend on third-party or external systems like Foreman and may need operator intervention.

e.g. `"build": "true"` should warn only (I believe it already does).
Proposed Fix
M&R = the cruncher, the muncher of hosts that beats them into shape; it's the Bill & Ted phone booth.
VE = the validator, the checker of how well the hosts as a whole are molded to our standards and whether they meet acceptance.
I think we overall want better logging for `validate_env.py`; even `--debug` doesn't tell us when it sees some prerequisites not being met (some but not all). VE should also try to set per-host and per-cloud values back to the correct setting if they are already in their desired state.
Make VE check/set switch_config_applied
The proposed fix is that, where we can, VE checks behind M&R to see whether the reason one of the sub-flags of `provisioned: true` was not met (and therefore `provisioned: false` is set at the host level) is still valid. Let's consider calling VSC as a library with the equivalent of `--check`; if no changes would be made, VE sets `switch_config_applied: true`.
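Sketched as a helper inside VE (the `verify` callable stands in for an imagined check-only entry point into `verify_switch_conf.py`, and the `Host` class is a stand-in; the real module's API may differ):

```python
# Hypothetical sketch of VE reconciling the flag via a VSC check-mode
# call; verify() and Host are stand-ins, not the real QUADS objects.
class Host:
    def __init__(self, name, switch_config_applied=False):
        self.name = name
        self.switch_config_applied = switch_config_applied

def reconcile_switch_flag(host, verify):
    """If VSC reports no pending port changes, the switchports already
    match the desired state, so VE can warn and set the flag itself."""
    pending = verify(host.name, check_only=True)
    if not pending and not host.switch_config_applied:
        print(f"WARNING: {host.name}: ports already correct, "
              "setting switch_config_applied=True behind M&R")
        host.switch_config_applied = True
    return host.switch_config_applied

host = Host("e23-h14-b01-fc640.example.com")
reconcile_switch_flag(host, lambda name, check_only: [])  # no pending changes
print(host.switch_config_applied)  # True
```

If VSC does report pending changes, the flag stays false and VE warns as it does today, so the corrective path only fires when the switch state is already correct.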
Make VSC set switch_config_applied
In rare corner cases like this one we should consider having VSC set `switch_config_applied: true` at the host level when it makes a change. I think we should also entertain having it simply apply this value when the desired state matches the current state but the flags are not updated. It should not do this if the host is not part of an active cloud.
sadsfae changed the title to "Edge case with switch_config_applied and M&R provisioned: true/false and validation" on Aug 22, 2022.