Edge case with switch_config_applied and M&R provisioned: true/false and validation #414

Closed
sadsfae opened this issue Aug 22, 2022 · 0 comments

sadsfae commented Aug 22, 2022

Overview

This covers what is likely a very rare race condition that occurs when hosts are handed from the M&R (move_and_rebuild_hosts.py) phase to the VE (validate_env) phase of a QUADS release and an operator has to intervene on the switch ports with VSC (verify_switch_conf.py).

== Moving hosts == @ Sun Aug 21 22:30:01 UTC 2022
There was something wrong updating switch for e23-h14-b01-fc640.example.com:em3
There was something wrong configuring the switch.
NoneType: None

In the above example, a host hit a problem when the M&R call to cli.py tried to apply its switchport settings. An operator intervened and fixed the ports with verify_switch_conf.py --host instead, so cli.py never got to set the host-level value switch_config_applied. As a result the cloud-level value provisioned: true was never set, and VE refused to process and validate the environment.

Background

When hosts are processed through M&R via cli.py, the database flag switch_config_applied is ultimately recorded at the host level if their switch configs are applied.

host_obj.update(switch_config_applied=True)

In order for an environment to fully pass the VE phase (validate_env) of release, each host needs to have switch_config_applied set to true, along with the other per-host flags that together make up the cloud-level value provisioned: true. VE only selects clouds that match:

clouds = Cloud.objects(validated=False, provisioned=True, name__ne="cloud01")

Only when every host has all of its values set can the cloud as a whole have provisioned: true set, which is what lets VE process it.
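The relationship between per-host flags and the cloud-level flag can be sketched roughly as below. This is an illustration only, not the QUADS implementation: the `Host` dataclass and `cloud_is_provisioned` are hypothetical stand-ins for the real database documents, and the assumption that `build` must be false before a host counts as ready is mine.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    """Hypothetical in-memory stand-in for a host document."""
    name: str
    switch_config_applied: bool = False
    build: bool = True  # assumption: True means the host still awaits a build


def cloud_is_provisioned(hosts: List[Host]) -> bool:
    """Return True only if every host has all of its sub-flags in the ready state."""
    if not hosts:
        return False  # an empty cloud has nothing to validate
    return all(h.switch_config_applied and not h.build for h in hosts)


hosts = [
    Host("host-a", switch_config_applied=True, build=False),
    Host("host-b", switch_config_applied=False, build=False),
]
print(cloud_is_provisioned(hosts))
```

With one host still carrying switch_config_applied: false, the whole cloud stays unprovisioned, which is the situation described in this issue.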

The encompassing value for switch_config_applied is applied via cli.py by M&R here:

host_obj.update(switch_config_applied=True)

In cases where an operator needs to use verify_switch_conf.py to correct a port misconfiguration on an affected host in a pending environment, there seems to be a small race condition. Because cli.py, acting on behalf of M&R (move_and_rebuild_hosts.py), isn't the one to set this flag, VE does not honor the fix even though the port settings are now correct, and it refuses to process the environment.

Workaround

The workaround here is to manually set "provisioned" : "true" at the cloud level to get VE to process it and ultimately release it.

db.cloud.update({name:"cloud05"}, {$set:{provisioned:true}})

Expected Behavior

I believe M&R has done its job here and VE should catch this issue and warn/log just like it does with Foreman and netcat checks.

Instead, in some rare cases like this one, VE silently refuses to process hosts that still do not have provisioned: true (because a sub-flag was not toggled, in this case switch_config_applied: false). It should be more verbose via logger and tell us which host(s) and sub-flags are not compliant for promotion to the VE phase of validation, as it does in other scenarios.
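A minimal sketch of the kind of logging meant here, assuming a simple table of required sub-flags. `REQUIRED_FLAGS`, `unmet_flags`, `warn_unmet` and the `Host` dataclass are hypothetical names for illustration and do not exist in validate_env.py; the flag list is also an assumption.

```python
import logging
from dataclasses import dataclass
from typing import Dict, List

logger = logging.getLogger("validate_env_sketch")

# Hypothetical table: sub-flag name -> value it must hold before promotion.
REQUIRED_FLAGS = {"switch_config_applied": True, "build": False}


@dataclass
class Host:
    """Hypothetical in-memory stand-in for a host document."""
    name: str
    switch_config_applied: bool = False
    build: bool = True


def unmet_flags(hosts: List[Host]) -> Dict[str, List[str]]:
    """Map each non-compliant host name to the sub-flags it has not met."""
    report: Dict[str, List[str]] = {}
    for host in hosts:
        missing = [
            flag for flag, wanted in REQUIRED_FLAGS.items()
            if getattr(host, flag) != wanted
        ]
        if missing:
            report[host.name] = missing
    return report


def warn_unmet(cloud_name: str, hosts: List[Host]) -> Dict[str, List[str]]:
    """Log one warning per non-compliant host instead of skipping silently."""
    report = unmet_flags(hosts)
    for host_name, flags in report.items():
        logger.warning(
            "%s: %s is not ready for validation, unmet sub-flags: %s",
            cloud_name, host_name, ", ".join(flags),
        )
    return report
```

Returning the report as well as logging it would let a caller decide whether to take corrective action or only warn.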

In the case of switch_config_applied, we might instead have VE call verify_switch_conf to confirm that no changes are pending for that host and then, after warning, toggle this flag for us via host_obj.update(switch_config_applied=True)

We have precedent here for VE taking corrective action, as in the case of pragmatic system resets for example, and it seems the right place for this to occur. M&R is already very complex and doesn't seem like the right place for corrective action outside its preparation methods. Also, the M&R workflow is pretty explicit and only withholds provisioned: true when the environment really isn't provisioned (as far as it is able to see)

provisioned = False

In other cases, for other sub-flags under the umbrella of provisioned: true, we will just want to warn, since those depend on third-party or external systems such as Foreman and may need operator intervention.

e.g. "build" : "true" being set should warn only (I believe it does anyway)

Proposed Fix

M&R = the cruncher, the muncher of hosts that beats them into shape; it's the Bill & Ted phone booth.
VE = the validator, the checker of how well molded the hosts as a whole are to our standards and whether they meet acceptance.

I think we overall want better logging for validate_env.py; even --debug doesn't tell us when it sees some (but not all) prerequisites not being met. VE should also try to set per-host and per-cloud values back to the correct setting if the hosts are already in their desired state.

Make VE check/set switch_config_applied

The proposed fix here is that, in cases where we can, VE checks behind M&R to see whether the reason one of the sub-flags of provisioned: true was not met at the host level (and provisioned: false was therefore set) is still valid.

Let's consider calling VSC as a library with the equivalent of --check, and if no changes would be made then have VE set switch_config_applied: true
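A rough sketch of that flow, under stated assumptions: `changes_pending` is a hypothetical callable standing in for a library-style VSC check (no such function is confirmed to exist in QUADS), and the `Host` dataclass only mimics host_obj.update(...) in memory.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("validate_env_sketch")


@dataclass
class Host:
    """Hypothetical in-memory stand-in for a host document."""
    name: str
    switch_config_applied: bool = False

    def update(self, **fields) -> None:
        # Mimics host_obj.update(...) without touching a database.
        for key, value in fields.items():
            setattr(self, key, value)


def reconcile_switch_flag(host: Host, changes_pending: Callable[[str], bool]) -> bool:
    """Return True if the host ends up with switch_config_applied set.

    changes_pending(hostname) is an assumed check-only call that returns
    True when the switch ports still differ from the desired state.
    """
    if host.switch_config_applied:
        return True
    if changes_pending(host.name):
        logger.warning("%s: switch changes still pending, leaving flag unset", host.name)
        return False
    logger.warning("%s: ports already correct, setting switch_config_applied", host.name)
    host.update(switch_config_applied=True)
    return True
```

Passing the check in as a callable keeps VE decoupled from how VSC talks to the switch, and makes the corrective path easy to exercise without hardware.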

Make VSC set switch_config_applied

In rare corner cases like this we should consider having VSC set switch_config_applied: true at the host level if it makes a change. I think we should also entertain having it simply apply this value if the current state already matches the desired state but the flags are not updated. It should not do this if the host is not part of an active cloud.
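The decision rule for VSC could look something like the sketch below. Everything here is hypothetical: `should_set_flag`, the `Host` dataclass, and the idea of passing in a set of active cloud names are illustrative assumptions, since how QUADS determines an active cloud is not specified in this issue.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Host:
    """Hypothetical in-memory stand-in for a host document."""
    name: str
    cloud: Optional[str] = None
    switch_config_applied: bool = False


def should_set_flag(
    host: Host,
    active_clouds: Set[str],
    change_made: bool,
    state_matches: bool,
) -> bool:
    """Decide whether VSC should set switch_config_applied for this host."""
    if host.cloud not in active_clouds:
        return False  # never touch hosts outside an active cloud
    if host.switch_config_applied:
        return False  # flag already set, nothing to do
    # Set the flag if VSC just fixed the ports, or if they were already correct.
    return change_made or state_matches
```

A host in cloud05 whose ports were just corrected would qualify, while the same host sitting in a cloud that is not active would be left alone.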

@sadsfae sadsfae changed the title [BUG] Edge case with switch_config_applied and M&R provisioned: true/false and validation Edge case with switch_config_applied and M&R provisioned: true/false and validation Aug 22, 2022