Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update state transition for stopped workers #234

Merged
merged 1 commit into from
Mar 14, 2022

Conversation

giffels
Copy link
Member

@giffels giffels commented Mar 9, 2022

During the integration of OpenStack cloud resources for PUNCH4NFDI, I noticed an unexpected behaviour for Drones in AvailableState. In this state the status of the resource (OpenStack VM) and the machine status (availability in HTCondor) is checked. Both status are used to decided on a possible state transition.

Due to missing payloads, the HTCondor daemon on this node shutdown automatically causing the machine status to be NotAvailable. While the resource status continues to be Running. In that case the drone state is reset to IntegratingState. Since HTCondor is not restarted, the Drone remains in this state forever.

I would like to discuss the following change in this pull request. In case of the behaviour descriibed above, I would like to move the Drone into ShutDown state causing the VM to be stopped in OpenStack. What do you think?

@giffels giffels requested review from a team, maxfischer2781 and RHofsaess and removed request for a team March 9, 2022 10:31
@giffels giffels added the bug Something isn't working label Mar 9, 2022
@maxfischer2781
Copy link
Member

I would like to ponder this a bit, but it seems sensible to me. Overall our model is that individual drones are expendable, so being safe and discarding one seems preferable over trying to squeeze it back to live at all cost.

@codecov-commenter
Copy link

Codecov Report

Merging #234 (b3c8181) into master (a83ba0a) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #234   +/-   ##
=======================================
  Coverage   99.34%   99.34%           
=======================================
  Files          54       54           
  Lines        2139     2139           
=======================================
  Hits         2125     2125           
  Misses         14       14           
Impacted Files Coverage Δ
tardis/resources/dronestates.py 99.32% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a83ba0a...b3c8181. Read the comment docs.

@giffels
Copy link
Member Author

giffels commented Mar 9, 2022

I would like to ponder this a bit, but it seems sensible to me. Overall our model is that individual drones are expendable, so being safe and discarding one seems preferable over trying to squeeze it back to live at all cost.

Sure, that was the idea of this pull request. ;-)

@mschnepf
Copy link
Member

mschnepf commented Mar 9, 2022

I think this change is fine. Since the drone has to be in ResourceStatus.Running state, the OBS WN instance inside the drone had been started.
Since that, TARDIS can distinguish between a drone started and the OBS WN instance is not yet started and the OBS WN instance already shut down or maybe crashed.
Another option to avoid such behavior would be to configure an auto shut down for the drone. Thereby, the time before the drone shutdown itself should be at least two times the update time of the OBS adapter.
But I think your change in the drone state machine is better than "tuning" of settings.

@giffels
Copy link
Member Author

giffels commented Mar 9, 2022

Another option to avoid such behavior would be to configure an auto shut down for the drone. Thereby, the time before the drone shutdown itself should be at least two times the update time of the OBS adapter.

That is already in place as a work around. However, others could run into this problem, too. So, I think we should fix it.

Copy link
Member

@maxfischer2781 maxfischer2781 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your request has been pondered and deemed to be... worthy. Merge it so!

Copy link

@RHofsaess RHofsaess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@giffels giffels merged commit f89d96a into MatterMiners:master Mar 14, 2022
@giffels giffels deleted the update-state-transition branch March 14, 2022 10:58
giffels added a commit to giffels/tardis that referenced this pull request Apr 19, 2022
giffels added a commit to giffels/tardis that referenced this pull request Jan 20, 2023
giffels added a commit to giffels/tardis that referenced this pull request Feb 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants