Drones sometimes removed after payload has started running (race issue) #172
Hi @olifre, thank you for this detailed report. We have thought about a potential solution, or at least a way to keep the impact as small as possible. We would propose the following: instead of removing drones in … In addition, we would like to implement a new feature in …

Short form: …
Pushing the discussion here instead of MatterMiners/cobald#89 since we're closer to the actual use case here. We seem to have two opposing age-based decisions to make: …

Either one alone seems straightforward, but both at once is tricky. In the case we are facing here, I think gunning for the oldest drones would actually work out, since it means we just never kill a booting drone. Would that work for other cases?
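For illustration, a minimal sketch of such an oldest-first selection, assuming each drone exposes a creation timestamp (the `DroneInfo` structure and the `drones_to_release` helper are hypothetical, not TARDIS or COBalD API):

```python
# Hypothetical sketch of an "oldest drones first" selection policy.
from datetime import datetime
from typing import List, NamedTuple


class DroneInfo(NamedTuple):
    name: str
    created: datetime  # when the drone was requested


def drones_to_release(drones: List[DroneInfo], excess: int) -> List[DroneInfo]:
    """Pick the `excess` oldest drones as shutdown candidates.

    Booting drones are by construction the youngest, so they are never
    selected as long as enough established drones exist.
    """
    return sorted(drones, key=lambda drone: drone.created)[:excess]
```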
Hi together,

Of course, @maxfischer2781 is right that for established drones, disabling older drones first seems the best age-based decision. This approach might indeed also work for new drones; at least off the top of my head, I don't see why it should not.

I have a third idea which might be interesting (it came to me only now after reading @giffels' nice explanation). However, I'm not sure if it matches the general structure well enough to not cause a mess: … Now the interesting question is what I have in mind for the deactivation. In terms of commands, I'd think about the following: …

So "deactivation" would be an inhibitor for the payload, and would also not be harmful if the job actually is already running for some reason. The problem I see here is which property to change to get a portable solution (and keep the payload of the drone generic). Also, it seems to break with the idea of the state changer doing "one thing" in each state change (unless another state is introduced only for this). So I'm not sure if this idea is a good one, I just wanted to throw it in here ;-).
Hi @olifre, thanks for sharing your input. I have already thought about it the other way around: putting something in place that avoids starting new payloads on the drone until it has been "integrated" in the OBS. We could probably use the …

Later on we are going to set the value of …
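A rough sketch of that two-phase idea (all names below are hypothetical placeholders, not the actual mechanism referred to above): the drone boots with payloads disabled, and they are only enabled once the drone shows up in the OBS.

```python
# Hypothetical sketch: keep a freshly booted drone from accepting payloads
# until it is visible in the OBS, then enable it.
import asyncio

POLL_INTERVAL = 10  # seconds; placeholder value


async def registered_in_obs(drone_name: str) -> bool:
    """Placeholder: query the OBS (e.g. the HTCondor pool) for the drone."""
    raise NotImplementedError


async def enable_payload(drone_name: str) -> None:
    """Placeholder: flip whatever switch keeps the drone from taking jobs."""
    raise NotImplementedError


async def integrate_drone(drone_name: str) -> None:
    # Phase 1: the drone is booting and not yet allowed to accept payloads.
    while not await registered_in_obs(drone_name):
        await asyncio.sleep(POLL_INTERVAL)
    # Phase 2: the drone is known to the OBS, so payloads may start.
    await enable_payload(drone_name)
```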
Hi @giffels, thinking about this the other way around in fact seems more reasonable than my idea, especially since it eases the abstraction: you don't need to talk to the LBS. Of course, that "centralizes" the scaling issue; my expectation would be (without elaborate research) that …

Thinking about "drone-readiness", that also means that another workaround without a scaling problem (but costing a bit of efficiency) could be to modify the … This should prevent the race since a drone cannot accept jobs immediately, so while we could still …
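For example, a sketch under the assumption of an HTCondor LBS (the config path and the two-minute delay are placeholders): the drone's bootstrap could install a time-based guard so that its startd does not match any payload right after start-up.

```python
# Sketch only: write a drop-in HTCondor configuration file on the drone that
# delays matchmaking until 120 s after the startd was started.
DELAYED_START = (
    "# Do not accept payloads during the first two minutes after start-up.\n"
    "START = (time() - DaemonStartTime) > 120\n"
)

with open("/etc/condor/config.d/99-delayed-start.conf", "w") as config_file:
    config_file.write(DELAYED_START)
```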
I think delaying the initial start would just be pushing out the race condition, not fixing it. There's still an inherent non-zero delay for any kind of information in the Condor information system (and any other similar system). Unless we put all TARDIS delays well above these inherent delays, we can always run into such issues; and if we do make the delays that large, we put a significant bound on what the system can do. Going the route of graceful shutdown, e.g. via …
@maxfischer2781 That's a good point, indeed those two minutes might not be sufficient at all, given that the collector also caches state, and going to more macroscopic numbers will cause noticeable issues on other ends, as you pointed out. Then I'm out of alternative useful ideas and am all for approaching an age-based gunning to reduce the racing issue (let's hope nobody takes this statement out of context 😉). I also believe that gunning for the oldest drones may work out best (but probably only trying and observing will really tell).
I would propose that we implement the age-based releasing of drones and the changed state transition, and that you, @olifre, give it a try.

Does that sound reasonable?
As for the second point, after some pondering I would go for deriving a `FactoryPool` specific to TARDIS drones (and putting it in the TARDIS package). That would give us some insight into drone lifetime and allow for use-case-specific tweaks. Any objections?
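As a rough illustration of the kind of insight such a derived pool could provide (the class and method names below are made up, not the COBalD `FactoryPool` interface): record when each drone was spawned and use its age for monitoring and release decisions.

```python
# Illustrative sketch only: wrap a drone factory and remember when each drone
# was created, so release decisions and monitoring can use the drone's age.
import time
from typing import Any, Callable, Dict, Iterable, List


class LifetimeTrackingFactory:
    def __init__(self, factory: Callable[[], Any]):
        self._factory = factory
        self._created: Dict[int, float] = {}

    def spawn(self) -> Any:
        drone = self._factory()
        self._created[id(drone)] = time.time()
        return drone

    def age(self, drone: Any) -> float:
        """Seconds since this factory spawned the given drone."""
        return time.time() - self._created[id(drone)]

    def oldest_first(self, drones: Iterable[Any]) -> List[Any]:
        """Order drones oldest first, e.g. for age-based release."""
        return sorted(drones, key=self.age, reverse=True)
```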
No objections, sounds like a solid approach. Go ahead!
That sounds great, we'll be happy to test :-).
I'd like to motivate a more generic discussion about the concept that is currently implemented in TARDIS. I think the hot fix is a good start to mitigate the issue in the short term, but the underlying issue is still not considered and can even get worse in the future (patch on top of patch on top of ...).
Thanks @eileen-kuehn for suggesting a workshop about the concepts currently implemented in TARDIS. To be on the same page, the only affected drone state is the `BootingState`.
We observe a timeline like this in an astonishingly large number of cases:
1. `BootingState`, `ResourceStatus.Booting: 1`
2. `shadow` of the LBS is born, i.e. the drone actually starts execution
3. `CleanupState`, TARDIS still thinks `ResourceStatus.Booting: 1`
4. `Requesting graceful removal of job.` (`condor_rm`)
5. At this point, the `starter` of the LBS confirms `ShutdownGraceful all jobs`.

The payload is already running at this point, and is killed without any real grace. Due to signal handling sometimes taking longer, especially with containers, we often have to wait for a bit afterwards until the job actually goes away, and still see this timeline (just for completeness, basically the "race" has already happened):

1. `CleanupState`, `ResourceStatus.Booting: 1`
2. `DrainState`, `ResourceStatus.Running: 2`
3. `DrainingState`, `ResourceStatus.Running: 2`
4. `startd` in LBS: `slot1_17: max vacate time expired. Escalating to a fast shutdown of the job.`
5. `DownState`, `ResourceStatus.Deleted: 4`

I think the main issue is the initial race: a transition from `BootingState` to `CleanupState` can happen while the resource status is outdated, causing a drone with running payload to be killed. This appears to become rather common when many short-lived drones are used and the drone count fluctuates significantly. This causes the job from the OBS to be re-started elsewhere, which some VOs don't like.
I don't have a good suggestion on how to solve this, but one working "trick" (only applying to HTCondor LBSs) might be to add a constraint to `condor_rm` checking that the resource is still in the expected state, to prevent removal if it is not. However, that can't be copied to other batch systems as-is. So maybe an explicit update of the `ResourceStatus` for the single resource before killing it could work, but that is of course some overhead.

(pinging also @wiene)
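A sketch of that HTCondor-only trick (assumptions: the drone corresponds to an LBS job identified by its cluster id, and `JobStatus == 1` means it is still idle, i.e. still in the state TARDIS expects):

```python
# Illustrative sketch only: remove the drone's LBS job with a constraint so
# that it is only removed while it is still idle (JobStatus == 1).
# `cluster_id` is a placeholder for the drone's HTCondor job cluster.
import subprocess


def remove_if_still_idle(cluster_id: int) -> None:
    constraint = f"ClusterId == {cluster_id} && JobStatus == 1"
    subprocess.run(["condor_rm", "-constraint", constraint], check=True)
```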