Re-use the last desired patch on empty batch #3453
Conversation
@nikola-jokic, a quick question. Do I understand correctly that the empty message here occurs because, when the GetMessage function is called and there are no messages in the queue, the server returns …? I also understand correctly that with each empty message, the … . I'm asking all these questions because it seems we've encountered a problem that arises when several factors coincide:
After this, the GitHub Workflow remains in a queued state until another job for this type of runner appears, because the runner didn't have time to process the job due to the ScaleDown patch. Once a new job appears, the process will repeat, and there will be two jobs in the queue waiting to be executed. The code in this PR should correct the situation, as when an …
Hey @verdel, you are 100% right. It was a mistake on my end: on an empty batch, I forced the minimum required number of runners to avoid wasting resources, but I did not account for the case where a runner can be pending for the duration of the empty batch. This change would "force" the last known desired state that the listener knows about.
Is it normal behavior that the …?
It should be okay, since patch ID 0 would only mean "make sure this state is matched". However, your question just gave me a better idea. Let me test it out and I'll get back to you.
Thank you very much for the detailed explanation.
…istener when target == min
if targetRunnerCount == w.config.MinRunners {
	// We have an empty batch, and the last patch was the min runners.
	// Since this is an empty batch, and we are at the min runners, they should all be idle.
	// If the controller created a few more pods by accident (during scale down events),
	// this situation allows the controller to scale down to the min runners.
	// However, it is important to keep the patch sequence increasing so we don't skip a batch.
	desiredPatchID = 0
}
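For context, here is a minimal, self-contained sketch of how the empty-batch path above would behave. This is not the actual listener code; the `worker` struct and its fields are illustrative stand-ins, with names chosen to mirror `patchSeq`, `MinRunners`, and `targetRunnerCount` from the snippet.

```go
package main

import "fmt"

// worker is an illustrative stand-in for the listener's worker state.
type worker struct {
	patchSeq   int // monotonically increasing patch sequence
	minRunners int // configured minimum number of runners
	lastTarget int // last desired runner count sent to the controller
}

// onEmptyBatch sketches the logic in the snippet above: the sequence still
// increments so no batch is skipped, but when the last target equals the
// minimum, the patch ID sent out is forced to 0, meaning
// "make sure this state is matched".
func (w *worker) onEmptyBatch() (patchID int) {
	w.patchSeq++
	patchID = w.patchSeq
	if w.lastTarget == w.minRunners {
		patchID = 0
	}
	return patchID
}

func main() {
	w := &worker{patchSeq: 3, minRunners: 1, lastTarget: 1}
	fmt.Println(w.onEmptyBatch()) // target == min, so the patch ID is forced to 0

	w.lastTarget = 4
	fmt.Println(w.onEmptyBatch()) // target above min keeps the incremented sequence
}
```

Note that the internal sequence keeps advancing in both cases; only the ID communicated to the controller is zeroed.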
It seems to me that this code is unnecessary. We want to trigger a reconcile when we receive an empty batch and targetRunnerCount == w.config.MinRunners by changing the PatchID value, but that will happen anyway because of this code:
w.patchSeq++
desiredPatchID := w.patchSeq
Why reset this counter to zero?
The purpose of setting patchID to 0 is to remove ephemeral runners that were created but shouldn't have been.
There is a rare case where the listener receives a scale down event. Let's say the desired state was 5, and 1 runner has finished. During the patch, 2 more ephemeral runners finish.
This situation would cause the ephemeral runner set controller to create 1 more pod: we match against the current state of the cluster, which is 3, while the patch says we need a state of 4. It is a very rare case, but it can happen when runners finish at the same time. At that point, the ephemeral runner set will create 1 more runner.
If the cluster is busy, this situation can actually improve performance, since on the next job you will already have a runner ready to pick it up, so it is re-used.
But when your cluster becomes idle, that is when we really have to remove these runners that were created due to a timing difference. So to self-correct, we force the min runners at the time they are all idle.
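The self-correcting effect described above can be sketched as follows. This is illustrative only, not the actual reconciliation code; `surplusToRemove` is a hypothetical helper that condenses the two behaviors into one function.

```go
package main

import "fmt"

// surplusToRemove sketches the self-correction enabled by patch ID 0: a normal
// (non-zero) patch tolerates extra runners created by the timing race described
// above, while patch ID 0 ("force the min runners while idle") removes them.
func surplusToRemove(current, desired, patchID int) int {
	if patchID != 0 {
		// Tolerate the over-scale; the extra runner may pick up the next job.
		return 0
	}
	if current > desired {
		return current - desired
	}
	return 0
}

func main() {
	fmt.Println(surplusToRemove(3, 2, 7)) // busy cluster, normal patch: keep the extra runner
	fmt.Println(surplusToRemove(3, 2, 0)) // idle cluster, patch ID 0: remove 1 surplus runner
}
```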
@nikola-jokic, I still have a question; perhaps you could help with some additional information. Does the empty message (empty batch) here occur because, when the GetMessage function is called and there are no messages in the queue, the server returns http.StatusAccepted as a result of the long poll connection after 50 seconds, or does the server response actually come with http.StatusOK but with an empty body?
Thank you, @verdel, for asking all the right questions! This question made me realize that we don't need to force the state on every empty batch. Whenever we scale up, we are matching the desired state. Scale down is already handled by the very nature of the ephemeral runner, so we simply need to clean them up. Only when the scale set is completely idle, i.e. there are no jobs coming and no busy runners, do we "force" the min runners state. This allows the controller to self-correct.
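The refined condition could be expressed as a simple predicate. This is a sketch under the assumption stated above (the listener knows the queued-job and busy-runner counts from the statistics it receives); the function and parameter names are hypothetical.

```go
package main

import "fmt"

// shouldForceMinRunners sketches the refined rule from the discussion: only
// when the scale set is completely idle, meaning no jobs are coming and no
// runners are busy, do we force the min-runners state so the controller can
// self-correct. On any other empty batch, the state is left alone.
func shouldForceMinRunners(queuedJobs, busyRunners int) bool {
	return queuedJobs == 0 && busyRunners == 0
}

func main() {
	fmt.Println(shouldForceMinRunners(0, 0)) // completely idle: force min runners
	fmt.Println(shouldForceMinRunners(0, 2)) // runners still busy: do not force
}
```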
How will you determine the current number of busy runners, and that it equals zero? It seems you want to obtain the number of active runners every time you receive an empty batch (every 50 seconds), and if it equals zero, then patch the … . Since I am not very familiar with the specifics of the controller's operation, I cannot fully understand whether the following situation might recur due to an incorrect determination of the number of busy runners: …
I'll try to explain my reasoning. When the job is received, the number of runners is … . Now, we start creating a pod, and it is in a pending state. The next batch comes in, and it is empty, so … . Now let's say that the job is done. We get the message saying we need 0 runners and that 1 job is done. We are not hitting the if case, and we calculate the targetRunnerCount, which would be 0. This value is set as the last patch, and the patchID is incremented. On the next empty batch, we are hitting the same if case. At this point, the number of runners must be the number of idle runners (i.e. the minRunners count). The reason all of them must be idle is: …
Let's summarize to make sure we understand all the cases in the same way. Originally, the whole idea of this patch was to restore the number of runners in the case where a runner in a pending state was terminated instead of a runner that had completed its task and was in the process of finishing up. It is also necessary to handle the case when something forcibly terminated a runner that had not yet completed its task. In general, the only way to handle all cases is to patch the … .
These empty batches result from long polling requests to the GitHub API and occur every 50 seconds. The GitHub API on its end has a timeout of 50 seconds and apparently returns HTTP 202 if there are no tasks in the queue within 50 seconds from the start of the request.
Now, a question about the operation of the controller that monitors the …
But if at this moment the EphemeralRunnerSet indicates that … . The same happens in the case where the EphemeralRunnerSet indicates … .
Overall, the implementation outlined in this PR solves the problem. If we want to change the … . Resetting the … .
To repeat, the only question I have left is why the controller responsible for processing the … .
This summary is almost completely correct. We do not scale based on jobs assigned; we scale based on the statistics. So on each message that is not an empty batch, the server lets us know the desired number of runners (not counting the minRunners) needed to do the jobs. So in your example, when the new job comes in, you would have … .
To answer your question, the controller should not always match the number of replicas, because the time it takes for the listener to patch the new desired replicas can be, and usually is, greater than the time the controller needs to react to ephemeral runner changes. Each change on an ephemeral runner triggers the reconciliation loop of the ephemeral runner set. If we tried to match the state each time a pod is done, we would end up creating the pod, then the patch to scale down would happen, and then we would have to clean it up.
Another important thing to note: the patch ID can avoid these problems when there is a job spike. When we start creating ephemeral runners, we tag them with the patch ID. Then each transition from pending to running does not trigger an unnecessary count check, which postpones ephemeral runner cleanups. This ensures that ARC scales up as quickly as possible, since each cleanup would also trigger the reconciliation loop again, causing a slower reaction to scaling up. Every state change is communicated by the listener, so only then does the ephemeral runner controller need to correct the state.
Also, when we start scaling down, these checks are performed more frequently, since the maximum of all patchIDs from the ephemeral runners can only be less than or equal to the patch ID of the ephemeral runner set (it will be less once the ephemeral runners with the maximum patch ID are done and cleaned up). However, in this case we can tolerate these checks, and this is when we can potentially overscale. That is why we need patch ID 0 to self-correct when the cluster is idle.
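The relationship between runner patch IDs and the set's patch ID described above can be sketched as a small helper. This is illustrative, not the controller's actual code; `countCheckNeeded` is a hypothetical name for the decision being described.

```go
package main

import "fmt"

// countCheckNeeded sketches the invariant from the discussion: ephemeral
// runners are tagged with the patch ID that created them, so the maximum of
// those IDs is always <= the set's patch ID. While runners from the newest
// patch are still alive, pending -> running transitions need not trigger a
// count check; once they finish and the maximum falls behind, counts are
// re-checked (and this is when over-scaling can briefly occur).
func countCheckNeeded(runnerPatchIDs []int, setPatchID int) bool {
	maxID := 0
	for _, id := range runnerPatchIDs {
		if id > maxID {
			maxID = id
		}
	}
	return maxID < setPatchID
}

func main() {
	fmt.Println(countCheckNeeded([]int{4, 5, 5}, 5)) // newest patch still running: no check
	fmt.Println(countCheckNeeded([]int{4, 4}, 5))    // newest runners cleaned up: check counts
}
```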
I carefully reviewed the code of the ephemeral runner set controller and saw how the … . I was puzzled by this part of the code. In what situation could …?
Let's move forward with this right now, but as we discussed we should invest time to improve the maintainability of this down the line.
Fixes #3450