🐛 fix disruptive concurrency issue with observation cycle 🚨 #4163
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #4163       +/-   ##
===========================================
+ Coverage     57.2%    86.0%    +28.7%
===========================================
  Files          666      839      +173
  Lines        29249    37852     +8603
  Branches       585      532       -53
===========================================
+ Hits         16749    32554    +15805
+ Misses       12377     5173     -7204
- Partials       123      125        +2
Flags with carried forward coverage won't be shown.
Code Climate has analyzed commit 4e4d5ec and detected 0 issues on this pull request. View more on Code Climate.
Kudos, SonarCloud Quality Gate passed!
thanks!
Approved. But some things are weird, please check at least the comments. Also, I would like us to discuss before redesigning anything.
tasks = await client.tasks.list(filters={"service": service_id})
# NOTE: the service will have at most 1 task, since there is no restart
# policy present
if len(tasks) != 1:
    # Docker swarm needs a bit of time to startup the tasks
    raise TryAgain(
-       f"Expected 1 task for service {service_id}, found {tasks=}"
+       f"Expected 1 task for service {service_id} on node {node_uuid}, found {tasks=}"
It might make sense to add the user id and the service name for easier log reading
    )
task = tasks[0]
task_status = task["Status"]
log.debug("Service %s, %s", service_id, f"{task_status=}")
task_state = task_status["State"]
if task_state not in SERVICE_FINISHED_STATES:
-   raise TryAgain(f"Waiting for task to finish: {task_status=}")
+   raise TryAgain(
+       f"Waiting for task to finish for service {service_id} on node {node_uuid}: {task_status=}"
Same here
"Service %s status: %s", service_id, f"{task_status=}" | ||
"Service %s on node %s status: %s", | ||
service_id, | ||
node_uuid, |
Same here
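For context, the TryAgain branches discussed above form a single polling loop over the swarm task of the service. Below is a minimal sketch of that loop, assuming tenacity drives the retries and aiodocker is the client; the user_id parameter (following the reviewer's suggestion) and the exact SERVICE_FINISHED_STATES values are assumptions, not taken from this PR.

import logging

import aiodocker
from tenacity import AsyncRetrying, TryAgain, stop_after_delay, wait_fixed

log = logging.getLogger(__name__)

# assumed set of terminal docker swarm task states; the real constant may differ
SERVICE_FINISHED_STATES = {"complete", "failed", "shutdown", "rejected", "orphaned", "remove"}


async def wait_for_service_task_to_finish(
    client: aiodocker.Docker, service_id: str, node_uuid: str, user_id: int
) -> dict:
    async for attempt in AsyncRetrying(
        wait=wait_fixed(1), stop=stop_after_delay(60), reraise=True
    ):
        with attempt:
            tasks = await client.tasks.list(filters={"service": service_id})
            # the service has at most 1 task since no restart policy is present
            if len(tasks) != 1:
                # docker swarm needs a bit of time to start up the task
                raise TryAgain(
                    f"Expected 1 task for service {service_id} on node {node_uuid} "
                    f"(user {user_id}), found {tasks=}"
                )
            task_status = tasks[0]["Status"]
            log.debug(
                "Service %s on node %s status: %s", service_id, node_uuid, f"{task_status=}"
            )
            if task_status["State"] not in SERVICE_FINISHED_STATES:
                raise TryAgain(
                    f"Waiting for task of service {service_id} on node {node_uuid} "
                    f"(user {user_id}) to finish: {task_status=}"
                )
            return task_status
    raise RuntimeError("unreachable: AsyncRetrying reraises on exhaustion")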
@@ -225,10 +229,10 @@ async def service_remove_sidecar_proxy_docker_networks_and_volumes(
    )

    # pylint: disable=protected-access
    scheduler_data.dynamic_sidecar.service_removal_state.mark_removed()
This looks weird. The ordering seems strange.
The ordering is actually correct: the line below removes the scheduler_data from the scheduler entry. I think these lines have no influence, but the order is now correct.
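To make the ordering concrete, the intended sequence is roughly the sketch below; the pop from _to_observe stands in for whatever line actually drops the entry, and every name besides mark_removed, _to_observe and scheduler_data is an assumption.

async def remove_service(self, service_name: str) -> None:
    scheduler_data = self._to_observe[service_name]

    # ... sidecar/proxy services, docker networks and volumes are removed here ...

    # 1) first flag the tracked state as removed ...
    # pylint: disable=protected-access
    scheduler_data.dynamic_sidecar.service_removal_state.mark_removed()

    # 2) ... and only afterwards drop the entry from the scheduler, so no
    # observation cycle can still pick up a half-removed service
    self._to_observe.pop(service_name, None)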
@@ -249,27 +249,25 @@ async def mark_service_for_removal(
            return

        current: SchedulerData = self._to_observe[service_name]

        # if service is already being removed no need to force a cancellation and removal of the service
        if current.dynamic_sidecar.service_removal_state.can_remove:
So does can_remove mean currently being removed?
The correct interpretation would be: "the scheduler can start removing at some point in the future". You cannot know when, no guarantees are provided.
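Given that interpretation, the removal state can be pictured as a small flag holder roughly like the sketch below; only can_remove and mark_removed appear in this PR, the other field and method names are assumptions.

from dataclasses import dataclass


@dataclass
class ServiceRemovalState:
    # True once the scheduler is allowed to start the removal at some point in
    # the future; it does NOT mean the removal is currently in progress
    can_remove: bool = False
    was_removed: bool = False

    def mark_to_remove(self) -> None:
        # called by mark_service_for_removal: removal may now start at any time
        self.can_remove = True

    def mark_removed(self) -> None:
        # called once the removal procedure has actually finished
        self.was_removed = True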
"Service %s is already being removed, will not cancel observation", | ||
node_uuid, | ||
) | ||
return |
So if I get it correctly, you prevent calling mark_for_removal more than once?
Correct. Calling it twice cancelled the previous task that was removing the services; if cancelled at the wrong time, you would end up with tracebacks in the director-v2.
The idea here is to only have one removal active at a time.
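Putting the pieces together, the guard makes mark_service_for_removal effectively idempotent: the first call flags the service, later calls become no-ops instead of cancelling a close procedure that is already running. A rough sketch of the method, assuming the ServiceRemovalState sketched above plus scheduler attributes that are themselves assumptions (a node_uuid -> service_name mapping and an asyncio lock around the bookkeeping):

async def mark_service_for_removal(self, node_uuid: str) -> None:
    async with self._lock:  # assumed lock guarding the scheduler's bookkeeping
        service_name = self._node_to_service_name.get(node_uuid)  # hypothetical mapping
        if service_name is None or service_name not in self._to_observe:
            return

        current: SchedulerData = self._to_observe[service_name]

        # a service already flagged for removal: do not cancel the running
        # close procedure just to start it again
        if current.dynamic_sidecar.service_removal_state.can_remove:
            logger.debug(
                "Service %s is already being removed, will not cancel observation",
                node_uuid,
            )
            return

        # first call only: flag the service so exactly one removal stays active
        current.dynamic_sidecar.service_removal_state.mark_to_remove()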
-   service_task: None | (
-       asyncio.Task | object
-   ) = self._service_observation_task[service_name]
+   service_task: None | asyncio.Task | object = (
Very weird types here..
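As for the union itself, None | asyncio.Task | object typically means the mapping holds either "no task yet", a running observation task, or an opaque sentinel object. Whether that is the intent here is only a guess; the general pattern looks like:

import asyncio

# hypothetical sentinel marking an entry whose observation is intentionally
# disabled, as opposed to "no task yet" (None) or a running asyncio.Task
_OBSERVATION_DISABLED = object()

_service_observation_task: dict[str, None | asyncio.Task | object] = {}


def observation_is_running(service_name: str) -> bool:
    entry = _service_observation_task.get(service_name)
    return isinstance(entry, asyncio.Task) and not entry.done()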
What do these changes do?
If a service is being closed and mark_service_for_removal is called, the current close procedure is cancelled and tried once again. This causes lots of errors and tracebacks. This PR avoids cancelling the current service close procedure and trying it again.
NOTE:

Bonus: pip~=23.1 repo-wide, since all tests were failing

🚨 To monitor
master: no more tracebacks appear in the logs of the director-v2
staging AWS: autoscaling behaves as expected

Related issue/s

How to test

DevOps Checklist