🐛 fix disruptive concurrency issue with observation cycle 🚨 #4163

Merged
merged 10 commits into from
Apr 26, 2023

Conversation

GitHK
Contributor

@GitHK GitHK commented Apr 25, 2023

What do these changes do?

If a service is already being closed when mark_service_for_removal is called, the in-progress close procedure is cancelled and started again. This causes many errors and tracebacks.
This PR avoids cancelling the current service close procedure and retrying it.

NOTE:

  • expect no more tracebacks in director-v2 after this
  • some errors will still show up (since the base feature was hacked in); solving these requires a redesign of the scheduler for dynamic services

Bonus:

  • ⬆️ pip~=23.1 repo wide since all tests were failing
  • ♻️ using better task cancellation pattern

🚨 To monitor

master: no more tracebacks appear in the logs of the director-v2
staging AWS: autoscaling behaves as expected

Related issue/s

How to test

DevOps Checklist

@GitHK GitHK changed the title from "🐛 fix nasty concurrency issue with observation cycle" to "🐛 fix disruptive concurrency issue with observation cycle" Apr 25, 2023
@GitHK GitHK self-assigned this Apr 25, 2023
@GitHK GitHK added this to the Jelly Beans milestone Apr 25, 2023
@GitHK GitHK added the t:maintenance Some planned maintenance work label Apr 25, 2023
@codecov

codecov bot commented Apr 25, 2023

Codecov Report

Merging #4163 (4e4d5ec) into master (5881834) will increase coverage by 28.7%.
The diff coverage is 91.3%.


@@            Coverage Diff            @@
##           master   #4163      +/-   ##
=========================================
+ Coverage    57.2%   86.0%   +28.7%     
=========================================
  Files         666     839     +173     
  Lines       29249   37852    +8603     
  Branches      585     532      -53     
=========================================
+ Hits        16749   32554   +15805     
+ Misses      12377    5173    -7204     
- Partials      123     125       +2     
Flag Coverage Δ
integrationtests 67.3% <88.7%> (+21.9%) ⬆️
unittests 82.5% <79.4%> (+3.2%) ⬆️


Impacted Files Coverage Δ
...dules/dynamic_sidecar/scheduler/_core/_observer.py 90.5% <ø> (+60.3%) ⬆️
.../simcore_service_webserver/notifications/_utils.py 100.0% <ø> (ø)
...ver/notifications/_db_comp_tasks_listening_task.py 87.0% <87.0%> (ø)
...r_v2/modules/dynamic_sidecar/docker_api/_volume.py 96.8% <100.0%> (+61.2%) ⬆️
...s/dynamic_sidecar/scheduler/_core/_events_utils.py 91.6% <100.0%> (+63.3%) ⬆️
...ules/dynamic_sidecar/scheduler/_core/_scheduler.py 78.6% <100.0%> (+45.8%) ⬆️
...erver/src/simcore_service_webserver/application.py 97.8% <100.0%> (+97.8%) ⬆️
.../simcore_service_webserver/application_settings.py 98.7% <100.0%> (+18.4%) ⬆️
...re_service_webserver/application_settings_utils.py 97.7% <100.0%> (+1.1%) ⬆️
...ice_webserver/notifications/_rabbitmq_consumers.py 98.4% <100.0%> (ø)
... and 1 more

... and 508 files with indirect coverage changes

@codeclimate

codeclimate bot commented Apr 26, 2023

Code Climate has analyzed commit 4e4d5ec and detected 0 issues on this pull request.


@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: no coverage information
Duplication: 0.1%

@GitHK GitHK marked this pull request as ready for review April 26, 2023 09:37
@GitHK GitHK changed the title from "🐛 fix disruptive concurrency issue with observation cycle" to "🐛 fix disruptive concurrency issue with observation cycle 🚨" Apr 26, 2023
@GitHK GitHK requested a review from mguidon April 26, 2023 11:00
Contributor

@matusdrobuliak66 matusdrobuliak66 left a comment

thanks!

@pcrespov pcrespov merged commit b2223f1 into ITISFoundation:master Apr 26, 2023
Member

@sanderegg sanderegg left a comment

Approved, but some things are weird; please check at least the comments. I would also like us to discuss before redesigning anything.

  tasks = await client.tasks.list(filters={"service": service_id})
  # NOTE: the service will have at most 1 task, since there is no restart
  # policy present
  if len(tasks) != 1:
      # Docker swarm needs a bit of time to startup the tasks
      raise TryAgain(
-         f"Expected 1 task for service {service_id}, found {tasks=}"
+         f"Expected 1 task for service {service_id} on node {node_uuid}, found {tasks=}"
Member

It might make sense to add the user id and the service name for easier log reading

      )

  task = tasks[0]
  task_status = task["Status"]
  log.debug("Service %s, %s", service_id, f"{task_status=}")
  task_state = task_status["State"]
  if task_state not in SERVICE_FINISHED_STATES:
-     raise TryAgain(f"Waiting for task to finish: {task_status=}")
+     raise TryAgain(
+         f"Waiting for task to finish for service {service_id} on node {node_uuid}: {task_status=}"
Member

Same here

-     "Service %s status: %s", service_id, f"{task_status=}"
+     "Service %s on node %s status: %s",
+     service_id,
+     node_uuid,
Member

Same here
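
One way to act on the suggestion above without repeating identifiers in every message is a `logging.LoggerAdapter` that prefixes them automatically. This is only a sketch under assumptions: `ServiceLogAdapter`, `make_service_logger` and the identifier names passed to it are hypothetical and do not appear in this PR.

```python
import logging


class ServiceLogAdapter(logging.LoggerAdapter):
    """Prefix every message with the identifiers stored in `extra`."""

    def process(self, msg, kwargs):
        # Build a stable "key=value" prefix from the identifiers given at creation.
        context = " ".join(f"{key}={value}" for key, value in self.extra.items())
        return f"[{context}] {msg}", kwargs


def make_service_logger(base: logging.Logger, **identifiers: str) -> ServiceLogAdapter:
    """Hypothetical helper: wrap a logger with per-service context."""
    return ServiceLogAdapter(base, identifiers)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    log = make_service_logger(
        logging.getLogger("demo"), user_id="1", service_name="svc", node_uuid="abc"
    )
    log.debug("Waiting for task to finish: %s", {"State": "running"})
```

With such an adapter the call sites keep short messages, and the user id, service name and node id are added in one place.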

@@ -225,10 +229,10 @@ async def service_remove_sidecar_proxy_docker_networks_and_volumes(
)

# pylint: disable=protected-access
scheduler_data.dynamic_sidecar.service_removal_state.mark_removed()
Member

This looks weird. The ordering seems strange.

Contributor Author

The ordering is actually correct: the line below removes the scheduler_data from the scheduler entry.
I think these lines have no influence, but the order is now correct.

@@ -249,27 +249,25 @@ async def mark_service_for_removal(
return

current: SchedulerData = self._to_observe[service_name]

# if service is already being removed no need to force a cancellation and removal of the service
if current.dynamic_sidecar.service_removal_state.can_remove:
Member

So does can_remove mean the service is currently being removed?

Contributor Author

The correct interpretation would be: "the scheduler can start removing at some point in the future". You cannot know when; no guarantees are provided.

"Service %s is already being removed, will not cancel observation",
node_uuid,
)
return
Member

So if I get it correctly, you prevent calling mark_for_removal more than once?

Contributor Author

Correct: calling it twice cancelled the previous task that was removing the services. If that task was cancelled at the wrong time, you would end up with tracebacks in director-v2.

The idea here is to only have one removal active at one time.

- service_task: None | (
-     asyncio.Task | object
- ) = self._service_observation_task[service_name]
+ service_task: None | asyncio.Task | object = (
Member

Very weird types here..
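
If the bare `object` in that annotation stands for a sentinel value (a guess, since the diff does not show where it comes from), one possible alternative is an `Enum` member, which type checkers can narrow with `is`. The names `ObservationMark`, `ObservationSlot` and `describe` below are hypothetical.

```python
import asyncio
import enum
from typing import Union


class ObservationMark(enum.Enum):
    """Hypothetical sentinel replacing a bare `object()` marker."""

    DISABLED = enum.auto()


# A slot holds a running task, the sentinel, or nothing at all.
ObservationSlot = Union[asyncio.Task, ObservationMark, None]


def describe(slot: ObservationSlot) -> str:
    """Show that each case can be told apart without isinstance on `object`."""
    if slot is None:
        return "no observation scheduled"
    if slot is ObservationMark.DISABLED:
        return "observation disabled"
    return f"observing via task {slot.get_name()}"
```

Because a union with `object` accepts every value, the original annotation gives a type checker nothing to work with; a dedicated sentinel type keeps the three cases distinct.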

@GitHK GitHK deleted the patch-observation-cycle branch April 26, 2023 12:30
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request May 30, 2023
24 tasks
Labels
t:maintenance Some planned maintenance work
Projects
None yet
4 participants