🐛 fix disruptive concurrency issue with observation cycle 🚨 #4163

GitHK · 2023-04-25T12:23:33Z

What do these changes do?

If a service is already being closed and mark_service_for_removal is called again, the in-progress close procedure is cancelled and started over. This causes many errors and tracebacks.
This PR avoids cancelling the ongoing close procedure and retrying it.

NOTE:

expect no more tracebacks in director-v2 after this
some errors will still show up (since the base feature was hacked in) -> solving them requires a redesign of the scheduler for dynamic services

Bonus:

⬆️ pip~=23.1 repo wide since all tests were failing
♻️ using better task cancellation pattern
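
The PR text does not show which cancellation pattern was adopted. As a minimal sketch of one common asyncio approach (the helper name `cancel_and_wait` and the `demo` worker are hypothetical, not taken from this repository), the caller requests cancellation and then awaits the task so its cleanup has finished before moving on:

```python
import asyncio
import contextlib


async def cancel_and_wait(task: asyncio.Task) -> None:
    """Request cancellation, then wait until the task has actually finished."""
    task.cancel()
    # Awaiting a cancelled task re-raises CancelledError in the caller
    with contextlib.suppress(asyncio.CancelledError):
        await task


async def demo() -> bool:
    cleaned_up = False

    async def worker() -> None:
        nonlocal cleaned_up
        try:
            await asyncio.sleep(3600)
        finally:
            # Runs when the cancellation is delivered
            cleaned_up = True

    task = asyncio.create_task(worker())
    await asyncio.sleep(0)  # give the worker a chance to start
    await cancel_and_wait(task)
    return cleaned_up and task.cancelled()
```

One caveat of this sketch: suppressing `CancelledError` around `await task` also hides a cancellation aimed at the caller itself, so production code may need a more careful variant.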

🚨 To monitor

master: no more tracebacks appear in the logs of the director-v2
staging AWS: autoscaling behaves as expected

Related issue/s

How to test

DevOps Checklist

codecov · 2023-04-25T12:48:23Z

Codecov Report

Merging #4163 (4e4d5ec) into master (5881834) will increase coverage by 28.7%.
The diff coverage is 91.3%.

@@            Coverage Diff            @@
##           master   #4163      +/-   ##
=========================================
+ Coverage    57.2%   86.0%   +28.7%     
=========================================
  Files         666     839     +173     
  Lines       29249   37852    +8603     
  Branches      585     532      -53     
=========================================
+ Hits        16749   32554   +15805     
+ Misses      12377    5173    -7204     
- Partials      123     125       +2

Flag	Coverage Δ
integrationtests	`67.3% <88.7%> (+21.9%)`	⬆️
unittests	`82.5% <79.4%> (+3.2%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...dules/dynamic_sidecar/scheduler/_core/_observer.py	`90.5% <ø> (+60.3%)`	⬆️
.../simcore_service_webserver/notifications/_utils.py	`100.0% <ø> (ø)`
...ver/notifications/_db_comp_tasks_listening_task.py	`87.0% <87.0%> (ø)`
...r_v2/modules/dynamic_sidecar/docker_api/_volume.py	`96.8% <100.0%> (+61.2%)`	⬆️
...s/dynamic_sidecar/scheduler/_core/_events_utils.py	`91.6% <100.0%> (+63.3%)`	⬆️
...ules/dynamic_sidecar/scheduler/_core/_scheduler.py	`78.6% <100.0%> (+45.8%)`	⬆️
...erver/src/simcore_service_webserver/application.py	`97.8% <100.0%> (+97.8%)`	⬆️
.../simcore_service_webserver/application_settings.py	`98.7% <100.0%> (+18.4%)`	⬆️
...re_service_webserver/application_settings_utils.py	`97.7% <100.0%> (+1.1%)`	⬆️
...ice_webserver/notifications/_rabbitmq_consumers.py	`98.4% <100.0%> (ø)`
... and 1 more

... and 508 files with indirect coverage changes


codeclimate · 2023-04-26T09:28:28Z

Code Climate has analyzed commit 4e4d5ec and detected 0 issues on this pull request.

View more on Code Climate.

sonarqubecloud · 2023-04-26T09:29:32Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
0.1% Duplication

matusdrobuliak66

thanks!

sanderegg

Approved. But some things are weird, please check at least the comments. Also, I would like us to discuss before redesigning anything.

sanderegg · 2023-04-26T12:19:32Z

...es/director-v2/src/simcore_service_director_v2/modules/dynamic_sidecar/docker_api/_volume.py

                    tasks = await client.tasks.list(filters={"service": service_id})
                    # NOTE: the service will have at most 1 task, since there is no restart
                    # policy present
                    if len(tasks) != 1:
                        # Docker swarm needs a bit of time to startup the tasks
                        raise TryAgain(
-                            f"Expected 1 task for service {service_id}, found {tasks=}"
+                            f"Expected 1 task for service {service_id} on node {node_uuid}, found {tasks=}"


It might make sense to add the user id and the service name for easier log reading
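
For readers unfamiliar with the retry signal in this hunk, here is a rough, stdlib-only sketch of how an exception such as `TryAgain` can drive polling. It assumes a local stand-in `TryAgain` class and hypothetical helpers (`get_single_task`, `poll_single_task`, `fake_list_tasks`); the actual retry machinery used by director-v2 is not visible in this diff.

```python
import asyncio


class TryAgain(Exception):
    """Local stand-in for the retry signal raised in the hunk above."""


async def get_single_task(list_tasks, service_id: str, node_uuid: str) -> dict:
    tasks = await list_tasks(service_id)
    if len(tasks) != 1:
        # Same message shape as the diff: identifiers first, then the raw value
        raise TryAgain(
            f"Expected 1 task for service {service_id} on node {node_uuid}, found {tasks=}"
        )
    return tasks[0]


async def poll_single_task(list_tasks, service_id, node_uuid, *, attempts=5, delay_s=0.01):
    last_error = None
    for _ in range(attempts):
        try:
            return await get_single_task(list_tasks, service_id, node_uuid)
        except TryAgain as err:
            # Keep the latest message so the final error stays readable in logs
            last_error = err
            await asyncio.sleep(delay_s)
    raise TimeoutError(f"gave up after {attempts} attempts: {last_error}")


async def demo() -> dict:
    replies = [[], [], [{"Status": {"State": "complete"}}]]

    async def fake_list_tasks(_service_id: str) -> list:
        # Hypothetical fake: the task only shows up on the third poll
        return replies.pop(0)

    return await poll_single_task(fake_list_tasks, "srv-1", "node-1")
```

Because the last `TryAgain` message is carried into the final error, any identifier added to it (node, user, service name) shows up wherever the failure is eventually logged.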

sanderegg · 2023-04-26T12:20:01Z

...es/director-v2/src/simcore_service_director_v2/modules/dynamic_sidecar/docker_api/_volume.py

                        )

                    task = tasks[0]
                    task_status = task["Status"]
                    log.debug("Service %s, %s", service_id, f"{task_status=}")
                    task_state = task_status["State"]
                    if task_state not in SERVICE_FINISHED_STATES:
-                        raise TryAgain(f"Waiting for task to finish: {task_status=}")
+                        raise TryAgain(
+                            f"Waiting for task to finish for service {service_id} on node {node_uuid}: {task_status=}"


sanderegg · 2023-04-26T12:20:32Z

...es/director-v2/src/simcore_service_director_v2/modules/dynamic_sidecar/docker_api/_volume.py

-                            "Service %s status: %s", service_id, f"{task_status=}"
+                            "Service %s on node %s status: %s",
+                            service_id,
+                            node_uuid,


sanderegg · 2023-04-26T12:22:36Z

...-v2/src/simcore_service_director_v2/modules/dynamic_sidecar/scheduler/_core/_events_utils.py

@@ -225,10 +229,10 @@ async def service_remove_sidecar_proxy_docker_networks_and_volumes(
    )

    # pylint: disable=protected-access
+    scheduler_data.dynamic_sidecar.service_removal_state.mark_removed()


This looks weird. The ordering seems strange.

The ordering is actually correct. The line below removes the scheduler_data from the scheduler entry.
I think this line has no influence, but the order is now correct.

sanderegg · 2023-04-26T12:24:28Z

...tor-v2/src/simcore_service_director_v2/modules/dynamic_sidecar/scheduler/_core/_scheduler.py

@@ -249,27 +249,25 @@ async def mark_service_for_removal(
                return

            current: SchedulerData = self._to_observe[service_name]
+
+            # if service is already being removed no need to force a cancellation and removal of the service
+            if current.dynamic_sidecar.service_removal_state.can_remove:


So does can_remove mean "currently being removed"?

The correct interpretation would be: "the scheduler can start removing at some point in the future". You cannot know when, no guarantees are provided.
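
To make that interpretation concrete, here is a small hypothetical model of the removal state. Only `can_remove` and `mark_removed()` appear in this PR; `was_removed`, `mark_to_remove()` and `phase()` are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServiceRemovalState:
    can_remove: bool = False   # removal was requested; the scheduler may start it later
    was_removed: bool = False  # assumption: set once the removal has completed

    def mark_to_remove(self) -> None:
        # Hypothetical counterpart of mark_removed(): only records the request
        self.can_remove = True

    def mark_removed(self) -> None:
        self.was_removed = True

    def phase(self) -> str:
        if self.was_removed:
            return "removed"
        if self.can_remove:
            return "removal requested"
        return "running"
```

In this model `can_remove` says nothing about whether removal has started, only that it has been requested, which matches the "no guarantees" wording above.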

sanderegg · 2023-04-26T12:25:18Z

...tor-v2/src/simcore_service_director_v2/modules/dynamic_sidecar/scheduler/_core/_scheduler.py

+                    "Service %s is already being removed, will not cancel observation",
+                    node_uuid,
+                )
+                return


So if I get it correctly, you prevent calling mark_for_removal more than once?

Correct, since calling it twice cancelled the previous task that was removing the services. If cancelled at the wrong time, you would end up with some tracebacks in the director-v2.

The idea here is to only have one removal active at one time.
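
A simplified sketch of that guard, under stated assumptions: the `Scheduler` and `RemovalState` classes below are hypothetical stand-ins, and only the names `mark_service_for_removal` and `can_remove` come from the diff. The second call returns early instead of cancelling the task that is already handling the removal.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class RemovalState:
    # Hypothetical stand-in for service_removal_state
    can_remove: bool = False


@dataclass
class Scheduler:
    # Hypothetical, simplified scheduler: one observation task per service
    states: dict = field(default_factory=dict)
    tasks: dict = field(default_factory=dict)
    cancellations: int = 0

    async def mark_service_for_removal(self, name: str) -> None:
        state = self.states[name]
        if state.can_remove:
            # Removal already requested: do not cancel the observation again
            return
        state.can_remove = True
        task = self.tasks.get(name)
        if task is not None:
            # Cancel the observation exactly once
            task.cancel()
            self.cancellations += 1


async def main() -> int:
    scheduler = Scheduler()
    scheduler.states["svc"] = RemovalState()
    scheduler.tasks["svc"] = asyncio.create_task(asyncio.sleep(3600))
    await scheduler.mark_service_for_removal("svc")
    await scheduler.mark_service_for_removal("svc")  # no-op thanks to the guard
    # Collect the cancelled task so nothing is left pending
    await asyncio.gather(*scheduler.tasks.values(), return_exceptions=True)
    return scheduler.cancellations
```

Running `asyncio.run(main())` exercises both calls; the counter only tracks how many times the observation task was actually cancelled.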

sanderegg · 2023-04-26T12:26:11Z

...tor-v2/src/simcore_service_director_v2/modules/dynamic_sidecar/scheduler/_core/_scheduler.py

-                service_task: None | (
-                    asyncio.Task | object
-                ) = self._service_observation_task[service_name]
+                service_task: None | asyncio.Task | object = (


Very weird types here..

fixed concurrency issue

546b531

GitHK changed the title ~~🐛 fix nasty concurrency issue with observation cycle~~ 🐛 fix disruptive concurrency issue with observation cycle Apr 25, 2023

Andrei Neagu added 3 commits April 25, 2023 14:26

revert

b0945e9

bump pip version

0b1ca89

repo wide pip upgrade

3e01785

GitHK self-assigned this Apr 25, 2023

GitHK added the changelog:🐛bugfix label Apr 25, 2023

GitHK added this to the Jelly Beans milestone Apr 25, 2023

GitHK added the t:maintenance Some planned maintenance work label Apr 25, 2023

Andrei Neagu added 6 commits April 26, 2023 10:59

added logs for debug

c7c8c8d

added extra debug logs

06cffe0

fixed concurrency issue

4ad0bcb

Merge remote-tracking branch 'upstream/master' into patch-observation…

3a99a05

…-cycle

refactor

88737a7

refactor

4e4d5ec

GitHK marked this pull request as ready for review April 26, 2023 09:37

GitHK requested review from sanderegg, pcrespov and matusdrobuliak66 as code owners April 26, 2023 09:37

GitHK changed the title ~~🐛 fix disruptive concurrency issue with observation cycle~~ 🐛 fix disruptive concurrency issue with observation cycle 🚨 Apr 26, 2023

GitHK mentioned this pull request Apr 26, 2023

improving dynamic-sidecar design ITISFoundation/osparc-issues#638

Open

GitHK requested a review from mguidon April 26, 2023 11:00

GitHK mentioned this pull request Apr 26, 2023

🚀 Pre-release master -> staging_JellyBeans3 #4156

Closed

15 tasks

matusdrobuliak66 approved these changes Apr 26, 2023


pcrespov approved these changes Apr 26, 2023


pcrespov merged commit b2223f1 into ITISFoundation:master Apr 26, 2023

sanderegg approved these changes Apr 26, 2023


GitHK deleted the patch-observation-cycle branch April 26, 2023 12:30

matusdrobuliak66 mentioned this pull request May 30, 2023

🚀 Release v1.53.0 #4236

Closed

24 tasks
