Workflow deadlocks on mutex in steps template if controller is restarted #8684

Closed
3 tasks done
Gibstick opened this issue May 9, 2022 · 29 comments · Fixed by #13553
Labels: area/mutex-semaphore · P1 (High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority) · solution/workaround (There's a workaround, might not be great, but exists) · type/bug

Comments

@Gibstick

Gibstick commented May 9, 2022

Checklist

  • Double-checked my configuration.
  • Tested using the latest version.
  • Used the Emissary executor.

Summary

What happened/what you expected to happen?

The workflow deadlocks on its mutex if the workflow controller is restarted while the workflow is running; I expect the workflow to complete successfully even if the controller is restarted. The same issue does not occur if the mutex is on the container template or at the workflow level.

What version are you running?
v3.3.4 is what I've been using for this reproduction, but I've also been able to reproduce it in v3.3.3, v3.3.2, and v3.2.7 with a more complicated workflow.

Diagnostics

This is a workflow that reliably reproduces the issue. The steps I use are:

argo submit deadlock.yaml
sleep 5
kubectl delete pod -l app.kubernetes.io/component=workflow-controller

# deadlock.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: deadlock-test-
spec:
  entrypoint: main-steps

  templates:
    - name: main-steps
      synchronization:
        mutex:
          name: "mutex-{{workflow.name}}"
      steps:
        - - name: main
            template: main-container

    - name: main-container
      container:
        image: "busybox:latest"
        command: ["sh"]
        args: ["-c", "sleep 10; exit 0"]

Pod logs
No pod logs from the main container since it just sleeps and exits.

Wait container logs

time="2022-05-09T12:57:42.216Z" level=info msg="Starting Workflow Executor" executorType=docker version=v3.3.2
time="2022-05-09T12:57:42.217Z" level=info msg="Creating a docker executor"
time="2022-05-09T12:57:42.217Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2022-05-09T12:57:42.217Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=kube-system podName=deadlock-test-bnbjc-938207750 template="{\"name\":\"main-container\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"busybox:latest\",\"command\":[\"sh\"],\"args\":[\"-c\",\"sleep 10; exit 0\"],\"resources\":{}}}" version="&Version{Version:v3.3.2,BuildDate:2022-04-20T21:41:09Z,GitCommit:35492a1700a0f279694cac874b6d9c07a08265d1,GitTag:v3.3.2,GitTreeState:clean,GoVersion:go1.17.9,Compiler:gc,Platform:linux/arm64,}"
time="2022-05-09T12:57:42.217Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:42.218Z" level=info msg="Starting deadline monitor"
time="2022-05-09T12:57:42.231Z" level=info msg="listed containers" containers="map[wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:42.231Z" level=info msg="mapped container name \"wait\" to container ID \"a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522\" (created at 2022-05-09 12:57:42 +0000 UTC, status Up)"
time="2022-05-09T12:57:43.232Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:43.248Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:43.248Z" level=info msg="mapped container name \"main\" to container ID \"0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862\" (created at 2022-05-09 12:57:43 +0000 UTC, status Up)"
time="2022-05-09T12:57:44.220Z" level=info msg="docker wait 0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862"
time="2022-05-09T12:57:44.248Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:44.269Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:45.270Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:45.287Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:46.288Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:46.323Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:47.325Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:47.350Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:48.351Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:48.380Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:49.380Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:49.409Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:50.411Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:50.449Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:51.450Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:51.490Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:52.491Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:52.536Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:53.289Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:53.310Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Exited {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:53.310Z" level=info msg="Main container completed"
time="2022-05-09T12:57:53.310Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2022-05-09T12:57:53.310Z" level=info msg="No output parameters"
time="2022-05-09T12:57:53.310Z" level=info msg="No output artifacts"
time="2022-05-09T12:57:53.310Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:53.327Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Exited {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:53.327Z" level=info msg="Killing sidecars []"
time="2022-05-09T12:57:53.327Z" level=info msg="Alloc=6106 TotalAlloc=10797 Sys=19410 NumGC=3 Goroutines=8"

Workflow controller logs

Before killing (old workflow controller pod):
logs-before.txt

After killing (new workflow controller pod):
logs-after.txt

Workflow resource

Looking at the workflow after a while, it shows a deadlock in the synchronization section (sorry this doesn't match the logs above, but it's the same structure every time). It's holding and waiting on the same lock.

  synchronization:
    mutex:
      holding:
      - holder: deadlock-test-4lp66
        mutex: kube-system/Mutex/mutex-deadlock-test-4lp66
      waiting:
      - holder: kube-system/deadlock-test-4lp66
        mutex: kube-system/Mutex/mutex-deadlock-test-4lp66

workflow-resource.txt (uploaded as .txt because GitHub doesn't allow .yaml attachments)


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

stale bot added the problem/stale label on Jun 12, 2022
@Gibstick
Author

I can still reproduce this issue using the same steps on v3.3.6

stale bot removed the problem/stale label on Jun 14, 2022
stale bot added the problem/stale label on Jul 10, 2022
@Gibstick
Author

Still happening on v3.3.8 with the same steps

stale bot removed the problem/stale label on Jul 16, 2022
stale bot added the problem/stale label on Jul 31, 2022
@Gibstick
Author

v3.3.8 is still the latest release so I think it's fair to say this bug is still relevant

stale bot removed the problem/stale label on Jul 31, 2022
@alexec removed the triage label on Sep 6, 2022
stale bot added the problem/stale label on Oct 1, 2022
@Gibstick
Author

Gibstick commented Oct 3, 2022

I'm still able to reproduce this issue on v3.4.1 with the same steps

@sarabala1979 added the P3 (Low priority) label and removed the problem/stale label on Oct 3, 2022
@sarabala1979 added the P2 (Important) label and removed the P3 (Low priority) label on Oct 11, 2022
@isubasinghe
Member

isubasinghe commented Oct 19, 2022

I had a look at this, and I think this behaviour is fixable, provided the observation that "the same issue does not occur if the mutex is on the container template or at the workflow level" is true.

@Gibstick would you mind providing an example where it does succeed? I am not entirely familiar with using mutexes in Argo myself.

@sarabala1979 could you please assign this to me?

Thanks

@isubasinghe
Member

isubasinghe commented Oct 20, 2022

Okay, this was quite difficult to debug, but I think I finally found the reason why this doesn't work.

Let's imagine we have a workflow named "xyz".
We launch "xyz" and then take a lock on "argo/xyz".
Now the controller crashes.
On restart, instead of looking for a lock named "argo/xyz", we look for a lock named "argo/xyz/xyz".

There is something going on with how the holderKey is generated.

I believe that fixing this mismatch should fix this bug.

@isubasinghe
Member

isubasinghe commented Oct 21, 2022

In workflow/sync/sync_manager.go, the way mutexes are implemented looks wrong to me. I believe this is what is causing #8684.

This is the issue:
On restart, the keys used to re-acquire locks come from the output of getResourceKey.
Please check func (cm *Manager) Initialize(wfs []wfv1.Workflow).

However, if we look at how the Manager attempts to acquire a lock in func (cm *Manager) TryAcquire(wf *wfv1.Workflow, nodeName string, syncLockRef *wfv1.Synchronization) (bool, bool, string, error), we can see that it uses getHolderKey.

This results in the mismatched keys described above, which is why I believe this deadlock occurs.
Luckily this is fairly easy to fix.

Could one of the maintainers please confirm which is the desired behaviour here? My belief is that getHolderKey is the correct one.

If they are both correct, it could be the encoding and decoding of the ConfigMap state that is wrong.

func getHolderKey(wf *wfv1.Workflow, nodeName string) string {
	if wf == nil {
		return ""
	}
	key := fmt.Sprintf("%s/%s", wf.Namespace, wf.Name)
	if nodeName != "" {
		key = fmt.Sprintf("%s/%s", key, nodeName)
	}
	return key
}

func getResourceKey(namespace, wfName, resourceName string) string {
	resourceKey := fmt.Sprintf("%s/%s", namespace, wfName)
	if resourceName != wfName {
		resourceKey = fmt.Sprintf("%s/%s", resourceKey, resourceName)
	}
	return resourceKey
}
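
For illustration, here is a minimal, self-contained sketch (not the controller code itself; it only copies the two functions above onto a stripped-down Workflow struct) showing how the two keys diverge for a hypothetical workflow "xyz" in namespace "argo" whose lock is held by a node that shares the workflow's name, which as far as I can tell is what happens for the root node of the steps entrypoint in the reproduction above:

package main

import "fmt"

// Simplified stand-in for wfv1.Workflow, just enough for this sketch.
type Workflow struct {
	Namespace string
	Name      string
}

// Copy of getHolderKey quoted above, adapted to the simplified struct.
func getHolderKey(wf *Workflow, nodeName string) string {
	if wf == nil {
		return ""
	}
	key := fmt.Sprintf("%s/%s", wf.Namespace, wf.Name)
	if nodeName != "" {
		key = fmt.Sprintf("%s/%s", key, nodeName)
	}
	return key
}

// Copy of getResourceKey quoted above.
func getResourceKey(namespace, wfName, resourceName string) string {
	resourceKey := fmt.Sprintf("%s/%s", namespace, wfName)
	if resourceName != wfName {
		resourceKey = fmt.Sprintf("%s/%s", resourceKey, resourceName)
	}
	return resourceKey
}

func main() {
	wf := &Workflow{Namespace: "argo", Name: "xyz"}

	// Key used at acquire time for a node-held lock: the node name is appended.
	fmt.Println(getHolderKey(wf, "xyz")) // argo/xyz/xyz

	// Key rebuilt on controller restart: resourceName == wfName, so the node
	// segment is dropped and the two keys never match.
	fmt.Println(getResourceKey("argo", "xyz", "xyz")) // argo/xyz
}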

@Gibstick
Author

@Gibstick would you mind providing an example where it does succeed? I am not entirely familiar with using mutexes in Argo myself.

@isubasinghe I didn't test every scenario, but the workaround we adopted was to move the synchronization to the top level of the workflow spec:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: deadlock-test-
spec:
  entrypoint: main-steps
  # mutex moved up here
  synchronization:
    mutex:
      name: "mutex-{{workflow.name}}"

  templates:
    - name: main-steps
      steps:
        - - name: main
            template: main-container

    - name: main-container
      container:
        image: "busybox:latest"
        command: ["sh"]
        args: ["-c", "sleep 10; exit 0"]

@isubasinghe
Member

Thanks for that @Gibstick, I will check why this works as well.

@sarabala1979 added the P1 (High priority) label and removed the P2 (Important) label on Jan 5, 2023
@sarabala1979
Member

@isubasinghe @Gibstick Can you try it on v3.4.4?

@Gibstick
Author

Gibstick commented Jan 5, 2023

Unfortunately I'm still able to reproduce it with the same steps. I made sure I was running the correct image version by running kubectl describe on the pod and verifying that the image quay.io/argoproj/workflow-controller@sha256:d06860f1394a94ac3ff8401126ef32ba28915aa6c3c982c7e607ea0b4dadb696 was indeed 3.4.4. The workflow YAML still shows the same deadlock.

stale bot added the problem/stale label on Jan 21, 2023
@Gibstick
Author

As far as I can tell this bug is still relevant

@caelan-io
Member

Update: we are working on a larger code refactor (#10191) that will resolve this issue, but it was delayed by the holiday season. We will update again asap.

stale bot removed the problem/stale label on Jan 24, 2023
stale bot added the problem/stale label on Mar 25, 2023
@caelan-io removed the problem/stale label on Mar 27, 2023
@Bwvolleyball

I think we are also hitting this issue - we have just upgraded to the latest Argo, 3.4.9, if that's helpful. Right now our synchronization semaphore is configured like this:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: template-name
spec:
  templates:
    - name: template-name
      synchronization:
        semaphore:
          configMapKeyRef:
            name: template-synchronization-config
            key: template-key
      container: # rest of the template config here.

I'm trying to decipher the comments in this issue to see if there's a different way we can set up this synchronization so that it won't be affected while the refactor is going on.

Is there an easy way to clear out the semaphore locks on Argo's side? Restarting the workflow-controller doesn't seem to be a viable way to recover.
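
If I'm reading the workaround above correctly, the equivalent change for our WorkflowTemplate would be to move the semaphore from the template level up to the workflow level, something like the sketch below (untested, and assuming the template is run as a whole workflow, e.g. via workflowTemplateRef, and that a single workflow-level lock is acceptable):

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: template-name
spec:
  # semaphore moved up to the workflow level, mirroring the mutex workaround above
  synchronization:
    semaphore:
      configMapKeyRef:
        name: template-synchronization-config
        key: template-key
  templates:
    - name: template-name
      container: # rest of the template config here.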

@agilgur5 agilgur5 added the solution/workaround There's a workaround, might not be great, but exists label Sep 10, 2023
@yossicohn

We also had this on v3.5.2... the solution was to restart the controller.
It does mean we have a major issue with workflows that use the mutex feature.
Would a semaphore with a value of 1 mitigate this?

@isubasinghe
Member

@Bwvolleyball and @yossicohn

This is a pretty bad issue caused by the key used for the lock being different on acquire and release.

We are still wondering which of the options offers the least compromises. If you have opinions on this, feel free to raise them here: #10267, because IMO that is the core issue.

I am thinking that a potential solution is to just dump this kind of data to the SQL database when that is available.

@SebastianGoeb

Any progress on this?

It seems to me that this is by definition distributed locking, and can't possibly be solved in-process (consider rolling upgrades or multiple controller pods).

I am thinking that a potential solution is to just dump this kind of data to the SQL database when that is available.

Can't that also be overlaid on etcd somehow?

@isubasinghe
Member

isubasinghe commented Aug 16, 2024

Working on a fix for this now that won't store the state in a ConfigMap (the solution I proposed previously); it will sadly be a breaking change though :(

EDIT: I was able to fix this such that it won't be a breaking change. See #13553

@Joibel closed this as completed in 440f1cc on Sep 18, 2024
Joibel pushed a commit to pipekit/argo-workflows that referenced this issue on Sep 19, 2024
Joibel pushed a commit that referenced this issue on Sep 20, 2024
@agilgur5 added this to the v3.5.x patches milestone on Sep 20, 2024
Joibel pushed a commit to Joibel/argo-workflows that referenced this issue on Dec 6, 2024
Fixes: argoproj#8684
Backports: argoproj#13553
Signed-off-by: isubasinghe <isitha@pipekit.io>
Co-authored-by: Isitha Subasinghe <isitha@pipekit.io>