Workflow deadlocks on mutex in steps template if controller is restarted #8684
I can still reproduce this issue using the same steps on v3.3.6.
Still happening on v3.3.8 with the same steps.
v3.3.8 is still the latest release, so I think it's fair to say this bug is still relevant.
I'm still able to reproduce this issue on v3.4.1 with the same steps.
I had a look at this, and I think this behaviour is fixable if the claim "The same issue does not occur if the mutex is on the container template or at the workflow level." is true. @Gibstick would you mind providing an example where it does succeed? I am not entirely familiar with using mutexes in Argo myself. @sarabala1979 could you please assign this to me? Thanks!
Okay, this was quite difficult to debug, but I think I finally found the reason why this doesn't work. Let's imagine we have a workflow named "xyz": there is something going on with how the holderKey is generated. I believe that fixing this should fix this bug.
In workflow/sync/sync_manager.go, the behaviour of how mutexes are implemented looks wrong to me; I believe this is what is causing #8684. The issue is that the two key-generation functions below produce different keys for the same lock holder. Despite this, if we look at how the Manager attempts to acquire a lock, it ends up with the incorrect names described above, which is why I believe this deadlock occurs. Could one of the maintainers please confirm which is the desired behaviour here? My belief is that getHolderKey is the correct one. If they are both correct, it could be the encoding and decoding from the configmap that is wrong.

```go
func getHolderKey(wf *wfv1.Workflow, nodeName string) string {
	if wf == nil {
		return ""
	}
	key := fmt.Sprintf("%s/%s", wf.Namespace, wf.Name)
	if nodeName != "" {
		key = fmt.Sprintf("%s/%s", key, nodeName)
	}
	return key
}

func getResourceKey(namespace, wfName, resourceName string) string {
	resourceKey := fmt.Sprintf("%s/%s", namespace, wfName)
	if resourceName != wfName {
		resourceKey = fmt.Sprintf("%s/%s", resourceKey, resourceName)
	}
	return resourceKey
}
```
@isubasinghe I didn't test every scenario, but the workaround we adopted was to move the synchronization to the top level of the workflow spec:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: deadlock-test-
spec:
  entrypoint: main-steps
  # mutex moved up here
  synchronization:
    mutex:
      name: "mutex-{{workflow.name}}"
  templates:
  - name: main-steps
    steps:
    - - name: main
        template: main-container
  - name: main-container
    container:
      image: "busybox:latest"
      command: ["sh"]
      args: ["-c", "sleep 10; exit 0"]
```
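For anyone trying to trigger the failure mode, the reproduction described in this issue boils down to restarting the controller while the mutex-holding step is still running. A rough sketch, assuming the controller runs as the workflow-controller Deployment in the argo namespace and that the manifest above is saved as deadlock-test.yaml (both assumptions):

```sh
# Submit the workflow (hypothetical filename).
argo submit -n argo deadlock-test.yaml

# While the step is still sleeping, restart the controller to simulate the
# restart described in this issue (deployment name/namespace are assumptions).
kubectl -n argo rollout restart deployment workflow-controller

# Watch the most recently submitted workflow; with the mutex on the steps
# template it stays stuck waiting on the lock it already holds.
argo watch -n argo @latest
```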
Thanks for that @Gibstick, I will check why this works as well.
@isubasinghe @Gibstick Can you try it on v3.4.4?
Unfortunately I'm still able to reproduce it with the same steps. I made sure I was running the correct image version.
As far as I can tell, this bug is still relevant.
Update: we are working on a larger code refactor (#10191) that will resolve this issue, but it was delayed by the holiday season. We will update again ASAP.
I think we are also hitting this issue - we have just upgraded to the latest Argo. Our template looks like this:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: template-name
spec:
  templates:
  - name: template-name
    synchronization:
      semaphore:
        configMapKeyRef:
          name: template-synchronization-config
          key: template-key
    container: # rest of the template config here
```

I'm trying to decipher the comments in this issue to see if there's a different way we can have this synchronization in place that won't be affected while the refactor is going on. Is there an easy way to clear out the semaphore locks on Argo's side? Restarting the workflow-controller doesn't seem to be a viable way to recover.
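There may not be a supported way to clear the lock state beyond what has already been mentioned, but you can at least see where the controller thinks the lock is held by inspecting the stuck workflow's status. A sketch, with placeholder namespace and workflow name (the ConfigMap name comes from the template above):

```sh
# Show the synchronization block recorded on the stuck workflow (placeholders).
kubectl -n argo get workflow my-stuck-workflow \
  -o jsonpath='{.status.synchronization}'

# The semaphore limits themselves live in the referenced ConfigMap.
kubectl -n argo get configmap template-synchronization-config -o yaml
```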
We also had this on v3.5.2; the solution was to restart the controller.
This is a pretty bad issue, caused by the key used for the lock being different on acquire and release. We are still wondering which option offers the fewest compromises. If you have opinions on this, feel free to raise them on #10267, because IMO that is the core issue. I am thinking that a potential solution is to just dump this kind of data to the SQL database once that is available.
Any progress on this? It seems to me that this is by definition distributed locking, and can't possibly be solved in-process (consider rolling upgrades or multiple controller pods). Can't that also be overlaid on etcd somehow?
Working on a fix for this now that won't be stored in a configmap (the solution I proposed previously); it will sadly be a breaking change though :( EDIT: I was able to fix this such that it won't be a breaking change. See #13553.
argoproj#8684 (argoproj#13553) Signed-off-by: isubasinghe <isitha@pipekit.io>
Fixes: argoproj#8684 Backports: argoproj#13553 Signed-off-by: isubasinghe <isitha@pipekit.io> Co-authored-by: Isitha Subasinghe <isitha@pipekit.io>
Checklist
Summary
What happened/what you expected to happen?
I expect my workflow to complete successfully even if the workflows controller is restarted. The same issue does not occur if the mutex is on the container template or at the workflow level.
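The reproduction workflow itself isn't included in this excerpt, but based on the title and the workaround posted earlier in the thread, the failing shape is the same workflow with the mutex declared on the steps template rather than on the workflow spec. An inferred sketch (an assumption, not the attached reproduction):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: deadlock-test-
spec:
  entrypoint: main-steps
  templates:
  - name: main-steps
    # mutex on the steps template -- the case that deadlocks after a
    # controller restart (inferred from the workaround above)
    synchronization:
      mutex:
        name: "mutex-{{workflow.name}}"
    steps:
    - - name: main
        template: main-container
  - name: main-container
    container:
      image: "busybox:latest"
      command: ["sh"]
      args: ["-c", "sleep 10; exit 0"]
```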
What version are you running?
v3.3.4 is what I've been using for this reproduction, but I've also been able to reproduce it in v3.3.3, v3.3.2, and v3.2.7 with a more complicated workflow.
Diagnostics
This is a workflow that reliably reproduces the issue. The steps I use are:
Pod logs
No pod logs from the main container since it just sleeps and exits.
Wait container logs
Workflow controller logs
Before killing (old workflow controller pod):
logs-before.txt
After killing (new workflow controller pod):
logs-after.txt
Workflow resource
Looking at the workflow after a while, it shows a deadlock in the synchronization section (sorry this doesn't match the logs above, but it's the same structure every time). It's holding and waiting on the same lock.
workflow-resource.txt (uploaded as txt cause GitHub doesn't allow .yaml)
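For readers who don't open the attachment, a deadlocked synchronization section looks roughly like the snippet below, with the same mutex listed under both holding and waiting. All names here are illustrative, not copied from the attached resource:

```yaml
status:
  synchronization:
    mutex:
      holding:
      - mutex: argo/Mutex/mutex-deadlock-test-abcde   # illustrative names
        holder: argo/deadlock-test-abcde/deadlock-test-abcde
      waiting:
      - mutex: argo/Mutex/mutex-deadlock-test-abcde
        holder: argo/deadlock-test-abcde/deadlock-test-abcde
```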
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.