Windows/EC2: ssm-document-worker process remaining after Service-Stop and resulting in cannot access IPC file + DeliveryTimedOut #569
Comments
Is there any update on this issue? We are seeing the same problem with the currently latest version, 3.3.551.0: 36 out of 349 hosts are affected.
Update: AWS Support has confirmed the bug, and the SSM team is reportedly working on it.
Hi @gmergulhao - It's good to know that we are not alone with this problem. We are still debugging this issue with AWS Support; it still happens occasionally on our Windows EC2 instances. The SSM team asked us to collect process data using the Microsoft tools procexp64.exe and handle.exe. I'm going to share the requested steps/data here: On one of the instances with the issue, please download the Process Explorer tool (procexp64.exe) and extract it:
Additionally, please install the handle tool (handle.exe):
Since any kind of data might help, could you try to collect this process information as well? Did you get more details on the resolution of this bug? I hope we can capture the process dump and related information soon and share it with the SSM team.
According to AWS, release 3.3.987.0 should include a workaround for this issue.
Yes, the SSM team released version 3.3.987.0 👍 - The changelog mentions:
We are using the AWS-managed document "AWS-UpdateSSMAgent", which uses the AWS reference file ssm-agent-manifest.json internally. This file defines the latest available version. At the moment, the new version 3.3.987.0 isn't available there yet. We hope to update our SSM agents on Windows soon and will report back whether the issue is fixed. I'll try to leave a short note here once version 3.3.987.0 is available (at least in eu-central-1).
Describe the bug
Current Behavior
As stated before, we apply a document to a number of Windows targets. A small share of them end up with the detailed status DeliveryTimedOut.
When we checked the local SSM Agent on Windows, we found the following pattern across the affected EC2 instances:
a) First, we checked amazon-ssm-agent.log and found the following information/error:
b) If we then stop the Windows service "Amazon SSM Agent", the process "ssm-document-worker" remains in the Task Manager process list!
c) If we now start the Windows service "Amazon SSM Agent" again, the [headless] "ssm-document-worker" process remains permanently. A second "ssm-document-worker" only shows up once a document is executed, so there are sometimes two "ssm-document-worker" processes. The issue with DeliveryTimedOut remains:
It seems that this leftover (zombie) "ssm-document-worker" process locks access to some internal files and blocks further execution of documents/Run Commands on this target! From then on, every association/Run Command sent to an SSM instance in this state stays "Pending" for a long time and then fails with "DeliveryTimedOut".
Workaround:
d) We need to stop the Windows service "Amazon SSM Agent" and kill the "ssm-document-worker" process via Task Manager using "End task". Afterwards we start the Windows service "Amazon SSM Agent" again and re-apply the association. It works right away (since the permanent, headless "ssm-document-worker" process is gone). In short: stop the service and kill any remaining ssm-document-worker processes.
Expected Behavior:
The instances with those documents/associations ran for many, many months without this bug pattern. We assume it may have started with the update from 3.3.380.0 to 3.3.418.0 (in our case on 15 May 2024), but we are not sure about this. At least, we see a growing number of issues. That said, we do not expect these DeliveryTimedOut results at all as long as the Windows service is in status Running.
OS Version / Host
OS: Microsoft Windows Server 2019 Datacenter (Platform-Version: 10.0.17763)
Host: EC2 Instance with IMDSv1 (Managed-Instance)
SSM Agent Version
Amazon SSM Agent Version: 3.3.418.0
Other information
I've opened AWS case 171620678600565 with the SSM team. We are sharing full logs and more details (region, instance ID, etc.) in that case. Feel free to request more details here as well; I'll do my best to upload them in anonymized form.
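In case it helps others hitting the same pattern: the manual workaround from step d) could probably be scripted. Below is a minimal Python sketch, not something provided by AWS. It assumes an elevated prompt, that `net stop`/`net start` accept the service display name "Amazon SSM Agent" as shown in the Services console, and that the leftover process image is called `ssm-document-worker.exe` (the Task Manager name plus `.exe`). The names `parse_worker_pids` and `restart_agent_cleanly` are made up for this example.

```python
import csv
import io
import subprocess
from typing import Callable, List

SERVICE_DISPLAY_NAME = "Amazon SSM Agent"  # display name as quoted in this report
WORKER_IMAGE = "ssm-document-worker.exe"   # assumption: process name plus .exe


def parse_worker_pids(tasklist_csv: str, image: str = WORKER_IMAGE) -> List[int]:
    """Extract the PIDs of `image` from `tasklist /FO CSV /NH` output."""
    pids = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Data rows look like: "image","pid","session","session#","mem usage".
        # The "INFO: No tasks are running ..." line has one field and is skipped.
        if len(row) >= 2 and row[0].lower() == image.lower() and row[1].isdigit():
            pids.append(int(row[1]))
    return pids


def restart_agent_cleanly(run: Callable = subprocess.run) -> List[int]:
    """Stop the service, kill leftover workers, then start the service again.

    Returns the PIDs of the leftover worker processes that were killed.
    """
    # `net stop` waits for the service to stop before returning.
    run(["net", "stop", SERVICE_DISPLAY_NAME], capture_output=True, text=True)

    # Any worker still listed now has survived the service stop (step b).
    listing = run(
        ["tasklist", "/FI", f"IMAGENAME eq {WORKER_IMAGE}", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
    )
    leftover = parse_worker_pids(listing.stdout)

    # Equivalent of "End task" in Task Manager, one PID at a time.
    for pid in leftover:
        run(["taskkill", "/F", "/PID", str(pid)], capture_output=True, text=True)

    run(["net", "start", SERVICE_DISPLAY_NAME], capture_output=True, text=True)
    return leftover
```

Calling `restart_agent_cleanly()` on an affected instance should return the PID(s) of the headless worker; an empty list would mean no worker survived the service stop. The `run` parameter only exists so the command sequence can be checked without touching a real service.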