Windows/EC2: ssm-document-worker process remaining after Service-Stop and resulting in cannot access IPC file + DeliveryTimedOut #569

Open
rgoltz opened this issue May 20, 2024 · 5 comments

Comments

@rgoltz

rgoltz commented May 20, 2024

Describe the bug

  • We currently see a recurring but intermittent issue with SSM Agent running on Windows (regular EC2 instances).
  • We use SSM Agent to execute Systems Manager Documents via scheduled Associations (= Run Command).
  • When the issue occurs, the Association Execution in State Manager (AWS Console) shows the Detailed Status "DeliveryTimedOut" for the affected target.

Current Behavior

  • As stated above, we apply a Document to a number of Windows targets. A small share of them end up with Detailed Status: DeliveryTimedOut
    00_AwsConsoleTimeout

  • When we checked the local SSM Agent on Windows, we found the following pattern across the affected EC2 instances:

a) First, we checked amazon-ssm-agent.log and found the following information/errors:

2024-05-20 04:01:15 INFO [CredentialRefresher] Credentials ready
2024-05-20 04:01:15 INFO [CredentialRefresher] Next credential rotation will be in 29.999736375 minutes
2024-05-20 04:10:20 ERROR [amazon-ssm-agent] Error occurred while removing the IPC file: remove C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\surveyor-20240519200113-476: The process cannot access the file because it is being used by another process.
2024-05-20 04:10:44 ERROR [amazon-ssm-agent] message C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\respondent-20240519200116-473 failed to read: open C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\respondent-20240519200116-473: The process cannot access the file because it is being used by another process. 
2024-05-20 04:11:21 ERROR [amazon-ssm-agent] Error occurred while removing the IPC file: remove C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\surveyor-20240519200113-477: The process cannot access the file because it is being used by another process.
2024-05-20 04:12:23 ERROR [amazon-ssm-agent] Error occurred while removing the IPC file: remove C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\surveyor-20240519200113-478: The process cannot access the file because it is being used by another process.
2024-05-20 04:31:15 INFO EC2RoleProvider Successfully connected with instance profile role credentials
2024-05-20 04:31:16 INFO [CredentialRefresher] Credentials ready
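
To check how often this pattern shows up on an instance, the log can be scanned for the lock errors above. This is only a minimal sketch: the function name `find_locked_ipc_files` is made up for illustration, and the regular expression only covers the `surveyor-…`/`respondent-…` file names visible in this excerpt.

```python
import re
from collections import Counter

# Error text taken from the log excerpt above.
LOCK_ERROR = "being used by another process"
# File-name shape as seen in the excerpt (an assumption for other agent versions).
CHANNEL_FILE = re.compile(r"\\channels\\health\\((?:surveyor|respondent)-\d+-\d+)")


def find_locked_ipc_files(log_lines):
    """Count lock errors per IPC channel file name."""
    hits = Counter()
    for line in log_lines:
        # Only ERROR lines that carry the sharing-violation message are relevant.
        if " ERROR " not in line or LOCK_ERROR not in line:
            continue
        match = CHANNEL_FILE.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits
```

The function accepts any iterable of lines, so an open file handle for amazon-ssm-agent.log can be passed in directly.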

b) If we then stop the Windows service "Amazon SSM Agent", the process "ssm-document-worker" remains in the Task Manager process list!
01_StoppedWithRunningWorkerSSM

c) If we now start the Windows service "Amazon SSM Agent" again, the [headless] "ssm-document-worker" process stays around permanently. A second "ssm-document-worker" only appears once a Document is executed, so there are sometimes two "ssm-document-worker" processes. The DeliveryTimedOut issue remains:
02_AfterStopStartSSM

It seems this leftover (zombie) "ssm-document-worker" process locks some internal files and blocks further execution of documents/run commands on this target!
As a result, all Associations/Run Commands sent to an instance in this state stay "Pending" for a long time and then fail with "DeliveryTimedOut".

Workaround:

d) We need to stop the Windows service "Amazon SSM Agent" and kill the "ssm-document-worker" process via Task Manager using "End task". Afterwards we start the Windows service "Amazon SSM Agent" again and re-apply the association. It works right away (since the permanent, headless "ssm-document-worker" process is gone). In short: stop the service and kill any remaining ssm-document-worker processes.
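
For anyone who has to repeat this on many instances, the manual steps could be scripted roughly as below. This is a sketch, not a tested fix: the service name `AmazonSSMAgent` and the image name `ssm-document-worker.exe` are assumptions (check them on your hosts), the function names are illustrative, and it has to run from an elevated prompt.

```python
import subprocess

SERVICE_NAME = "AmazonSSMAgent"           # assumed Windows service name
WORKER_IMAGE = "ssm-document-worker.exe"  # assumed process image name


def workaround_commands(service=SERVICE_NAME, image=WORKER_IMAGE):
    """Commands for: stop the service, kill leftover workers, start the service."""
    return [
        ["net", "stop", service],
        ["taskkill", "/F", "/IM", image],
        ["net", "start", service],
    ]


def apply_workaround(run=subprocess.run):
    """Run the steps in order and return each command's exit code."""
    codes = []
    for cmd in workaround_commands():
        # taskkill exits non-zero when no leftover worker exists,
        # so exit codes are collected instead of raising on failure.
        result = run(cmd, capture_output=True, text=True)
        codes.append(result.returncode)
    return codes
```

The `run` parameter only exists so the sequence can be dry-run with a stub before it touches a real service.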

Expected Behavior:

The instances have been running these Documents/Associations for many months without this bug pattern. We assume it may have started with the update from 3.3.380.0 to 3.3.418.0 (in our case on 15 May 2024), but we are not sure; at the very least we see a growing number of occurrences. That said, we do not expect any DeliveryTimedOut results at all as long as the Windows service is in status Running.

OS Version / Host

OS: Microsoft Windows Server 2019 Datacenter (Platform-Version: 10.0.17763)
Host: EC2 Instance with IMDSv1 (Managed-Instance)

SSM Agent Version

Amazon SSM Agent Version: 3.3.418.0

Other information

I've opened AWS case 171620678600565 with the SSM team. We are sharing full logs and more details (region, instance ID, etc.) in that case. Feel free to request more details here as well - I'll do my best to upload them in an anonymized way.

@gmergulhao

gmergulhao commented Jul 31, 2024

Is there any update on this issue? We are having the same issue with the currently latest version, 3.3.551.0.

36 out of 349 hosts affected.
ssm-document-worker is killed by the agent a couple of minutes after restarting. New associations stay pending until the stuck processes get killed, but are then able to execute and finish.

@gmergulhao

gmergulhao commented Aug 30, 2024

Update: AWS Support has confirmed the bug, and the SSM team is reportedly working on it.

@rgoltz
Author

rgoltz commented Sep 9, 2024

Hi @gmergulhao - It's good to know that you are not alone with the problem. We are still debugging this issue with AWS Support; it still happens occasionally on our Windows EC2 instances. The SSM team asked us to collect process data using the Microsoft tools procexp64.exe and handle.exe. I'm sharing the requested steps here:

On one of the affected instances, please download the Process Explorer tool (procexp64.exe) and extract it:

  • After the issue has occurred, please do not stop the SSM Agent.
  • From the extracted path, run procexp64.exe as administrator.
  • Execute any SSM Run Command that returns the same error on the target instance.
  • Once the Run Command has completed with the error, in Process Explorer select View > Show Lower Pane, then View > Lower Pane View > Handles.
  • Search for and select “amazon-ssm-agent.exe” > File > Save > save the file.
  • Select “ssm-agent-worker.exe” > File > Save > save the file.
  • Select “ssm-document-worker” > File > Save > save the file.

Additionally, please install the handle tool (handle.exe):

  • Extract the tool and open a command prompt as admin. Navigate to the extracted path
  • Run the command “Handle.exe ssm > c:\ssm_handle.txt”
  • This will generate the file ssm_handle.txt in C drive.
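
Once ssm_handle.txt exists, a quick filter like the one below can show which process holds handles on the IPC channel files. It is only a sketch: the name `channel_handle_holders` is illustrative, and it assumes each relevant line starts with the process name, contains `pid: <number>`, and includes the file path, so compare against your actual handle.exe output first.

```python
import re

# Assumed line shape: "<process> pid: <number> ... <path>"
PID = re.compile(r"pid:\s*(\d+)")


def channel_handle_holders(lines, needle="\\channels\\"):
    """Map process name -> set of PIDs for lines that mention an IPC channel path."""
    holders = {}
    for line in lines:
        # Case-insensitive match, since Windows paths are not case-sensitive.
        if needle.lower() not in line.lower():
            continue
        parts = line.split()
        pid = PID.search(line)
        if not parts or not pid:
            continue  # line does not follow the assumed shape
        holders.setdefault(parts[0], set()).add(int(pid.group(1)))
    return holders
```

If a stopped agent's "ssm-document-worker" still appears in the result, that would match the locking behaviour described earlier in this issue.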

Since any kind of data might help, maybe you could try to collect this process information as well? Did you get more details regarding the resolution of this bug? - I hope we can capture the process dump/information soon and share it with the SSM team.

@gmergulhao

gmergulhao commented Oct 2, 2024

According to AWS, release 3.3.987.0 should include a workaround for this issue.

@rgoltz
Author

rgoltz commented Oct 3, 2024

Yes, the SSM team released version 3.3.987.0 👍 - The changelog mentions:

Use exponential retry for document worker, increase retry interval and attempt count when reading IPC files
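
To illustrate what "exponential retry" means when reading a locked file, here is a generic sketch in Python. It is not the agent's actual implementation; the function name, default values and parameters are assumptions for illustration only.

```python
import time


def read_with_backoff(path, attempts=5, base_delay=0.1, factor=2.0,
                      opener=open, sleep=time.sleep):
    """Read a file, waiting exponentially longer after each sharing error."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            with opener(path, "rb") as fh:
                return fh.read()
        except PermissionError:
            # On Windows, "being used by another process" surfaces as PermissionError.
            if attempt == attempts:
                raise  # give up after the last attempt
            sleep(delay)
            delay *= factor  # e.g. 0.1 s, 0.2 s, 0.4 s, ...
```

A short lock held by another process is then simply waited out, while a permanent lock still fails once the attempts are used up.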

We are using the AWS-managed document "AWS-UpdateSSMAgent", which internally uses the AWS reference file ssm-agent-manifest.json. This file defines the latest available version; at the moment, the new version 3.3.987.0 isn't available there yet. We hope to update our SSM agents on Windows soon and will report back on whether the issue is fixed. I'll try to leave a short note here once version 3.3.987.0 is available (at least in eu-central-1).
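
To see which instances in a fleet already run a version with this change, the reported agent versions can be compared numerically. A small sketch with illustrative helper names; how you collect the installed version strings is up to you.

```python
def parse_version(text):
    """Turn '3.3.987.0' into (3, 3, 987, 0) so parts compare as numbers."""
    return tuple(int(part) for part in text.strip().split("."))


# First release whose changelog mentions the retry change, per this thread.
RETRY_CHANGE_VERSION = parse_version("3.3.987.0")


def includes_retry_change(installed):
    """True if the installed agent version is 3.3.987.0 or newer."""
    return parse_version(installed) >= RETRY_CHANGE_VERSION
```

Comparing tuples of integers avoids the trap of plain string comparison, which would sort a four-digit build number before "987".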
