
Using gMSA on multiple containers simultaneously causes Domain Trust Relationship to fail #405

Open
avin3sh opened this issue Aug 2, 2023 · 35 comments
Labels: 🔖 ADO (Has corresponding ADO item), bug (Something isn't working), gMSA (authentication account across containers), P0 (Needs attention ASAP)

Comments

@avin3sh

avin3sh commented Aug 2, 2023

Describe the bug
Running multiple containers simultaneously with the same gMSA, on the same host or on different hosts, causes one or more containers to lose their domain trust relationship, leading to various issues including LsaLookup and Negotiate auth failures. This happens especially when the number of containers is equal to or greater than the number of domain controllers in the environment. However, it is also possible to run into this issue with fewer containers than domain controllers, provided two or more containers attempt to talk to the same domain controller.

To Reproduce

  1. Build an image from the following Dockerfile
FROM mcr.microsoft.com/dotnet/aspnet:6.0-windowsservercore-ltsc2019 AS base

USER ContainerAdministrator
RUN reg.exe add "HKLM\SYSTEM\CurrentControlSet\Control\Lsa" /v LsaLookupCacheMaxSize /t REG_DWORD /d 0 /f

USER ContainerUser
ENTRYPOINT ["powershell.exe", "1..500 | %{ [void][System.Security.Principal.NTAccount]::new('contoso\\someobj').Translate([System.Security.Principal.SecurityIdentifier]).Value; Start-Sleep -Milliseconds (Get-Random -Minimum 100 -Maximum 1000); }"]

Replace contoso\someobj above with the SAM account name of an actual directory object.

  2. Run the container image simultaneously on multiple hosts using the following command. To increase the chances of running into the issue, if there are N domain controllers in the environment, run the container image simultaneously on at least N+1 hosts.
docker run --security-opt "credentialspec=file://gmsa-credspec.json" --hostname <gMSAName>  -it <image>

Replace <gMSAName> with the actual gMSA name, file://gmsa-credspec.json with the actual gMSA credential spec file, and <image> with the container image built in step 1.

  3. Monitor the output of all the containers; eventually one or more of them will start throwing the following error message. This usually happens within the first few seconds of the container starting, assuming the docker run ... in (2) above was executed simultaneously on different hosts. If it does not happen, repeat (2) until it does.

    Exception calling "Translate" with "1" argument(s): "The trust relationship between this workstation and the primary domain failed."

    While a running container is throwing the above error message in its output, exec into it and try performing a domain operation (see the illustration below) - that will fail as well.
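    As an illustration (not part of the original repro; contoso.com and contoso\someobj are the placeholder names used above), one way to check from inside a failing container:

        # Exec into the affected container...
        docker exec -it <container> powershell.exe
        # ...then query the secure channel (may require ContainerAdministrator)
        nltest.exe /sc_query:contoso.com
        # ...or repeat the SID translation used by the repro entrypoint
        [System.Security.Principal.NTAccount]::new('contoso\someobj').Translate([System.Security.Principal.SecurityIdentifier]).Value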

Expected behavior
Using gMSAs on multiple Windows containers has been officially supported since at least Windows Server 2019. Running a gMSA on multiple containers simultaneously should not cause the trust relationship to fail.

Configuration:

  • Edition: Windows Server 2022
  • Base Image being used: Windows Server Core
  • Container engine: docker

Additional context

  • While the reproducer uses a simple PowerShell loop to demonstrate the bug, we had originally run into this issue in an ASP.NET Core web application while performing Negotiate authentication.

  • The container image in the reproducer purposefully disables LSA LookUp Cache by setting LsaLookupCacheMaxSize to 0 to simplify the example.

  • If you observe the network traffic of a container that has run into this issue, the packet capture will show a lot of DSERPC/RPC_NETLOGON failure messages. You may also observe packets reporting nca_s_fault_sec_pkg_error.

  • Sometimes the container may "autorecover". It is purely a chance event. Whenever this happens, you can see RPC_NETLOGON packets in the network capture. Typically the container recovers its domain trust relationship only when the NETLOGON happens through a different domain controller than the one the container had earlier communicated with.

  • It is also possible to re-establish the domain trust relationship of a failing container by running the following command inside it (the runtime user should be ContainerAdministrator or have administrator privileges):

    nltest.exe /sc_reset:contoso.com

    If the above command does not succeed, you may have to run it more than once. When it succeeds, more often than not all of the affected containers, not just the current one, "recover".

  • As mentioned in the bug description, it is very easy to run into this issue when the count of containers is more than the number of domain controllers in the environment, but that is not the only scenario.

  • docker run ... is not the only way to run into this issue. It can also be reproduced on an orchestration platform like Kubernetes by setting the replica count of a Deployment to N+1, or by using the scaling feature (see the illustration below).
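    For illustration only (the Deployment name is a placeholder and N is the number of domain controllers in the environment):

        kubectl scale deployment <gmsa-workload> --replicas=<N+1>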

@avin3sh avin3sh added the bug Something isn't working label Aug 2, 2023
@ntrappe-msft ntrappe-msft added the triage New and needs attention label Aug 2, 2023
@ntrappe-msft
Contributor

Hi, thanks for bringing this issue to our attention. First, I have to give credit where credit is due. This is so well written up! Thank you for providing a very clear description of the current and expected behavior.

Second, a quick question: is there a reason why all the containers in this cluster have the same gMSA?

@avin3sh
Author

avin3sh commented Aug 7, 2023

Is there a reason why all the containers in this cluster have the same gMSA?

We actually don't use the same gMSA for all the containers in the cluster. Different types of application containers run with different gMSAs.

The problem arises when there are multiple instances (replicas) of the same application, such as an application that needs to be highly available. During my testing I also found that it does not have to be replicas of the same container image/deployment; different containers running as the same gMSA will also run into this issue.

Multiple containers running as the same gMSA can't be avoided for these purposes - without that we can't distribute our workload or guarantee high availability.

@ntrappe-msft ntrappe-msft added the gMSA authentication account across containers label Aug 9, 2023
@avin3sh
Author

avin3sh commented Sep 6, 2023

@ntrappe-msft has there been internal confirmation of this bug, and any discussion of a fix? This issue severely limits the ability to scale Windows containers and use AD authentication because of the direct relation between the number of containers and the number of domain controllers.

@ntrappe-msft
Contributor

Hi, thank you for your patience! We know this is blocking you right now and we're working hard to make sure it's resolved as soon as possible. We've reached out to the gMSA team to get more context on the problem and some troubleshooting suggestions.

@ntrappe-msft
Contributor

The gMSA team is still doing their investigation but they can confirm that this is unexpected and unusual behavior. We may ask for some logs in the future if it would help them diagnose the root cause.

@ntrappe-msft ntrappe-msft removed the triage New and needs attention label Oct 26, 2023
@ntrappe-msft
Contributor

Hi, could you give us a few follow-up details?

  • Are you using process-isolated or hyper-v isolated containers?
  • Are you using the same container hostname and gMSA name?
  • What is the host OS version?

@avin3sh
Author

avin3sh commented Nov 7, 2023

Hi Nicole @ntrappe-msft

Are you using process-isolated or hyper-v isolated containers?

Process Isolation

Are you using the same container hostname and gMSA name?

Correct

What is the host OS version?

Microsoft Windows Server 2022 Standard (Core), with October CU applied

Sharing some more data from our experiments, in case it helps the team troubleshoot the issue:

  1. When all the containers using a gMSA are each given a different, unique hostname value, the Domain Trust Relationship error at least goes away (a sketch of this workaround follows the list below) - although that may have broken something else; we did not look in that direction. However:

  2. If the hostname value for each container is >15 characters long, and the value is unique BUT the first 15 characters are not unique, we again start seeing the Domain Trust Relationship issue. This interestingly coincides with the 15-character limit on computer names (the NetBIOS name limit).

    This means that if you have a very long hostname value and the first characters are not unique, gMSA issues start occurring in the multi-container scenario.

    If you use a container orchestration solution like Kubernetes, the pod name, which is what gets supplied as the hostname value to the container runtime, is in all realistic scenarios >15 characters, and the first few characters are common to every pod (deployment name + ReplicaSet ID) -- so this causes problems with gMSAs there as well.

  3. Just out of curiosity, instead of the docker runtime I used containerd directly, and I could reproduce the problem there as well.

  4. Not specifying hostname when launching containers with the same gMSA does not give this error; I believe the container runtime internally assigns some random ID as the hostname value in that case (scenario (1) above) -- that seems to imply the problem here is multiple containers having the same name?

    In the context of containers with gMSA, setting the hostname to the gMSA name has been the norm for a while. Not specifying hostname isn't always possible, explicitly specifying it shouldn't break the status quo, and when using orchestration solutions, as in the example above, the user has no direct control over the hostname value.
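A rough sketch of the unique-short-hostname workaround from (1) and (2) above (the image tag and credential spec path are placeholders, not the actual ones we use):

    # Hypothetical sketch: give every replica a unique hostname that stays within
    # the 15-character NetBIOS limit, then pass it to docker run.
    $suffix   = -join ((48..57) + (97..122) | Get-Random -Count 6 | ForEach-Object { [char]$_ })
    $hostname = "app-$suffix"          # e.g. app-k3x9q2, well under 15 characters
    docker run --security-opt "credentialspec=file://gmsa-credspec.json" `
        --hostname $hostname -it repro-image:latest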

This issue has been severely restricting usage of Windows Containers at scale :(

@ntrappe-msft
Contributor

🔖 ADO 47828389

@avin3sh
Author

avin3sh commented Dec 27, 2023

While we appreciate that the Containers team is still looking into this issue, I wanted to share some insights into just how difficult this problem seems to be to work around.

In order to prevent requests from landing on "bad" containers, I tried to write a custom ASP.NET Core health check that could query the status of the container's trust relationship and mark the service as unhealthy when domain trust fails. What seemed like a very straightforward temporary fix/compromise for our problems turned out to be surprisingly complex (a rough sketch of the probe idea follows this list):

  • Firstly, the netapi32 DLL is not available in nanoserver, and won't be until the next major release of Windows Server - Support Kubernetes Go binaries in Nanoserver Images #72 (comment)
  • We could work around that by copying the DLL from a Server Core image into the nanoserver container, only to run into more problems
  • Within the gMSA container, the Win32 call will not automatically pick the Netlogon policy server
  • And if you do hardcode a domain controller for this purpose, the netlogon query response would still indicate that the trust relationship exists (NERR_Success as opposed to something like RPC_S_SERVER_UNAVAILABLE) - and this while the container is actively reporting trust errors when performing AD operations
  • And even if we had managed to get all of this to work, to "repair" the secure channel we would have to run our container as ContainerAdministrator, which introduces a bunch of other security concerns
  • PowerShell commands such as Test-ComputerSecureChannel simply fail, because the interpretation of "hostname" is different within a gMSA container vs. outside of it - where the command is typically used
  • In essence, all the means of [programmatically] catching gMSA and domain trust issues for containers, like the ones documented at https://kubernetes.io/docs/tasks/configure-pod-container/configure-gmsa/#troubleshooting, turned out to be unhelpful
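The probe idea mentioned above, as a rough sketch (this is not the health check we ended up with; contoso\someobj is the same placeholder object as in the repro):

    # Hypothetical probe: reuse the SID translation from the repro entrypoint and
    # surface a non-zero exit code on failure so a liveness/readiness probe can act on it.
    try {
        $account = [System.Security.Principal.NTAccount]::new('contoso\someobj')
        [void]$account.Translate([System.Security.Principal.SecurityIdentifier])
        exit 0   # translation succeeded; domain trust looks healthy
    } catch {
        Write-Error "Domain trust check failed: $($_.Exception.Message)"
        exit 1   # non-zero exit -> probe reports unhealthy
    }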

My guess for why the usual means of troubleshooting gMSA/trust problems are not working for us is probably the attempt to fix a VERY SIMILAR problem for containers in Server 2019:

We changed the behavior in Windows Server 2019 to separate the container identity from the machine name, allowing multiple containers to use the same gMSA simultaneously.

Since we do not understand how this was achieved, we have again reached a dead end and are desperately hoping the Containers team is able to solve our gMSA-Containers-At-Scale problem

@ntrappe-msft
Contributor

Thanks for the additional details. We've had a number of comments from internal and external teams struggling with the same issue. Our support team is still working to find a workaround that they can publish.

@ntrappe-msft
Contributor

Support team is still working on this. We'll make sure we also update our "troubleshoot gMSAs" documentation when we can address the problem.

@israelvaldez

israelvaldez commented Feb 28, 2024

We're also running into this issue. We're using Windows Server 2019 container images; however, there are not multiple container instances running with the same gMSA, and we still get the same error about trust.
In our case, when we try to log in with an AD user it doesn't work, but the gMSA itself does work. Should I raise a ticket with support for assistance?

Update:

  • All of our containers have the same host name even if they run using different gMSAs
  • Using a different name for the containers does not solve the issue

@avin3sh
Author

avin3sh commented Mar 6, 2024

Hello @ntrappe-msft - is the Containers team in touch with the gMSA/CCG group? Our support engineers informed us that we are the only ones who have reported this issue, but based on your confirmation in #405 (comment), and judging from the reactions on this issue, it is clear that many users have run into this exact problem.

Our case is that we try to login with an AD user it doesn't work, but the gMSA does work, should I raise a ticket with support for assistance.

@israelvaldez, see my above comment. I would think it is worth highlighting this problem to Microsoft Support from your end as well, so that it is obvious, beyond any doubt, that multiple customers face this and it can be appropriately prioritized (if it isn't already).

@WillsonAtJHG

Hi @ntrappe-msft, we are also experiencing the same issue: our gMSA containers intermittently lose trust with our domain and need to be restarted. Wondering if Microsoft has any update on this issue.

We have multiple container instances running the same app and using the gMSA. Interestingly, even though each of them has its own unique hostname defined, the log shows it connecting to the DC using the gMSA name as MachineName. Host/domain/DC names replaced with **.

EventID : 5720
MachineName : gmsa_**
Data : {138, 1, 0, 192}
Index : 1309
Category : (0)
CategoryNumber : 0
EntryType : Error
Message : The session setup to the Windows Domain Controller \** for the domain **
failed because the computer gmsa_** does not have a local security database account.
Source : NETLOGON
ReplacementStrings : {\**, **, **}
InstanceId : 5720
TimeGenerated : 13/03/2024 10:23:24 AM
TimeWritten : 13/03/2024 10:23:24 AM
UserName :
Site :
Container :

@ntrappe-msft
Contributor

@avin3sh you are definitely not the only one experiencing this issue. There are a number of internal teams who would like to increase the severity of this issue and the attention towards it. I'm crossing my fingers that we'll have a positive update soon. But it does help us if more people comment on this thread highlighting that they too are encountering this problem.

@macsux

macsux commented Mar 27, 2024

This is a huge issue for us at Broadcom, with multiple Fortune 100 customers wanting this feature in one of our products and thousands of workloads blocked from being migrated off VMs to containers.

@israelvaldez

In my scenario I created a new gMSA other than the one I was using (which was not being used in multiple pods) and I was able to work around this problem.
i.e. my pod had gmsa1; I created gmsa2 and suddenly the trust between the pod and the domain was fine.

@julesroussel3

The workaround is appreciated, but we would like to see Microsoft fix this issue directly so that customers do not need to significantly redesign their environments.

@avin3sh
Author

avin3sh commented May 3, 2024

This issue has been fortunate enough not to get the attention of auto-reminder bots so far, but I am afraid they will be here any time now. I see this has finally been assigned; does that mean a fix is in the works?

@julesroussel3

julesroussel3 commented Jun 4, 2024 via email

@ntrappe-msft ntrappe-msft added the 🔖 ADO Has corresponding ADO item label Jun 25, 2024
@avin3sh
Author

avin3sh commented Jul 26, 2024

We have started seeing a new issue with nanoserver images released from April onwards (build 20348.2402+): the HTTP service running inside the container has started throwing 'System.Net.InternalException' was thrown. -1073741428 (0xC000018C, STATUS_TRUSTED_DOMAIN_FAILURE), which, as per someone on the .NET platform side, translates to The trust relationship between the primary domain and the trusted domain failed. (see: dotnet/runtime#105567 (comment))

As a result, all our new containers are failing to serve ANY incoming kerberized requests!! This is no longer intermittent. This is no longer about the number of containers running simultaneously with a gMSA. This is a straight-up fatal error rendering the container pretty much unusable.

Now one would think "downgrading" to an older nanoserver image released prior to April would fix this? Wrong. That would make the problem even worse because of another unresolved Windows-Containers issue - #502 -- downgrading will potentially cause all the container infrastructure to BSOD!!!

To summarize,

  • the original issue remains unresolved!
  • from April onwards, you can't use the latest nanoserver images - or even the older ones
  • apps built off newer images are pretty much incapable of Kerberos
  • going back to images built from the March or earlier CUs can potentially cause your container hosts to BSOD

This issue desperately needs a fix. It's almost as if you can't use Windows Containers for any of your gMSA and Active Directory use cases anymore!

@KristofKlein

KristofKlein commented Aug 9, 2024

We are also facing a similar issue with the usage of gMSA in scaling Windows containers. We also provide a hostname at container creation, but in fact, due to gMSA, the containers identify themselves as the gMSA name. This leads to mismatches in our backend, which tries to keep track of incoming traffic. It gets heavily confused because all requests are coming from the same "machine". Of course, as long as I only have one container running that uses the one gMSA I am all good; the moment I scale, it crashes. (fun fact: the product that gets confused is also from Microsoft :P)

So also curious what will happen to this :)

Ultimately, this is what kills me (from here )
[screenshot]

Can't it append the container hostname as a suffix or something? :D

@ntrappe-msft ntrappe-msft added the P0 Needs attention ASAP label Aug 27, 2024
@NickVanRaaijT

NickVanRaaijT commented Sep 2, 2024

We appear to be facing a similar issue, "The trust relationship between the primary domain and the trusted domain failed", on our AKS cluster. Is this being worked on?

@vrapolinario
Contributor

Quick question on the environments in which you folks are seeing this issue: is NETBIOS enabled? NETBIOS uses ports 137, 138, and 139, with 139 being Netlogon. I have tested this with a customer (who was kind enough to validate their environment) for whom a deployment with multiple pods worked normally. This customer has NETBIOS disabled, and port 139 from the pods/AKS cluster to the domain controllers is blocked.

I'm not saying this is a fix, but wanted to check if others see this error even with NETBIOS disabled or the port blocked.

@avin3sh
Author

avin3sh commented Sep 11, 2024

From what I have found (I can do a more thorough test later), NETBIOS is disabled on the container host's primary interface and on the HNS management vNIC (we use Calico in VXLAN mode). However, the vNICs for individual pods show NETBIOS as enabled. We haven't done anything to block traffic on Port 139.

Do you suggest we perform a test after disabling NETBIOS on the pod vNICs as well, AND blocking port 139? I am not sure how to configure this within the CNI, but perhaps I can write a script to disable NETBIOS by making a registry change after the container's network has come up, unless you have a script handy that you could share (something along the lines of the sketch below).
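Roughly what I had in mind is this untested sketch (scoping it to just the pod vNICs is still an open question):

    # Untested sketch: disable NetBIOS over TCP/IP (NetbiosOptions = 2) on every
    # interface registered under the NetBT service. Requires administrator rights.
    Get-ChildItem 'HKLM:\SYSTEM\CurrentControlSet\Services\NetBT\Parameters\Interfaces' |
        ForEach-Object { Set-ItemProperty -Path $_.PSPath -Name NetbiosOptions -Value 2 }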

BTW just to reiterate the severity from my earlier comment #405 (comment) - nanoserver images after March 2024 have made this problem worse. Earlier the issue was intermittent and dependent on some environmental factors but March 2024+ nanoserver images are causing 100% failures.

@vrapolinario
Contributor

Thanks @avin3sh for the note. No need for a fancy script or worrying from the cluster/pod side - if you block port 139 at the network/NSG level, this should help validate. Again, I'm asking here for validation; we haven't been able to narrow it down yet, but we have customers running multiple containers simultaneously with no errors, and I noticed they have NETBIOS disabled AND port 139 blocked.

As for the Nano Server issue, can you please clarify: The issue happens even if you launch just one container? You're saying gMSA is not working on Nano Server at all?

@avin3sh
Author

avin3sh commented Sep 11, 2024

Thank you so much for clarifying. I will share my observation after blocking traffic on port 139.

As for the Nano Server issue, can you please clarify: The issue happens even if you launch just one container? You're saying gMSA is not working on Nano Server at all?

We have a bunch of ASP.NET services using the Negotiate/Kerberos authentication middleware. If I use an ASP.NET nanoserver image based on a Windows build from April 2024 or later, the Kerberos token exchange straight up fails and no request is able to get authenticated. You can see SSPI blob exchange functions in the error call stack - see here for the full call stack -> dotnet/runtime#105567 (comment)

So essentially our web services are not able to authenticate using Negotiate when using any image from April or later. This does not happen if I launch just one container, but it happens 100% of the time if there are multiple containers. I don't think I have seen this behavior with the beefier windowsserver images, but I can't say for sure as we don't generally use them due to their large size.

I have also seen varying behavior depending on whether the container user is ContainerUser or NT AUTHORITY\NetworkService - the issue exists in both scenarios but manifests differently.

@macsux

macsux commented Sep 11, 2024

@avin3sh a little off topic, but you may want to look at my project, which can seamlessly translate tokens from JWT to Kerberos and vice versa. It's often used as a sidecar and it doesn't require the container to be domain joined - it uses the Kerberos.NET library under the covers, which is a managed implementation instead of relying on SSPI.

https://github.com/NMica/NMica.Security

@avin3sh
Author

avin3sh commented Sep 12, 2024

@vrapolinario I tried this with port 139 blocked like so (plus the equivalent rules for UDP and for Outbound):

New-NetFirewallRule -DisplayName "Block Port 139" -Direction Inbound -LocalPort 139 -Protocol TCP -Action Block

But the problem persisted.

Any chance the customer who tried this had a large number of domain controllers in their environment? We have seen that as long as your deployment replica count is less than or equal to the number of domain controllers in the environment, you typically don't run into this issue.

@avin3sh
Author

avin3sh commented Sep 12, 2024

We are happy to collaborate with you to test out various scenarios/experimental patches/etc. We already have a Microsoft Support case ongoing (@ntrappe-msft may be familiar), but it hasn't moved in several months. If you want to take a look at our case, we are more than willing to validate any suggestions you may have for this problem.

@vrapolinario
Contributor

I believe I'm aware of the internal support case and I reached out to the team with this note as well. They are now running some internal tests, but I haven't heard back from them. The main thing I wanted you all to evaluate is whether your environment is for some reason using NETBIOS. The fact that some of you reported the DCs seeing the same hostname from the pod requests, with a 15-character limit, tells me there's some NETBIOS communication happening.

https://learn.microsoft.com/en-us/windows-server/identity/ad-ds/manage/dc-locator-changes

By default, DNS should be in use, so if you only see 15 characters in the hostnames going to the DCs, that tells me something is off. By disabling NETBIOS or blocking port 139, you can quickly check whether this helps solve the issue you are seeing.
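As a quick illustration (using the contoso.com placeholder domain from earlier in this thread), you can ask the DC locator which DC it resolves and check the flags it reports:

    # Illustrative check, not a fix: query the DC locator from a container or host.
    nltest.exe /dsgetdc:contoso.com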

@avin3sh
Author

avin3sh commented Sep 13, 2024

Blocking TCP/139 hasn't helped. I also tried blocking UDP/137 and UDP/138 out of curiosity but that does not seem to have made any difference either.

I started a packet capture even before the pods came up and reproduced the scenario; I don't see any packets going over TCP/139.

There is a bunch of chatter on TCP/135 (RPC), but of course I can't block it without disrupting other things on the host.

There are indeed RPC_NETLOGON (per Wireshark) packets originating from the containers during this time, but those are over random high-numbered ports, taking us back to my very first update. I believe this is just the Netlogon RPC happening over a port picked from the dynamic range 49152-65535.

Let me know if you want me to try something else.

@vrapolinario
Contributor

Thank you for the validation. We actually ran the same test last night, but I didn't have time to reply here. I can confirm that blocking TCP 139 won't solve the problem. Microsoft still recommends moving away from NETBIOS unless you need it for compatibility, but this is not the issue here.

We're still investigating this internally and will report back.

As for the Nano Server image issue, can I ask you to please open a separate issue here so we can investigate? These seem like two separate problems that are unrelated. The fact that you can't make the Nano Server image work at all indicates a different root cause.

@avin3sh
Author

avin3sh commented Sep 15, 2024

As for the Nano Server image issue, can I ask you to please open a separate issue here so we can investigate? These seem like two separate problems that are unrelated. The fact that you can't make the Nano Server image work at all indicates a different root cause.

@vrapolinario I have created a new issue, #537, with the exact steps to reproduce the bug. It's a simple ASP.NET Core minimal API web app with Kerberos enabled. Given that the error message is related to the domain trust failure, and that it does not happen when using NTLM but only with Kerberos, I strongly feel it may be related to the larger gMSA issue being discussed here, but I will wait for your analysis.

@NickVanRaaijT

NickVanRaaijT commented Sep 23, 2024

I've followed this guide on a new cluster https://learn.microsoft.com/en-us/virtualization/windowscontainers/manage-containers/gmsa-aks-ps-module

It results in the same error: 1786 (0x6FA) ERROR_NO_TRUST_LSA_SECRET.

This is with an AD server running Windows Server 2016 and an AKS cluster with Windows Server 2019 nodes with gMSA enabled.
