Pod crash on startup in StatefulSet with Consul Clustering #52

Open
wangjia184 opened this issue Jun 12, 2021 · 2 comments

wangjia184 commented Jun 12, 2021

Version 3.4.3.

siloBuilder
     .UseConsulClustering(opt =>
     {
         opt.Address = new Uri(AppConfig.Orleans.ConsulUrl);
         opt.AclClientToken = AppConfig.Orleans.AclClientToken;
     })
     .UseKubernetesHosting();

I configured the labels and environment variables for my pod according to the docs.

          - name: ORLEANS_SERVICE_ID #Required by Orleans 
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['orleans/serviceId']
          - name: ORLEANS_CLUSTER_ID #Required by Orleans 
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['orleans/clusterId']
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['statefulset.kubernetes.io/pod-name']
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP

I am running Orleans in a K8s StatefulSet. My CI tool deploys the StatefulSet, and then the pod crashes on startup.

System.AggregateException: One or more errors occurred. (Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Failed to get response from: [S10.18.123.218:11111:361110184])
 ---> Orleans.Runtime.MembershipService.OrleansClusterConnectivityCheckFailedException: Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Failed to get response from: [S10.18.123.218:11111:361110184]
   at Orleans.Runtime.MembershipService.MembershipAgent.ValidateInitialConnectivity()
   at Orleans.Runtime.MembershipService.MembershipAgent.BecomeActive()
   at Orleans.Runtime.MembershipService.MembershipAgent.<>c__DisplayClass26_0.<<Orleans-ILifecycleParticipant<Orleans-Runtime-ISiloLifecycle>-Participate>g__OnBecomeActiveStart|6>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct)
   at Orleans.LifecycleSubject.OnStart(CancellationToken ct)
   at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute()
   at Orleans.Runtime.Silo.StartAsync(CancellationToken cancellationToken)
   at Orleans.Hosting.SiloHost.StartAsync(CancellationToken cancellationToken)
   at Orleans.Hosting.SiloHostedService.StartAsync(CancellationToken cancellationToken)
   at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
   at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
   at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
   at UBS.OrleansServer.EntryPoint.Start() in /app/UBS/OrleansServer/EntryPoint.cs:line 102
   --- End of inner exception stack trace ---

wangjia184 commented Jun 12, 2021

I tried setting the StatefulSet's replicas to 3, and all pods crashed on startup. This happens even with an empty Consul, where no keys/values exist before the pods in the StatefulSet start.

fail: Orleans.Runtime.MembershipService.MembershipAgent[100661]
      Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Silos which did not respond successfully are: [S10.18.123.235:11111:361177868]. Will continue attempting to validate connectivity until 06/12/2021 07:19:33. Attempt #7
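
The silo addresses in these log lines have the form S<ip>:<port>:<generation>. As a rough aid for reading them, the sketch below splits an address into its fields and converts the generation to a timestamp. It assumes the generation counts seconds since 2010-01-01 UTC; that epoch is an assumption and is not stated anywhere in this thread. The names parse_silo_address, generation_to_utc and ASSUMED_EPOCH are hypothetical helpers, not Orleans APIs.

```python
from datetime import datetime, timedelta, timezone

# Assumption (not confirmed in this thread): the generation field counts
# seconds elapsed since 2010-01-01 00:00:00 UTC.
ASSUMED_EPOCH = datetime(2010, 1, 1, tzinfo=timezone.utc)


def parse_silo_address(text):
    """Split 'S<ip>:<port>:<generation>' into (ip, port, generation)."""
    if text.startswith("S"):
        text = text[1:]
    ip, port, generation = text.rsplit(":", 2)
    return ip, int(port), int(generation)


def generation_to_utc(generation):
    """Convert a generation number to a UTC datetime under the assumed epoch."""
    return ASSUMED_EPOCH + timedelta(seconds=generation)


if __name__ == "__main__":
    # Address copied from the log line above.
    ip, port, generation = parse_silo_address("S10.18.123.235:11111:361177868")
    print(ip, port, generation_to_utc(generation).isoformat())
```

Under that assumption, the unreachable silo's generation decodes to 2021-06-12 07:11:08 UTC, a few minutes before the 07:19:33 deadline in the log line, which would fit an earlier incarnation of a pod that has since been restarted (provided the log timestamps are also UTC).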

After the pods restarted over and over again, they finally stabilized and all started up. Please see the RESTARTS column below.

NAME                                 READY   STATUS             RESTARTS   AGE
ubs-job-dev-0                        1/1     Running            4          17m
ubs-job-dev-1                        1/1     Running            4          16m
ubs-job-dev-2                        1/1     Running            3          16m

The log reports 7 silos.

ProcessTableUpdate (called from TryUpdateMyStatusGlobalOnce) membership table: 7 silos, 3 are Active, 4 are Dead, Version=<33, 31015>. All silos: [SiloAddress=S10.18.123.246:11111:361178481 SiloName=ubs-job-dev-0 Status=Active, SiloAddress=S10.18.123.199:11111:361178519 SiloName=ubs-job-dev-1 Status=Active, SiloAddress=S10.18.117.114:11111:361178416 SiloName=ubs-job-dev-2 Status=Active, SiloAddress=S10.18.117.114:11111:361178292 SiloName=ubs-job-dev-2 Status=Dead, SiloAddress=S10.18.123.199:11111:361178366 SiloName=ubs-job-dev-1 Status=Dead, SiloAddress=S10.18.123.235:11111:361177868 SiloName=ubs-job-dev-0 Status=Dead, SiloAddress=S10.18.123.246:11111:361178329 SiloName=ubs-job-dev-0 Status=Dead]

And this is how it looks in Consul:
image

There are only 3 pods in this StatefulSet, yet the log reports 7 silos. The SiloName is the pod name, and unlike in a ReplicaSet, a pod's name in a StatefulSet does not change after a restart. It seems that a pod cannot reach the others on startup, so it crashes. The StatefulSet then restarts the crashed pod, and the newly started pod, which has the same pod name, is seen as a new silo.
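
To make the pattern in the membership table line quoted above easier to see, here is a small sketch that groups the entries by SiloName and orders them by the third field of the silo address (the generation). It only re-reads the log text copied from this issue; the names LOG, ENTRY and group_by_silo_name are hypothetical, and the regular expression assumes the exact "SiloAddress=... SiloName=... Status=..." layout shown in that line.

```python
import re
from collections import defaultdict

# Entries copied from the "All silos" list in the log line quoted above.
LOG = (
    "SiloAddress=S10.18.123.246:11111:361178481 SiloName=ubs-job-dev-0 Status=Active, "
    "SiloAddress=S10.18.123.199:11111:361178519 SiloName=ubs-job-dev-1 Status=Active, "
    "SiloAddress=S10.18.117.114:11111:361178416 SiloName=ubs-job-dev-2 Status=Active, "
    "SiloAddress=S10.18.117.114:11111:361178292 SiloName=ubs-job-dev-2 Status=Dead, "
    "SiloAddress=S10.18.123.199:11111:361178366 SiloName=ubs-job-dev-1 Status=Dead, "
    "SiloAddress=S10.18.123.235:11111:361177868 SiloName=ubs-job-dev-0 Status=Dead, "
    "SiloAddress=S10.18.123.246:11111:361178329 SiloName=ubs-job-dev-0 Status=Dead]"
)

# Assumed layout of one entry, based only on the line above.
ENTRY = re.compile(
    r"SiloAddress=S(?P<ip>[\d.]+):(?P<port>\d+):(?P<generation>\d+) "
    r"SiloName=(?P<name>\S+) Status=(?P<status>\w+)"
)


def group_by_silo_name(log_text):
    """Map each SiloName to its (generation, ip, status) entries, oldest first."""
    groups = defaultdict(list)
    for m in ENTRY.finditer(log_text):
        groups[m["name"]].append((int(m["generation"]), m["ip"], m["status"]))
    return {name: sorted(entries) for name, entries in groups.items()}


if __name__ == "__main__":
    for name, entries in sorted(group_by_silo_name(LOG).items()):
        print(name)
        for generation, ip, status in entries:
            print(f"  generation={generation} ip={ip} status={status}")
```

Grouped this way, the 7 entries collapse to the 3 pod names, and for each name only the entry with the highest generation is Active while the older ones are Dead. That matches the description above of each restart registering as a new silo under the same name.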

wangjia184 changed the title from "Crashes after several restart - Failed to get ping responses from 1 of 1 active silos" to "Crash on startup in StatefulSet with Consul Clustering" Jun 12, 2021
wangjia184 changed the title from "Crash on startup in StatefulSet with Consul Clustering" to "Pod crash on startup in StatefulSet with Consul Clustering" Jun 12, 2021

xontab commented Nov 23, 2021

Are you using K8s membership via the UseKubeMembership() extension method? It looks like in the examples above you are only using the official Orleans libraries, such as Microsoft.Orleans.OrleansConsulUtils and Microsoft.Orleans.Hosting.Kubernetes. If so, you need to report this issue to the official Orleans project, i.e. https://github.com/dotnet/orleans
