Support for startup sequence #104

Closed
Tracked by #239 ...
ahg-g opened this issue May 2, 2023 · 20 comments

ahg-g commented May 2, 2023

What would you like to be added:
A way to indicate if the list of Jobs should start in a specific order.

Why is this needed:
Some jobs require a specific start sequence of pods: sometimes the driver is expected to start first (as in Ray or Spark), while in other cases the workers are expected to be ready before the driver starts (as in MPI).

ahg-g commented May 2, 2023

@vsoch can you please share your thoughts on startup sequence?

I get that in some cases we want a specific job in the set to start first; however, are there requirements on how to handle failures? Or should the startup sequence be enforced only when the JobSet is first created?

ahg-g commented May 2, 2023

I am currently leaning towards a solution based on initContainers or sidecars that implements the startup sequence (and in-place pod restarts to recover from failures).
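
As a rough illustration of that direction, here is a minimal sketch (not a proposal for the API here) of the kind of gate an initContainer on the workers could run, blocking until the driver is reachable; the DRIVER_HOST variable, the DNS-based readiness check, and the poll interval are all illustrative assumptions.

// Minimal sketch of a "wait for the driver" gate an initContainer could run
// before the worker's main container starts. DRIVER_HOST (e.g. the driver
// pod's headless-service FQDN) and the 2s poll interval are illustrative
// assumptions, not part of any JobSet API.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	host := os.Getenv("DRIVER_HOST")
	for {
		// Treat the driver as "started" once its DNS record resolves; a real
		// gate might probe a readiness port instead.
		if _, err := net.LookupHost(host); err == nil {
			fmt.Println("driver is resolvable, starting worker")
			return
		}
		time.Sleep(2 * time.Second)
	}
}

A sidecar variant would be similar, except it would keep running and could re-run the check after a failure, which is roughly the in-place restart idea above.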

vsoch commented May 2, 2023

@vsoch can you please share your thoughts on startup sequence?

Yes of course! For our setup (which you can generalize to a main leader and workers) we would always want the leader to start first, so that when the workers come up they register with it without issue. We are getting this to work now with, basically, hacks in the main startup script to ensure a worker tries again if it doesn't connect. I'm not sure how I'd accomplish this with initContainers or a sidecar, because every pod is fully part of the cluster (i.e., there is no separate launcher), and I'd stay away from an additional sidecar because I think the Kueue folks didn't like that approach for their design. But I think I have something interesting to discuss (next section)!

Startup Sequence for Networking?

However, I'm running into some really interesting behavior tonight - it might not be related to this issue (but it does relate to starting up an indexed job) so I want to share. Basically, last week we re-designed the operator to not have a certificate generator pod, which was a pod we created before the indexed job just to generate a certificate for the cluster that would be saved to a ConfigMap. We were able to do this by building zeromq directly into the operator.

The weird behavior is that after that change, I've seen the cluster startup + running time (meaning the time it takes the broker to connect to all the pods and complete LAMMPS) slow down, sometimes alarmingly (today I saw 140 seconds when it used to be 11-12!). We were able to figure out that zeromq (which connects the workers to the broker) has an exponential backoff, so by rebuilding flux and setting this to a lower value (5 seconds) I was able to bring 140s down to 20s.

However, what I'm puzzled by (and the million dollar question!) is why the network is no longer ready to go when it was before. Removing a one-off pod that ran before the indexed job shouldn't have an influence, but it does! I can look at logs of previous runs (before the change) and see that there is no time needed to wait for the network created by the headless service, and the average before was ~12 seconds (not ~19-20).

But I think I've at least started to figure this out, or at least made an association with creating a one-off pod before an indexed job. As an example, here are operator times without any one-off pod (and this includes the fix to zeromq):

(image: plot of operator startup + running times without any one-off pod)

(that was a run of 20 but I had to remove the first time because it included the container pull). Notice the average and standard deviation.

Now I've added an nginx service container, which does nothing but exist, and is brought up in the same networking space (with the headless service) as the indexed job.

[12.358600854873657, 11.090389728546143, 11.065520763397217, 14.952529907226562, 16.31449556350708, 15.119110107421875, 15.652220249176025, 15.399735927581787, 10.43320083618164, 14.983035564422607, 11.204341650009155, 17.04622173309326, 15.252372741699219, 15.41066312789917, 21.148316144943237, 20.326151847839355, 15.42909049987793, 24.979089975357056, 31.33288335800171, 14.95423173904419]
Mean: 16.222610116004944
Std: 5.039630919413453

The standard deviation is about the same, but the startup + running time is ~3.5 seconds faster. That probably shouldn't happen, or at least it shouldn't be the case that other cluster objects are influencing my indexed job's networking.

Now - this was done with a traditional Kubernetes cluster in GKE (I didn't use core-dns, which @alculquicondor advised me to use, but I'm going to test that next).

The TLDR is that I think there is something going on here relevant to creating a headless service and the indexed job pods, and if a batch user does that, we might design it in a way so we can be confident that the cluster-level network is ready to go when the pods are. This is likely not a "user controlled startup sequence" and more of an internal control, but it's really important and I wanted to bring it up.

vsoch commented May 2, 2023

Wow this is unbelievable! Here is core-dns (still with the extra service pod):

[14.259131908416748, 12.555635213851929, 12.192086219787598, 12.01100206375122, 10.97166109085083, 11.45758605003357, 11.914217233657837, 11.017939567565918, 11.805388689041138, 11.476716041564941, 11.758479118347168, 11.720443964004517, 11.042527914047241, 11.698939561843872, 11.32136607170105, 10.888871908187866, 11.059098958969116, 11.843417882919312, 11.530989170074463, 11.510135889053345]
Mean: 11.701781725883484
Std: 0.7439125645998699

and here is core-dns without it:

[16.587246417999268, 13.604760646820068, 15.725329160690308, 11.451491117477417, 15.68161916732788, 17.88189125061035, 16.28657817840576, 11.681580543518066, 11.986695528030396, 11.731800556182861, 11.154186725616455, 16.223593711853027, 15.519461393356323, 15.18950366973877, 16.399006128311157, 15.302581071853638, 12.93296504020691, 15.491178512573242, 16.46683144569397, 11.697750806808472]
Mean: 14.449802553653717
Std: 2.1635772298895435

so even with core-dns, we see a slowdown without the "warmup" pod (as I'm calling it!)

@alculquicondor

Could you open a separate issue for networking on kubernetes/kubernetes? If possible, provide a repeatable scenario using just k8s APIs (Service, Pod, Job), so that people who are not experts in batch, but are experts in networking, can reproduce it.

this was done with a traditional Kubernetes cluster in GKE

There are 2 supported DNS stacks in GKE: kube-dns and clouddns. You shouldn't need core-dns. I suggested core-dns for your minikube experiments :)
https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--cluster-dns

aojea commented May 2, 2023

traditional Kubernetes cluster in GKE

there are many varieties of traditional 😄 , especially about DNS and pod network (DPv1 or DPv2)

is why the network is no longer ready to go, and before it was

OK, we need more data here: is it the IP connectivity that is slower, or the DNS resolution?
Which type of communication exactly is slower: from one pod to another, from one pod to a service, ...?

vsoch commented May 2, 2023

Could you open a separate issue for networking on kubernetes/kubernetes? If possible, provide a repeatable scenario using just k8s APIs (Service, Pod, Job), so that people who are not experts in batch, but are experts in networking, can reproduce it.

Sure, will add to my TODO! It's not at the top but I can do it in the next week or so.

There are 2 supported DNS stacks in GKE: kube-dns and clouddns. You shouldn't need core-dns. I suggested core-dns for your minikube experiments :)

Gotcha, but you mentioned the first is / is going to be deprecated? It seems to kind of be terrible, at least compared to the second!

OK, we need more data here: is it the IP connectivity that is slower, or the DNS resolution?
Which type of communication exactly is slower: from one pod to another, from one pod to a service, ...?

Basically, from the pod, if you did a basic ping of the fully qualified domain name: when we have the service pod created beforehand (which is also added to the headless service on the cluster), the DNS is available sooner. We aren't doing a ping explicitly; we are using zeromq to make the connection, which will connect immediately if it's there (and fall back to a timeout to try again if not). Sorry, I'm not experienced enough with DNS to answer your question, but if you can give me exactly the commands / example to test, I can try to set this up in a reproducing example. Technically speaking, you could reproduce it with the Flux operator, but probably developers don't want to install an operator and then run a script or commands (unless they are open to that, in which case I can write that up - we already have the testing script!)
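
If it helps, a minimal reproducer along those lines (hypothetical names, not the actual operator or Flux code) could simply time how long a peer's headless-service FQDN takes to first resolve from inside one of the indexed-job pods:

// Sketch of a DNS-readiness timer to run inside one of the indexed-job pods.
// TARGET_FQDN is a stand-in for a peer pod's headless-service name; the
// 10-minute deadline and 250ms poll interval are arbitrary choices.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	fqdn := os.Getenv("TARGET_FQDN")
	start := time.Now()
	deadline := start.Add(10 * time.Minute)
	for time.Now().Before(deadline) {
		if _, err := net.LookupHost(fqdn); err == nil {
			fmt.Printf("%s resolved after %s\n", fqdn, time.Since(start))
			return
		}
		time.Sleep(250 * time.Millisecond)
	}
	fmt.Printf("%s did not resolve within %s\n", fqdn, time.Since(start))
	os.Exit(1)
}

Comparing the printed times with and without the warmup pod would isolate DNS resolution from the zeromq backoff.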

vsoch commented May 5, 2023

I wanted to give an update (on the network issue above): I ran some experiments last night that show the issue: https://twitter.com/vsoch/status/1654249315580387329. I still need to reproduce it / make a dummy use case outside of the operator.

(image: plot of the experiment results from the thread above)

aojea commented May 5, 2023

Basically from the pod, if you did a basic ping of the fully qualified domain name, when we have the service pod created beforehand (which is also added to the headless service on the cluster) the DNS is available sooner.

If we can simplify the bootstrap problem to the minimum components involved, it will be easier. Let me see if I understand: the startup depends on a Service, and this Service has a single Pod, or a Deployment?

The DNS or the ClusterIP will not be "available" until the Pods the Service references are running and ready. Knowing the state of the associated EndpointSlices at each point in time will give you this information, since the DNS and kube-proxy depend on them to provide the connectivity.
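
To make that concrete, here is a sketch (assuming in-cluster credentials; the namespace and service name are placeholders) of counting how many ready endpoints the EndpointSlices behind a headless Service currently report, using the standard kubernetes.io/service-name label:

// Sketch: list the EndpointSlices that belong to a Service and count the
// endpoints reporting Ready. Namespace and service name are placeholders.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// EndpointSlices carry the kubernetes.io/service-name label pointing back
	// at the Service that owns them.
	slices, err := client.DiscoveryV1().EndpointSlices("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "kubernetes.io/service-name=my-headless-svc"})
	if err != nil {
		panic(err)
	}

	ready := 0
	for _, s := range slices.Items {
		for _, ep := range s.Endpoints {
			if ep.Conditions.Ready != nil && *ep.Conditions.Ready {
				ready++
			}
		}
	}
	fmt.Printf("ready endpoints: %d\n", ready)
}

Polling this (or watching the slices) right before the broker starts would show whether the DNS records are lagging behind the pods.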

vsoch commented May 5, 2023

The setup for the operator creates:

  • an indexed job
  • a few config maps
  • a headless service

and before that we had a one-off service pod (to generate an asset for the config map), so it was:

  • a pod
  • an indexed job
  • a few config maps
  • a headless service

For that configuration, the network was always ready by the time the broker (which needs to see the other pods in the indexed job) came up. When we removed that (the first configuration above), the network was never ready. When I emulated having this pod (the experiment plot above), it reproduced the error - so I think it's the general extra pod that is doing something.

What I'll try when I get a chance is to make a very bare bones operator that reproduces the above, and ideally I can delete components from the flux operator until (maybe?) the pattern I see above goes away, but not sure, haven't tried it yet! I'll definitely update here when I get a chance. The above has a linked repository (with the Flux Operator) that technically reproduces the issue (and I wrote out instructions) but ideally if I open an issue somewhere I can derive a much more minimal reproducing case.

aojea commented May 5, 2023

a headless service

this is the one that is going to generate the DNS records for your services, so what pods match this service selector?

vsoch commented May 5, 2023

this is the one that is going to generate the DNS records for your services, so what pods match this service selector?

For the original certificate generator, it was only the indexed job pods that matched - not the certificate generator pod. For the testing cases above (which I used to reproduce a similar case in the operator because you can specify a sidecar service container on the same network) that pod is also part of the same headless service.

@alculquicondor

What is the order of creation? Is it the following?

  1. service
  2. (exists in one case, but not the other) random-pod, matches the service
  3. indexed job, matches the service.

At some point you are saying that you were creating a random service, but then it looks like you meant pod?

@alculquicondor

I would suggest opening an issue in k/k, even if you don't have all the details. You can always post a minimal reproducer later.
Let's stop derailing this thread about job startup sequence :)

vsoch commented May 5, 2023

Sorry @alculquicondor ! 😆 I can definitely do that - I brought it up originally because it does seem somewhat related, and I have continued to share updates because it's a pretty cool bug :) You are right that it's not totally in the right place - I will find another spot if/when I can follow up.

kannon92 commented Jul 19, 2023

To go back to the topic at hand,

Could we follow a similar API as we have for SuccessPolicy or FailurePolicy?

type StartupPolicy struct {
	// Operator determines whether All or Any of the selected jobs must be ready to consider the replicated job ready.
	// +kubebuilder:validation:Enum=All;Any
	Operator Operator `json:"operator"`

	// TargetReplicatedJobs are the names of the replicated jobs the operator will apply to.
	// A null or empty list will apply to all replicatedJobs.
	TargetReplicatedJobs []string `json:"targetReplicatedJobs,omitempty"`
}

I was thinking of assuming that the TargetReplicatedJobs list corresponds to the order in which we want the replicated jobs to be ready?

One question I have: would we want the other jobs to be started in a suspended state and then unsuspended once the startupPolicy is satisfied?
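
To sketch that suspend/unsuspend idea (self-contained toy types, not the real JobSet controller): walk TargetReplicatedJobs in order, unsuspend the first replicated job that is not yet ready, and leave everything after it suspended.

// Toy model of a startup sequence driven by suspend/unsuspend. The
// ReplicatedJobState type and the readiness/suspension fields are
// hypothetical stand-ins for whatever the controller would track.
package main

import "fmt"

type ReplicatedJobState struct {
	Name      string
	Suspended bool
	Ready     bool
}

// nextToUnsuspend returns the replicated job that should be unsuspended now,
// or "" if the sequence is fully started or is waiting on a job that is
// running but not yet ready.
func nextToUnsuspend(order []string, state map[string]*ReplicatedJobState) string {
	for _, name := range order {
		rj := state[name]
		if rj.Ready {
			continue // this step of the sequence is done
		}
		if rj.Suspended {
			return name // unsuspend this one; later jobs stay suspended
		}
		return "" // running but not ready yet, so keep waiting
	}
	return ""
}

func main() {
	state := map[string]*ReplicatedJobState{
		"driver":  {Name: "driver", Ready: true},
		"workers": {Name: "workers", Suspended: true},
	}
	// The driver is ready, so the workers are next in line to be unsuspended.
	fmt.Println(nextToUnsuspend([]string{"driver", "workers"}, state))
}

On a failure, the controller could re-suspend the later jobs and re-run the same walk, which ties back to the earlier question about how to handle failures.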

kannon92 commented Aug 3, 2023

I wanted to take some of the ideas we discussed and consolidate them into a KEP.

/assign

Not sure I'll be able to take the implementation but wanted to draft a potential API and see!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 26, 2024
ahg-g commented Jan 26, 2024

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jan 26, 2024
@danielvegamyhre

Completed in #246

This issue was closed.