Support for startup sequence #104

Closed
Tracked by #239 ...
ahg-g opened this issue May 2, 2023 · 20 comments

ahg-g commented May 2, 2023

What would you like to be added:
A way to indicate if the list of Jobs should start in a specific order.

Why is this needed:
Some jobs require a specific start sequence of pods: sometimes the driver is expected to start first (as in Ray or Spark), while in other cases the workers are expected to be ready before the driver starts (as in MPI).

ahg-g commented May 2, 2023

@vsoch can you please share your thoughts on startup sequence?

I get that in some cases we want a specific job in the set to start first; however, are there requirements on how to handle failures? Or should the startup sequence be enforced only when the JobSet is first created?

ahg-g commented May 2, 2023

I am currently leaning towards a solution based on initContainers or sidecars that implements the startup sequence (and in-place pod restarts to recover from failures).
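
As a rough illustration of that direction, here is a minimal sketch (not a proposal for the API here) of the kind of gate an initContainer on the workers could run, blocking until the driver is reachable; the DRIVER_HOST variable, the DNS-based readiness check, and the poll interval are all illustrative assumptions.

// Minimal sketch of a "wait for the driver" gate an initContainer could run
// before the worker's main container starts. DRIVER_HOST (e.g. the driver
// pod's headless-service FQDN) and the 2s poll interval are illustrative
// assumptions, not part of any JobSet API.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	host := os.Getenv("DRIVER_HOST")
	for {
		// Treat the driver as "started" once its DNS record resolves; a real
		// gate might probe a readiness port instead.
		if _, err := net.LookupHost(host); err == nil {
			fmt.Println("driver is resolvable, starting worker")
			return
		}
		time.Sleep(2 * time.Second)
	}
}

A sidecar variant would be similar, except it would keep running and could re-run the check after a failure, which is roughly the in-place restart idea above.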

vsoch commented May 2, 2023

@vsoch can you please share your thoughts on startup sequence?

Yes of course! For our setup (which you can generalize to a main leader and workers) we would always want the leader to start first, so that when the workers come up they register with it without issue. We are getting this to work now with, basically, hacks in the main startup script to ensure a worker tries again if it doesn't connect. I'm not sure how I'd accomplish this with initContainers or a sidecar, because every pod is fully part of the cluster (i.e., there is no separate launcher), and I'd stay away from an additional sidecar because I think the Kueue folks didn't like that approach for their design. But I think I have something interesting to discuss (next section)!

Startup Sequence for Networking?

However, I'm running into some really interesting behavior tonight - it might not be related to this issue (but it does relate to starting up an indexed job) so I want to share. Basically, last week we re-designed the operator to not have a certificate generator pod, which was a pod we created before the indexed job just to generate a certificate for the cluster that would be saved to a ConfigMap. We were able to do this by building zeromq directly into the operator.

The weird behavior is that after that change, I've seen the cluster startup + running time (meaning the time it takes the broker to connect to all the pods and complete LAMMPS) slow down, sometimes alarmingly (today I saw 140 seconds when it used to be 11-12!). We were able to figure out that zeromq (which connects the workers to the broker) has an exponential backoff, so by rebuilding flux and setting this to a lower value (5 seconds) I was able to bring 140s down to 20s.

However, what I'm puzzled by (and the million dollar question!) is why the network is no longer ready to go when it was before. Removing a one-off pod that ran before the indexed job shouldn't have an influence, but it does! I can look at logs of previous runs (before the change) and see that there is no time needed to wait for the network created by the headless service, and the average before was ~12 seconds (not ~19-20).

But I think I've at least started to figure this out, or at least made an association with creating a one-off pod before an indexed job. As an example, here are operator times without any one-off pod (and this includes the fix to zeromq):

(image: plot of operator startup + running times without any one-off pod)

(that was a run of 20 but I had to remove the first time because it included the container pull). Notice the average and standard deviation.

Now I've added an nginx service container, which does nothing but exist, and is brought up in the same networking space (with the headless service) as the indexed job.

[12.358600854873657, 11.090389728546143, 11.065520763397217, 14.952529907226562, 16.31449556350708, 15.119110107421875, 15.652220249176025, 15.399735927581787, 10.43320083618164, 14.983035564422607, 11.204341650009155, 17.04622173309326, 15.252372741699219, 15.41066312789917, 21.148316144943237, 20.326151847839355, 15.42909049987793, 24.979089975357056, 31.33288335800171, 14.95423173904419]
Mean: 16.222610116004944
Std: 5.039630919413453

The standard deviation is about the same, but the startup + running time is ~3.5 seconds faster. That probably shouldn't happen, or at least it shouldn't be the case that other cluster objects are influencing my indexed job's networking.

Now - this was done with a traditional Kubernetes cluster in GKE (I didn't use core-dns, which @alculquicondor advised me to use, but I'm going to test that next).

The TLDR is that I think there is something going on here relevant to creating a headless service and the indexed job pods, and if a batch user does that, we might design it in a way so we can be confident that the cluster-level network is ready to go when the pods are. This is likely not a "user controlled startup sequence" and more of an internal control, but it's really important and I wanted to bring it up.

vsoch commented May 2, 2023

Wow this is unbelievable! Here is core-dns (still with the extra service pod):

[14.259131908416748, 12.555635213851929, 12.192086219787598, 12.01100206375122, 10.97166109085083, 11.45758605003357, 11.914217233657837, 11.017939567565918, 11.805388689041138, 11.476716041564941, 11.758479118347168, 11.720443964004517, 11.042527914047241, 11.698939561843872, 11.32136607170105, 10.888871908187866, 11.059098958969116, 11.843417882919312, 11.530989170074463, 11.510135889053345]
Mean: 11.701781725883484
Std: 0.7439125645998699

and here is core-dns without it:

[16.587246417999268, 13.604760646820068, 15.725329160690308, 11.451491117477417, 15.68161916732788, 17.88189125061035, 16.28657817840576, 11.681580543518066, 11.986695528030396, 11.731800556182861, 11.154186725616455, 16.223593711853027, 15.519461393356323, 15.18950366973877, 16.399006128311157, 15.302581071853638, 12.93296504020691, 15.491178512573242, 16.46683144569397, 11.697750806808472]
Mean: 14.449802553653717
Std: 2.1635772298895435

so even with core-dns, we see a slowdown without the "warmup" pod (as I'm calling it!)

@alculquicondor

Could you open a separate issue for networking on kubernetes/kubernetes? If possible, provide a repeatable scenario using just k8s APIs (Service, Pod, Job), so that people who are not experts in batch, but are experts in networking, can reproduce it.

this was done with a traditional Kubernetes cluster in GKE

There are 2 supported DNS stacks in GKE: kube-dns and clouddns. You shouldn't need core-dns. I suggested core-dns for your minikube experiments :)
https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--cluster-dns

aojea commented May 2, 2023

traditional Kubernetes cluster in GKE

there are many varieties of traditional 😄 , especially about DNS and pod network (DPv1 or DPv2)

is why the network is no longer ready to go, and before it was

OK, we need more data here: is it the IP connectivity that is slower, or the DNS resolution?
Which type of communication exactly is slower: from one pod to another, from one pod to a service, ...?

vsoch commented May 2, 2023

Could you open a separate issue for networking on kubernetes/kubernetes? If possible, provide a repeatable scenario using just k8s APIs (Service, Pod, Job), so that people who are not experts in batch, but are experts in networking, can reproduce it.

Sure, will add to my TODO! It's not at the top but I can do it in the next week or so.

There are 2 supported DNS stacks in GKE: kube-dns and clouddns. You shouldn't need core-dns. I suggested core-dns for your minikube experiments :)

Gotcha, but you mentioned the first is / is going to be deprecated? It seems to kind of be terrible, at least compared to the second!

OK, we need more data here: is it the IP connectivity that is slower, or the DNS resolution?
Which type of communication exactly is slower: from one pod to another, from one pod to a service, ...?

Basically, from the pod, if you did a basic ping of the fully qualified domain name: when we have the service pod created beforehand (which is also added to the headless service on the cluster), the DNS is available sooner. We aren't doing a ping explicitly; we are using zeromq to make the connection, which will connect immediately if it's there (and fall back to a timeout to try again if not). Sorry, I'm not experienced enough with DNS to answer your question, but if you can give me exactly the commands / example to test, I can try to set this up in a reproducing example. Technically speaking, you could reproduce it with the Flux operator, but probably developers don't want to install an operator and then run a script or commands (unless they are open to that, in which case I can write that up - we already have the testing script!)
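
If it helps, a minimal reproducer along those lines (hypothetical names, not the actual operator or Flux code) could simply time how long a peer's headless-service FQDN takes to first resolve from inside one of the indexed-job pods:

// Sketch of a DNS-readiness timer to run inside one of the indexed-job pods.
// TARGET_FQDN is a stand-in for a peer pod's headless-service name; the
// 10-minute deadline and 250ms poll interval are arbitrary choices.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	fqdn := os.Getenv("TARGET_FQDN")
	start := time.Now()
	deadline := start.Add(10 * time.Minute)
	for time.Now().Before(deadline) {
		if _, err := net.LookupHost(fqdn); err == nil {
			fmt.Printf("%s resolved after %s\n", fqdn, time.Since(start))
			return
		}
		time.Sleep(250 * time.Millisecond)
	}
	fmt.Printf("%s did not resolve within %s\n", fqdn, time.Since(start))
	os.Exit(1)
}

Comparing the printed times with and without the warmup pod would isolate DNS resolution from the zeromq backoff.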

vsoch commented May 5, 2023

I wanted to give an update (on the network issue above): I ran some experiments last night that show the issue: https://twitter.com/vsoch/status/1654249315580387329. I still need to reproduce it / make a dummy use case outside of the operator.

(image: plot of the experiment results from the thread above)

aojea commented May 5, 2023

Basically from the pod, if you did a basic ping of the fully qualified domain name, when we have the service pod created beforehand (which is also added to the headless service on the cluster) the DNS is available sooner.

If we can simplify the bootstrap problem to the minimum components involved, it will be easier. Let me see if I understand: the startup depends on a Service, and this Service has a single Pod, or a Deployment?

The DNS or the ClusterIP will not be "available" until the Pods the Service references are running and ready. Knowing the state of the associated EndpointSlices at each point in time will give you this information, since the DNS and kube-proxy depend on them to provide the connectivity.
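
To make that concrete, here is a sketch (assuming in-cluster credentials; the namespace and service name are placeholders) of counting how many ready endpoints the EndpointSlices behind a headless Service currently report, using the standard kubernetes.io/service-name label:

// Sketch: list the EndpointSlices that belong to a Service and count the
// endpoints reporting Ready. Namespace and service name are placeholders.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// EndpointSlices carry the kubernetes.io/service-name label pointing back
	// at the Service that owns them.
	slices, err := client.DiscoveryV1().EndpointSlices("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "kubernetes.io/service-name=my-headless-svc"})
	if err != nil {
		panic(err)
	}

	ready := 0
	for _, s := range slices.Items {
		for _, ep := range s.Endpoints {
			if ep.Conditions.Ready != nil && *ep.Conditions.Ready {
				ready++
			}
		}
	}
	fmt.Printf("ready endpoints: %d\n", ready)
}

Polling this (or watching the slices) right before the broker starts would show whether the DNS records are lagging behind the pods.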

vsoch commented May 5, 2023

The setup for the operator creates:

  • an indexed job
  • a few config maps
  • a headless service

and before that we had a one-off service pod (to generate an asset for the config map), so it was:

  • a pod
  • an indexed job
  • a few config maps
  • a headless service

For that configuration, the network was always ready by the time the broker (which needs to see the other pods in the indexed job) came up. When we removed that (the first configuration above), the network was never ready. When I emulated having this pod (the experiment plot above), it reproduced the error - so I think it's the general extra pod that is doing something.

What I'll try when I get a chance is to make a very bare bones operator that reproduces the above, and ideally I can delete components from the flux operator until (maybe?) the pattern I see above goes away, but not sure, haven't tried it yet! I'll definitely update here when I get a chance. The above has a linked repository (with the Flux Operator) that technically reproduces the issue (and I wrote out instructions) but ideally if I open an issue somewhere I can derive a much more minimal reproducing case.

aojea commented May 5, 2023

a headless service

this is the one that is going to generate the DNS records for your services, so what pods match this service selector?

vsoch commented May 5, 2023

this is the one that is going to generate the DNS records for your services, so what pods match this service selector?

For the original certificate generator, it was only the indexed job pods that matched - not the certificate generator pod. For the testing cases above (which I used to reproduce a similar case in the operator because you can specify a sidecar service container on the same network) that pod is also part of the same headless service.

@alculquicondor

What is the order of creation? Is it the following?

  1. service
  2. (exists in one case, but not the other) random-pod, matches the service
  3. indexed job, matches the service.

At some point you are saying that you were creating a random service, but then it looks like you meant pod?

@alculquicondor

I would suggest opening an issue in k/k, even if you don't have all the details. You can always post a minimal reproducer later.
Let's stop derailing this thread about job startup sequence :)

vsoch commented May 5, 2023

Sorry @alculquicondor ! 😆 I can definitely do that - I brought it up originally because it does seem somewhat related, and I have continued to share updates because it's a pretty cool bug :) You are right that it's not totally in the right place - I will find another spot if/when I can follow up.

kannon92 commented Jul 19, 2023

To go back to the topic at hand,

Could we follow a similar API as we have for SuccessPolicy or FailurePolicy?

type StartupPolicy struct {
	// Operator determines whether All or Any of the selected jobs must be ready to consider the replicated job ready.
	// +kubebuilder:validation:Enum=All;Any
	Operator Operator `json:"operator"`

	// TargetReplicatedJobs are the names of the replicated jobs the operator will apply to.
	// A null or empty list will apply to all replicatedJobs.
	TargetReplicatedJobs []string `json:"targetReplicatedJobs,omitempty"`
}

I was thinking of assuming that the TargetReplicatedJobs list corresponds to the order in which we want the replicated jobs to be ready?

One question I have: would we want the other jobs to be started in a suspended state and then unsuspended once the startupPolicy is satisfied?
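
To sketch that suspend/unsuspend idea (self-contained toy types, not the real JobSet controller): walk TargetReplicatedJobs in order, unsuspend the first replicated job that is not yet ready, and leave everything after it suspended.

// Toy model of a startup sequence driven by suspend/unsuspend. The
// ReplicatedJobState type and the readiness/suspension fields are
// hypothetical stand-ins for whatever the controller would track.
package main

import "fmt"

type ReplicatedJobState struct {
	Name      string
	Suspended bool
	Ready     bool
}

// nextToUnsuspend returns the replicated job that should be unsuspended now,
// or "" if the sequence is fully started or is waiting on a job that is
// running but not yet ready.
func nextToUnsuspend(order []string, state map[string]*ReplicatedJobState) string {
	for _, name := range order {
		rj := state[name]
		if rj.Ready {
			continue // this step of the sequence is done
		}
		if rj.Suspended {
			return name // unsuspend this one; later jobs stay suspended
		}
		return "" // running but not ready yet, so keep waiting
	}
	return ""
}

func main() {
	state := map[string]*ReplicatedJobState{
		"driver":  {Name: "driver", Ready: true},
		"workers": {Name: "workers", Suspended: true},
	}
	// The driver is ready, so the workers are next in line to be unsuspended.
	fmt.Println(nextToUnsuspend([]string{"driver", "workers"}, state))
}

On a failure, the controller could re-suspend the later jobs and re-run the same walk, which ties back to the earlier question about how to handle failures.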

kannon92 commented Aug 3, 2023

I wanted to take some of the ideas we discussed and consolidate them into a KEP.

/assign

Not sure I'll be able to take the implementation but wanted to draft a potential API and see!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 26, 2024
ahg-g commented Jan 26, 2024

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jan 26, 2024
@danielvegamyhre

Completed in #246

This issue was closed.