Remove s6-overlay and NMA sidecar annotation #689
Conversation
Looks good!
// This reconciler is a best effort. It only tries to surface meaningful
// error messages based on the events it sees. For this reason, no errors
// are emitted. We will log them then carry on to the next reconciler.
c.Log.Info("Failure detected in CrashLoopReconciler. Will continue on", "err", err)
Can you word this better?
if nmaStatus == nil {
    continue
}
if nmaStatus.RestartCount > 0 &&
This logical expression is quite complex. Can you add a comment breaking down what's going on?
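One way to address this comment is to break the compound condition into named clauses with a comment on each. The sketch below is illustrative only: the diff shows just the `RestartCount` check, so the remaining clauses and the simplified stand-in for `corev1.ContainerStatus` are assumptions.

```go
package main

import "fmt"

// ContainerStatus is a simplified stand-in for corev1.ContainerStatus
// (k8s.io/api/core/v1); only the fields needed here are modeled.
type ContainerStatus struct {
	RestartCount  int32
	Waiting       bool   // stand-in for State.Waiting != nil
	WaitingReason string // stand-in for State.Waiting.Reason
}

// isCrashLooping breaks the condition into named parts. Only the
// RestartCount check is visible in the diff; the rest is assumed.
func isCrashLooping(s *ContainerStatus) bool {
	if s == nil {
		return false
	}
	// The container must have restarted at least once...
	hasRestarted := s.RestartCount > 0
	// ...and currently be stuck waiting in CrashLoopBackOff.
	inBackOff := s.Waiting && s.WaitingReason == "CrashLoopBackOff"
	return hasRestarted && inBackOff
}

func main() {
	s := &ContainerStatus{RestartCount: 3, Waiting: true, WaitingReason: "CrashLoopBackOff"}
	fmt.Println(isCrashLooping(s))   // true
	fmt.Println(isCrashLooping(nil)) // false
}
```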
@@ -161,6 +161,12 @@ type PodFact struct {

	// The in-container path to the catalog. e.g. /catalog/vertdb/v_node0001_catalog
	catalogPath string

	// true if the pod's spec includes a sidecar to run the NMA
	hasNMASidecar bool
Suggested change:
- hasNMASidecar bool
+ hasNMASideCar bool
I'd prefer to keep the original name since we should only camel case on word boundaries (i.e. the word is "sidecar" not "side car")
pkg/controllers/vdb/podfacts.go
Outdated
// deployments because that will guarantee a running container. We cannot use
// the server in case the statefulset is (temporarily) setup for NMA sidecar but
// can't support that. It's a classic chicken-and-egg situation where we don't
// know the proper pod spec until we know the Vertica version, but you don't the
Suggested change:
- // know the proper pod spec until we know the Vertica version, but you don't the
+ // know the proper pod spec until we know the Vertica version, but you don't know the
pkg/events/event.go
Outdated
@@ -83,7 +83,7 @@ const (
	MgmtFailedDiskFull = "MgmtFailedDiskfull"
	LowLocalDataAvailSpace = "LowLocalDataAvailSpace"
	WrongImage = "WrongImage"
-	NMAInSidecarNotSupported = "NMAInSidecarNotSupported"
+	NMADeploymentIncompatibilty = "NMADeploymentIncompatibility"
Is it used anywhere?
No. I will remove it.
Why did you move some tests?

When you start a 24.1.0 server, we initially deploy with the assumption that there is an NMA sidecar. It isn't until the pods are running that we can check the version. So we then tear down the statefulset and rebuild it without a sidecar. There was one test (mount-certs) that was sensitive to this: it would run a kubectl exec in the server container, which may or may not be running. I think I could have added more checks to that e2e test, but at the time I opted to move it to leg 7, which runs on 24.2.0 only. Leg 3 was getting light on tests, so I then moved another test into it.
	}
}

func (c *CrashLoopReconciler) findNMAContainerStatus(pod *corev1.Pod) *corev1.ContainerStatus {
I see some k8s-related functions scattered around, so I wonder if at some point a separate package will be needed, especially to avoid code duplication.
Yes, we will have to keep an eye on it. I already moved this function out to the builder package (even before I saw this) to use the same logic somewhere else.
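For readers following along, the helper under discussion can be sketched like this. The stand-in types below are simplified versions of the `corev1` Pod types, and `nmaContainerName` is a hypothetical constant standing in for whatever name the operator's builder package actually uses.

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 Pod and ContainerStatus types.
type ContainerStatus struct {
	Name string
}

type PodStatus struct {
	ContainerStatuses []ContainerStatus
}

type Pod struct {
	Status PodStatus
}

// nmaContainerName is a hypothetical constant; the real name would come
// from the operator's builder package.
const nmaContainerName = "nma"

// findNMAContainerStatus scans the pod's container statuses for the NMA
// sidecar, returning nil when the sidecar is absent.
func findNMAContainerStatus(pod *Pod) *ContainerStatus {
	for i := range pod.Status.ContainerStatuses {
		if pod.Status.ContainerStatuses[i].Name == nmaContainerName {
			return &pod.Status.ContainerStatuses[i]
		}
	}
	return nil
}

func main() {
	pod := &Pod{Status: PodStatus{ContainerStatuses: []ContainerStatus{
		{Name: "server"}, {Name: "nma"},
	}}}
	fmt.Println(findNMAContainerStatus(pod).Name) // nma
}
```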
I hit an issue upgrading from 23.4 to 24.2, so there are a few more changes to the upgrade logic to handle that. I also added a new e2e leg (8) that runs this upgrade path to verify the change.
This removes the s6-overlay init process from the v2 server container. Now that we have full support for running the NMA in a sidecar, this systemd-style process isn't needed. One advantage of this change is that it allows us to run in OpenShift with the restricted SCC, which makes it easier to deploy Vertica in OpenShift because restricted is the default SCC. This will be available for consumption in the v24.2.0 server, to be released in April 2024.
Also removed in this change is the `vertica.com/run-nma-in-sidecar` annotation. This was previously needed to correctly set up the pod spec, but the operator can now determine it automatically. It defaults to using the NMA sidecar for all vclusterOps deployments. If it finds itself on 24.1.0, it will adjust the statefulset so that the NMA and server run in a single monolithic container. This also addresses different use cases as they relate to NMA deployment and the Vertica version.
… the `--startup-conf` setting, will fail immediately. We cannot run anything in these pods to even get the version. We added a new reconciler in the vdb controller that will check if the pod is in a crash loop backoff. If so, it will log an event message under the assumption that it's because we are trying to run vclusterOps on an old image.
… the `--startup-conf` setting that is new in 24.2.0. The pod facts collection will now use the NMA container, if present, for these cases.
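The two decisions the description walks through — choosing the deployment style from the version, and preferring the NMA container for pod-facts collection — can be sketched as below. This is a minimal illustration, not the operator's actual code: the version-string comparison and container names are simplifying assumptions.

```go
package main

import "fmt"

// hasNMASidecar sketches the automatic decision described above: default to
// an NMA sidecar for vclusterOps deployments, falling back to a monolithic
// container on 24.1.0, which cannot run the NMA separately. Comparing raw
// version strings is a simplification for this sketch.
func hasNMASidecar(vclusterOps bool, version string) bool {
	if !vclusterOps {
		return false
	}
	return version != "v24.1.0"
}

// pickExecContainer sketches the pod-facts behavior: prefer the NMA sidecar
// for exec-based collection, since it keeps running even while the server
// container crash loops. Container names are assumptions for illustration.
func pickExecContainer(nmaSidecar bool) string {
	if nmaSidecar {
		return "nma"
	}
	return "server"
}

func main() {
	version := "v24.1.0"
	sidecar := hasNMASidecar(true, version)
	fmt.Println(sidecar, pickExecContainer(sidecar)) // false server
}
```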