client: add support for checks in nomad services #13715

Merged
merged 2 commits on Jul 21, 2022
Conversation

shoenig
Member

@shoenig shoenig commented Jul 12, 2022

This PR adds support for specifying checks in services registered to
the built-in nomad service provider.

Currently only HTTP and TCP checks are supported, though more types
could be added later.

Closes #13717

Future Work https://github.com/hashicorp/team-nomad/issues/354

Docs & e2e in a follow-up PR.

An example job file to play around with:
job "fake" {
  datacenters = ["dc1"]

  group "fake" {

    network {
      mode = "bridge"
      port "http" { to = 9090 }
    }

    service {
      provider = "nomad"
      name     = "fake1"
      port     = "http"
      check {
        type     = "http"
        path     = "/"
        interval = "5s"
        timeout  = "1s"
      }
    }

    task "faketask" {
      driver = "docker"

      config {
        image = "nicholasjackson/fake-service:v0.23.1"
        ports = ["http"]
      }

      env {
        LISTEN_ADDR = "0.0.0.0:9090"
      }

      resources {
        cpu    = 10
        memory = 32
      }
    }
  }

  group "caching" {
    network {
      mode = "bridge"
      port "db" { to = 6379 }
    }

    service {
      provider = "nomad"
      name = "redis"
      port = "db"
      check {
        name     = "redis_tcp"
        type     = "tcp"
        interval = "10s"
        timeout  = "1s"
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

An example query to the API:

➜ nomad operator api /v1/allocation/fb261663-507f-95f1-ae02-7eaf14ea8998/checks  | jq .
{
  "3a6ceb54262f99cf87598b05bcb91af8": {
    "Check": "service: \"fake1\" check",
    "Group": "fake.fake[0]",
    "ID": "3a6ceb54262f99cf87598b05bcb91af8",
    "Mode": "healthiness",
    "Output": "nomad: http ok",
    "Service": "fake1",
    "Status": "success",
    "StatusCode": 200,
    "Timestamp": 1657663765
  }
}

Member

@jrasell jrasell left a comment

This looks awesome! No blockers and mostly just questions for my education.

result := o.checker.Do(o.ctx, o.qc, query)

// and put the results into the store
_ = o.checkStore.Set(o.allocID, result)
Member

Would it be worth adding a comment explaining why it's safe to ignore this error? shim.Set calls db.PutCheckResult, which, depending on the implementation, has the potential to return an error.

Member Author

Good catch! Added a missing error log statement for when the shim is unable to set the check status in the persistent store.
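
For reference, a minimal sketch of what that logging looks like (the logger field name and message are assumptions; checkStore, allocID, and result come from the snippet above):

// Persist the latest result; if the shim cannot write it, log the failure
// instead of silently dropping the error.
if err := o.checkStore.Set(o.allocID, result); err != nil {
    o.log.Error("failed to set check result in store", "alloc_id", o.allocID, "error", err)
}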

Comment on lines +267 to +269
if err := h.shim.Purge(h.allocID); err != nil {
h.logger.Error("failed to purge check results", "alloc_id", h.allocID, "error", err)
}
Member

Is there anything the operators can do if this log line is seen?

Member Author

Not really, no. PreKill doesn't report an error either, so it's not like we can prevent the client from continuing to purge the alloc. Though presumably, if the state store can't remove a check, it can't remove anything else either.

Comment on lines 57 to 61
results, err := s.db.GetCheckResults()
if err != nil {
s.log.Error("failed to restore health check results", "error", err)
return
}
Member

Am I right in thinking that we log the error rather than return it so that the client doesn't fail to start due to a problem restoring the check results, and that the results will be re-populated on the next trigger?

It might be useful to have a comment describing the behaviour regardless of whether my statement is correct.
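
For example, the comment could sit directly above the call (wording here is only a suggestion, based on the behaviour described above):

// Restore any check results persisted by a previous client process. Errors
// are logged rather than returned so that a problem reading old results never
// prevents the client from starting; the store is re-populated on the next
// check interval anyway.
results, err := s.db.GetCheckResults()
if err != nil {
    s.log.Error("failed to restore health check results", "error", err)
    return
}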

Comment on lines +103 to +108
m, exists := s.current[allocID]
if !exists {
return nil
}

return helper.CopyMap(m)
Member

Noting that helper.CopyMap handles nil maps, so we could do away with the exists check; however, this does read better.
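
For completeness, the shorter form that relies on that nil handling (assuming helper.CopyMap behaves as described above):

// Indexing a map with a missing key yields the zero value (a nil map),
// which helper.CopyMap is said to handle, so the exists check is optional.
return helper.CopyMap(s.current[allocID])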

const (
// maxTimeoutHTTP is a fail-safe value for the HTTP client, ensuring a Nomad
// Client does not leak goroutines hanging on to unresponsive endpoints.
maxTimeoutHTTP = 10 * time.Minute
Member

This seems somewhat high, but I don't have any idea how to choose a better value. Is there a particular reason it is set to 10 mins?

Member Author

@shoenig shoenig Jul 21, 2022

Yeah, it's basically "much larger than a reasonable HC timeout" ... and "less than infinity". In my mind the slowest of checks should be on the order of a few seconds, e.g. one incurring some database query.
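
One way such a fail-safe is typically applied (a sketch of the idea, not necessarily the exact code in this PR; the q.Timeout field is an assumption):

// Clamp the configured check timeout so a missing or absurd value can never
// leave the HTTP client hanging on an unresponsive endpoint indefinitely.
timeout := q.Timeout
if timeout <= 0 || timeout > maxTimeoutHTTP {
    timeout = maxTimeoutHTTP
}
client := &http.Client{Timeout: timeout}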


type checker struct {
log hclog.Logger
clock libtime.Clock
Member

Am I correct that this wrapper around the standard lib is mostly used for testing capabilities? I just want to understand when it is better to use this compared to calling the standard lib directly.

Member Author

@shoenig shoenig Jul 21, 2022

Yup, it's just for testing! At $prevJob we had excellent control over time in our code, making testing of time-based logic not just possible, but easy. I'd like to start bringing some of those patterns into Nomad. (And really, using indirection over time.Now is 90% of the solution.)
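
The pattern in a nutshell (a generic sketch of clock indirection; libtime's actual interface may differ):

// Code under test asks an injected Clock for the time instead of calling
// time.Now directly, so tests can substitute a deterministic clock.
type Clock interface {
    Now() time.Time
}

type realClock struct{}

func (realClock) Now() time.Time { return time.Now() }

type fixedClock struct{ t time.Time }

func (c fixedClock) Now() time.Time { return c.t }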

// will not move forward while the check is failing.
Healthiness CheckMode = "healthiness"

// A Readiness check is useful in the context of ensuring a service is
Member

Suggested change
// A Readiness check is useful in the context of ensuring a service is
// A Readiness check is useful in the context of ensuring a service

Member

@schmichael schmichael left a comment

Sorry for the partial review. Looking good so far: no functional issues. I'll pick it back up ASAP.

// l is used to lock shared fields listed below
l sync.Mutex
// lock is used to lock shared fields listed below
lock sync.Mutex
Member

Hm, I'm a mu person myself, let's see..

$ rg -I '\W[a-z]+\W+sync.Mutex' | sed -e 's/var//' | awk '{print $1}' | sort | uniq -c | sort -nr
     26 mu
     16 lock
      8 l
      1 m
      1 errmu
      1 acquire

Seems like I'm still winning, but you're catching up!

But seriously, anything is better than l, so 👍

Member Author

We'll let the people decide!

@@ -262,7 +308,7 @@ func (t *Tracker) watchTaskEvents() {
}

// Store the task states
t.l.Lock()
t.lock.Lock()
for task, state := range alloc.TaskStates {
//TODO(schmichael) for now skip unknown tasks as
Member

			//TODO(schmichael)

The scariest words I can see in a PR. I don't even think this comment is accurate. It seems to be copied and pasted from another place, but here we're iterating over alloc.TaskStates and updating a map that was originally populated from alloc.TaskStates ... I really don't see how group services could factor into this?

If you have 30 seconds to give this comment a think, and remove it if you also think it's nonsensical, I'd appreciate you cleaning up past-schmichael's messes.

I don't see how taskHealth[task] could ever be !ok, but we don't have to worry about changing the actual code here either.

Member Author

Heh yeah I was wondering about this 😅

@@ -321,17 +367,12 @@ func (t *Tracker) watchTaskEvents() {
t.setTaskHealth(false, false)

// Avoid the timer from firing at the old start time
Member

Suggested change
// Avoid the timer from firing at the old start time
// Prevent the timer from firing at the old start time

@@ -381,8 +455,12 @@ func (t *Tracker) watchConsulEvents() {
OUTER:
for {
select {

// we are shutting down
Member

This is dangerous phrasing. Is "we" the agent shutting down? I believe it just means this tracker is no longer needed and you don't need to know why. It could be canceled due to the alloc being stopped or another event causing the health to be set, but I don't think actually shutting down the agent closes it! So maybe:

Suggested change
// we are shutting down
// tracker has been canceled, no need to keep waiting

Member

@schmichael schmichael left a comment

Looks fantastic! Love all of the usecase specific types.

Changelog entry in this PR maybe?

Comment on lines +579 to +580
case <-checkTicker.C:
results = t.checkStore.List(allocID)
Member

Would be nice to add "blocking queries"/watching to checkStore so we could avoid polling here. Not a big deal here: the cardinality will be low relative to the 500ms timer, so the CPU savings would be meaningless. It might simplify testing by removing one timing-dependent component, though.
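
Purely to illustrate the watch idea (a hypothetical API, not something in this PR):

// Instead of polling on a ticker, the store could expose a channel that
// fires whenever check results for the allocation change.
select {
case <-ctx.Done():
    return
case results = <-t.checkStore.Watch(allocID):
    // re-evaluate health with the fresh results
}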

return alloc
}

var checkHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
Member

This is a fantastic testing approach!
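
The approach amounts to standing up a real listener in the test and pointing the check at it, roughly like this (handler behaviour here is illustrative):

// A throwaway HTTP server whose address the check under test can target.
ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
}))
defer ts.Close()
// the check's address and port are then derived from ts.URL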

Comment on lines +62 to +65
case "http":
qr = c.checkHTTP(timeout, qc, q)
default:
qr = c.checkTCP(timeout, qc, q)
Member

It's a bit surprising to me that we use an interface for Checkers when there's only one concrete implementation that just switches internally between check types. No need to change it: the interface is still useful for testing, and we can always split this up in the future if we have so many check types that the single struct becomes unwieldy.

Member Author

In my head this was going to be super elegant with implementations per type (expanding in the future)... reality didn't quite get there yet 😞
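
A per-type split might eventually look something like this (type and method names here are hypothetical, not the PR's):

// One implementation per check type behind a small shared interface.
type checkRunner interface {
    run(ctx context.Context) error
}

type httpRunner struct{ url string }
type tcpRunner struct{ addr string }

func (h httpRunner) run(ctx context.Context) error { /* issue a GET against h.url */ return nil }
func (t tcpRunner) run(ctx context.Context) error  { /* dial t.addr */ return nil }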

}

// nomad checks do not have warnings
if sc.OnUpdate == "ignore_warnings" {
Member

I wish we used more string consts... That s on the end would be easy to forget. Nothing we need to block this PR for though.

Member Author

I'll do this in a follow-up PR.
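
The follow-up would presumably end up with something along these lines (constant name is hypothetical):

// A named constant instead of a bare string literal, so the trailing "s"
// cannot be silently dropped at a call site.
const OnUpdateIgnoreWarnings = "ignore_warnings"

// nomad checks do not have warnings
if sc.OnUpdate == OnUpdateIgnoreWarnings {
    // ...
}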

@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 23, 2022