WIP: Added /-/healthy and /-/ready endpoints to all thanos components #656

FUSAKLA · 2018-12-03T00:03:38Z

Signed-off-by: Martin Chodur m.chodur@seznam.cz
Closes #644

Addition of liveness and readiness endpoints to all components.
Added package prober, which holds information on whether the component is healthy and ready.
It can be registered to a Mux or a Router, so it can be used in the metricHTTPListenGroup in components that do not have their own UI, as well as in components that have their own routing.

store
In the store, the initial loading of the cache was blocking the start of the HTTP server, so it couldn't expose the liveness check. Because of that, the initial cache update was moved to g as an actor, and the readiness of thanos-store is set to true only once the cache has been updated for the first time.
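A minimal sketch of that idea, assuming a plain goroutine instead of the run group actor used in Thanos. The `store` type, `startInitialSync`, and `syncFn` are hypothetical names for illustration, not taken from this PR:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// store is a hypothetical stand-in for the store component.
type store struct {
	ready int32 // 0 = not ready, 1 = ready; accessed atomically
}

func (s *store) isReady() bool { return atomic.LoadInt32(&s.ready) == 1 }

// startInitialSync runs syncFn in the background instead of blocking startup.
// Readiness flips to true only if the first sync succeeds.
func (s *store) startInitialSync(syncFn func() error) <-chan error {
	done := make(chan error, 1)
	go func() {
		err := syncFn()
		if err == nil {
			atomic.StoreInt32(&s.ready, 1)
		}
		done <- err
	}()
	return done
}

func main() {
	s := &store{}
	release := make(chan struct{})
	done := s.startInitialSync(func() error {
		<-release // simulates a slow initial cache load
		return nil
	})
	// The HTTP server could already be serving /-/healthy at this point.
	fmt.Println("ready before sync:", s.isReady())
	close(release)
	<-done
	fmt.Println("ready after sync:", s.isReady())
}
```

Because the sync no longer runs on the startup path, the liveness endpoint can answer while the cache is still loading.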

receive
As discussed on Slack, receive had two HTTP interfaces, which were merged together when adding the prober.
As a side effect it resolves #959

Verification

Tests are passing, and it was tested by starting every Thanos component type.

adrien-f · 2018-12-03T10:30:19Z

Awesome ! It's true that I only added the healthy route to the Querier, that was selfish 😄 !

bwplotka

Thanks!

Ok, this is quite confusing, as there are 2 types of health checks: readiness and liveness. In my opinion we should add both if we do NOT want to confuse users. Similar to what Prometheus did here:
prometheus/pushgateway#135
and this discussion: prometheus/pushgateway#105

So basically we need /-/ready and /-/healthy. Currently, for most components the handling would be the same (serve it in the HTTP metric server); however, for store it needs to be more complex:

liveness (healthy) needs to run from the beginning
readiness (ready) needs to be OK only once we synced all meta files and start serving gRPC requests.

What do you think?

cmd/thanos/main.go

kube/manifests/thanos-store.yaml

FUSAKLA · 2018-12-09T23:08:14Z

Again sorry for delay. Thank you for all the comments.

Adding also the /-/ready endpoint would be great.
I'll correct the /-/healthy endpoint not to be blocked by any startup operations.

Regarding the readiness probe:

Rule: Not aware of any other condition we should wait for before saying it's ready. Maybe gRPC serving also?
Store: Wait until all meta files are synced and gRPC serving has started.
Query: Possibly wait for getHealthyStores? But maybe it should be ready as soon as the UI works.
Compactor: Not sure if it should even have a readiness probe, since it does not have any API.
Sidecar: Here we could check promUp for the ready state, since block shipping would continue even when the component is not ready?

If we agree on the correct way to check for the ready state in every component I'd be happy to add it.

FUSAKLA · 2018-12-10T23:31:44Z

So I refactored it to match (I hope) my suggestions in previous comment.
Now every component should have /-/healthy and /-/ready endpoints, and the readiness probe should hopefully reflect the state in which all API endpoints of the component, HTTP and gRPC (or just one of them if the other is not present), are ready to serve. The liveness (healthy) probe should be served soon after start, before any blocking operations.

Would something like that be acceptable? I'd add some tests, but those run<Component> functions are huge, and to test those instrumentation endpoints without spinning up the whole node I'd have to split out the HTTP server initialization, which would be a bigger change.

cmd/thanos/rule.go

adrien-f · 2018-12-31T13:37:50Z

It feels weird to have `registerHealthy` and `registerReady` and only use them for the sidecar and the store components while reimplementing them in ruler and querier. Could these not be used everywhere?

bwplotka

Ok, so the direction is very nice, but some suggestions.

The major problem is that we have race issues here. Remember that in Golang almost nothing is atomic (thread safe) out of the box. Not even boolean value. We need to change our code to be concurrent safe here as we set/read xxxIsReady from different go routines.

We could wrap it with a lock or sync/atomic, but actually we can design something nicer, following the suggestion @adrien-f gave: to have a generic method for this everywhere.

Let's focus on what is generic. Generic is:

registration. It is always on the same path, but either on a mux or a router.
Readiness and healthiness handling based on `IsHealthy` or `IsReady` methods.

This allows us to define struct that everyone will use with following methods:

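// Prober holds the readiness and healthiness state; a nil error means OK.
type Prober struct {
	readyMtx    sync.Mutex
	readiness   error
	healthyMtx  sync.Mutex
	healthiness error
}

func NewProbeInRouter(..) *Prober
func NewProbeInMux(..) *Prober

// IsReady returns nil when ready, otherwise the reason it is not.
func (p *Prober) IsReady() error {
	p.readyMtx.Lock()
	defer p.readyMtx.Unlock()
	return p.readiness
}

// Ready marks the component as ready.
func (p *Prober) Ready() {
	p.NotReady(nil)
}

// NotReady records why the component is not ready.
func (p *Prober) NotReady(err error) {
	p.readyMtx.Lock()
	defer p.readyMtx.Unlock()
	p.readiness = err
}

// etc...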

What do you think? (:

CHANGELOG.md

cmd/thanos/compact.go

cmd/thanos/main.go

cmd/thanos/query.go

cmd/thanos/rule.go

cmd/thanos/sidecar.go

FUSAKLA · 2019-01-13T02:21:45Z

Sorry about the delay I cannot find the time to finish this.
(I resolved all the comments without commenting because the code was completely rewritten, sorry for that)

Thank you so much for all the comments both of you. @bwplotka your suggestions on implementation were great so I implemented it as you suggested. At least I hope I understood you well :)

The Prober should be covered with tests, and there is still a test for the basic HTTP endpoints in main_test.go. All are passing, and I tried building and running all the components, which also works.
It works nicely, for example, for a query node without configuration, where it is never ready and returns an error when calling the Prometheus API.

I'd be glad to discuss further whether the points where I'm setting nodes ready and healthy are OK, and whether any should be added or moved.

Thanks for all the advice!

pkg/prober/prober.go

bwplotka · 2019-03-18T15:58:34Z

@FUSAKLA can we get back to this? Rebase and change the title of the PR to reflect the changes? I think this has been hitting us more recently (:

CC @SuperQ

FUSAKLA · 2019-03-18T16:09:06Z

ouch.. yep I'll take a look and re-base it so we can finish this off

bwplotka

Thanks for this, but I am still seeing:

unresolved comments
readiness used in places where healthiness should be used? Lots of inconsistencies IMO

cmd/thanos/compact.go

cmd/thanos/query.go

bwplotka · 2019-04-15T11:58:21Z

cmd/thanos/sidecar.go

@@ -133,8 +140,9 @@ func runSidecar(
 						"msg", "failed to fetch initial external labels. Is Prometheus running? Retrying",
 						"err", err,
 					)
+					readinessProber.SetNotReady(err)


So... because we use metricHTTPListenGroup it's ready, and then suddenly not ready here? I think it's quite a nasty race. Being marked ready and then suddenly not means that the container will be restarted; however, we have a retry here.

Hm, in this case it is a bit unfortunate, that's true.
Being marked not ready does not cause a restart of the container; only being not healthy would cause that. But it could cause requests to be sent to the sidecar even when it hasn't yet fetched the external labels.

I'll leave just the readinessProber.SetHealthy() there and set the readiness outside of the metricHTTPListenGroup, depending on each component.

Moved away setting the ready status from the default http listener
5e9a4c4

bwplotka · 2019-04-15T11:59:56Z

cmd/thanos/sidecar.go

@@ -172,32 +180,34 @@ func runSidecar(
 				if err := m.UpdateLabels(iterCtx, logger); err != nil {
 					level.Warn(logger).Log("msg", "heartbeat failed", "err", err)
 					promUp.Set(0)
+					readinessProber.SetNotReady(err)


Isn't this healthyness?

I wouldn't say so. You don't want to get restarted when Prometheus just doesn't respond for the external labels query or do you?

bwplotka · 2019-04-15T12:01:06Z

cmd/thanos/sidecar.go

 				} else {
 					// Update gossip.
 					peer.SetLabels(m.LabelsPB())

 					promUp.Set(1)
+					readinessProber.SetReady()


bwplotka · 2019-04-15T12:01:42Z

cmd/thanos/sidecar.go

 			return errors.Wrap(s.Serve(l), "serve gRPC")
-		}, func(error) {
+		}, func(err error) {
+			readinessProber.SetNotReady(err)


I mean, setting it in one function like this is enough

also healthiness

I'm not sure I understand correctly what exactly you mean by setting it in one function.
Do you mean dropping the change of the ready status on the Prometheus external labels fetch altogether?

Also not sure about readiness vs healthiness. The sidecar in this case could still be shipping some buckets to object storage, so killing it just because the gRPC interface malfunctions could be too harsh?

bwplotka · 2019-04-15T12:08:30Z

@FUSAKLA

I personally don't like the Store liveness blocked by bucket init otherwise I'd say it's ok?
I'd be glad for any suggestions, thanks!

Let's fix this in later PR. It's not trivial

ready: Once gRPC starts listening (can change same as prom_up metric)

This is tricky. Why readiness fails? Not liveness?

Also wonder if that is not too flaky.. but let's say it's ok

brancz

Generally this is looking pretty good. It changes a lot of behavior, so it feels easy to miss something, but I think we can move forward with this. It'd be good if @bwplotka could make the final call.

cmd/thanos/main_test.go

test/e2e/spinup_test.go

xjewer · 2019-05-10T12:31:56Z

fixes #532

FUSAKLA · 2019-05-12T07:22:41Z

Rebased on master

GiedriusS · 2019-05-27T14:25:47Z

pkg/prober/prober.go

+	return prober
+}
+
+// HandleInMux registers readiness and liveness probes to mux.


This comment seems off. The method is called RegisterInRouter.

Thanks, it was leftover after refactoring.

GiedriusS · 2019-05-27T14:29:07Z

pkg/prober/prober.go

+			f(w, r)
+			return
+		}
+		p.writeResponse(w, p.IsReady, "ready")


There's a small error here. By the time you call this, p.IsReady() might suddenly start indicating that it is ready, right? AFAICT you need to do both of these actions while p.readyMtx is locked.

Good point! The chances are really small, but this is still a race. Thanks!

PTAL if this way it's ok with you


FUSAKLA · 2019-06-10T21:16:55Z

Just to be clear, we agreed with @bwplotka that I'll split this PR to multiple smaller ones because this is too big to review and could cause various issues regarding number of changes in behavior.
I'll change this to a draft or wip or something like this until I split it up

Thanks Bartek for making the call 👍

EDIT: Changing from PR to draft back is not possible unfortunately so added WIP: to the name for now.

FUSAKLA · 2019-07-02T06:05:13Z

#1297

bwplotka · 2019-09-17T17:06:55Z

This was split into smaller PRs by @FUSAKLA. Thanks for this! I think this means we can close this one? (:

FUSAKLA · 2019-09-17T17:20:35Z

Yes definitely to avoid confusion, thanks 👍

FUSAKLA force-pushed the fus-add-health-endpoint branch from 98785a9 to f0fbd10 Compare December 3, 2018 00:22

bwplotka requested changes Dec 3, 2018

FUSAKLA force-pushed the fus-add-health-endpoint branch 2 times, most recently from d6e4b37 to 16fd343 Compare December 10, 2018 23:26

adrien-f reviewed Dec 31, 2018

bwplotka requested changes Dec 31, 2018

FUSAKLA force-pushed the fus-add-health-endpoint branch 2 times, most recently from b103c49 to f518ced Compare January 13, 2019 02:12

FUSAKLA force-pushed the fus-add-health-endpoint branch from f518ced to 32d55d7 Compare January 13, 2019 02:28

FUSAKLA commented Jan 13, 2019

FUSAKLA force-pushed the fus-add-health-endpoint branch 2 times, most recently from 9680def to 05b7eb2 Compare January 13, 2019 16:49

bwplotka added the priority: P1 label Jan 29, 2019

domgreen added the state: changes-requested label Feb 8, 2019

bwplotka added state: in-review and removed state: changes-requested labels Feb 28, 2019

FUSAKLA force-pushed the fus-add-health-endpoint branch from 05b7eb2 to d505781 Compare March 22, 2019 23:25

FUSAKLA force-pushed the fus-add-health-endpoint branch from a8d9a27 to 059759c Compare April 15, 2019 10:24

bwplotka requested changes Apr 15, 2019

FUSAKLA force-pushed the fus-add-health-endpoint branch from 059759c to 0de1bba Compare April 22, 2019 13:27

brancz reviewed Apr 30, 2019

FUSAKLA force-pushed the fus-add-health-endpoint branch from 5e9a4c4 to b51c5a5 Compare May 12, 2019 07:18

FUSAKLA force-pushed the fus-add-health-endpoint branch from b51c5a5 to 55ab8a4 Compare May 26, 2019 21:06

GiedriusS reviewed May 27, 2019

Martin Chodur and others added 7 commits June 10, 2019 23:04

feat: added /-/healthy and /-/ready endppoints

95b3864

Signed-off-by: Martin Chodur <m.chodur@seznam.cz>

refactor: changed metricHTTPListenGroup to defaultHTTPListener

00209e6

fix: fixed mtx deadlock in prober

6f1c60a

reafactor: changed defaultHTTPListener prober reference to make go ve…

6b5a4be

…t happy

fix: move setting prober ready away from default http listener

12601e9

feat: added prober tests to defaultHTTPListener

e14cb5a

fix prober: fixed race condition when writing response

c5b37f7

FUSAKLA changed the title ~~Added /-/healthy and /-/ready endpoints to all thanos components~~ WIP: Added /-/healthy and /-/ready endpoints to all thanos components Jun 10, 2019

rebased on master

b12f2d6

FUSAKLA force-pushed the fus-add-health-endpoint branch from 0bde2b7 to b12f2d6 Compare June 10, 2019 21:30

FUSAKLA mentioned this pull request Jun 10, 2019

feat: added prober package to generalize readiness probes #1242

Merged

bwplotka removed the state: in-review label Jun 28, 2019

kakkoyun mentioned this pull request Sep 17, 2019

feat store: added readiness and livenes prober #1460

Merged

1 task

bwplotka closed this Sep 17, 2019
