
chaos testing + memberlist clustering changes + govendor -> dep #760

Merged
Dieterbe merged 53 commits into master from chaos-pumba on Dec 11, 2017

Conversation

Dieterbe
Contributor

Needs a bit more work; mostly sharing for anyone who's curious what I'm up to.
See README.md and chaos_test.go for the high-level overview.

@Dieterbe
Contributor Author

Dieterbe commented Nov 23, 2017

Observed problems so far:

  1. Often series will have multiple NaNs at the beginning or end.
  • data lag? -> added lag monitoring
  • for now I'm ignoring a few points at the beginning and end, so we can focus on clustering issues where series are completely incorrect
  2. Timeouts that I have to figure out further:
=== RUN   TestClusterStartup
--- PASS: TestClusterStartup (24.94s)
=== RUN   TestClusterBaseIngestWorkload
--- PASS: TestClusterBaseIngestWorkload (14.40s)
=== RUN   TestQueryWorkload
--- PASS: TestQueryWorkload (60.01s)
=== RUN   TestIsolateOneInstance
--- FAIL: TestIsolateOneInstance (60.00s)
        chaos_test.go:143: expected mt4 to return either correct or erroring responses. got (chaos.checkResults) {
                 Mutex: (sync.Mutex) {
                  state: (int32) 0,
                  sema: (uint32) 0
                 },
                 valid: ([]int) (len=2 cap=2) {
                  (int) 664, // correct responses
                  (int) 532 // 503 responses
                 },
                 empty: (int) 0,
                 timeout: (int) 4,
                 other: (int) 0,
                 firstOther: (*chaos.response)(<nil>)
                }

The test suite detected 4 timeouts; in the mt4 logs I see 2 requests that would have timed out the client.

$ docker logs dockerchaos_metrictank4_1 2>&1 | grep -c 'Completed /render.*503 Service Unavailable in 5.[0-9]*s'
532
$ docker logs dockerchaos_metrictank4_1 2>&1 | grep Completed | rv 'µs' | rv 'ms' | rv 'Completed /render.*503 Service Unavailable in 5\.[0-9]*s' | rv '200 OK.*[0-4]\.[0-9]*s'
[Macaron] 2017-11-23 00:12:43: Completed /render?target=sum(some.id.of.a.metric.*)&format=json&from=-15s 200 OK in 5.017100176s
[Macaron] 2017-11-23 00:13:05: Completed /render?target=sum(some.id.of.a.metric.*)&format=json&from=-15s 503 Service Unavailable in 53.436061928s
[Macaron] 2017-11-23 00:13:05: Completed /render?target=sum(some.id.of.a.metric.*)&format=json&from=-15s 503 Service Unavailable in 53.186765272s

maybe the completed lines were not logged? but:

~/g/s/g/g/m/chaos ❯❯❯ docker logs dockerchaos_metrictank4_1 2>&1 | grep -c 'Started GET /render'
2200
~/g/s/g/g/m/chaos ❯❯❯ docker logs dockerchaos_metrictank4_1 2>&1 | grep -c 'Completed /render'
2200

@Dieterbe
Contributor Author

One thing I've just learned: with the current approach, where we cut off all traffic from mt4 to the others, the others will remove mt4, while mt4 keeps getting messages from the others and refutes them. And while mt4 fails instances after ping timeouts, it doesn't seem to remove them from its list, so after the partition recovers, mt4 just talks to them again and the cluster is healed.
I will need to update this to make the failure work both ways.

@Dieterbe
Contributor Author

Dieterbe commented Nov 29, 2017

I would like to merge this PR, mainly because of these generally useful commits:
0bb76bc
c80a8ed
7955734
7dd7c20

Note:

  • the vendor dir is currently a bit of a mess and scripts/vendor_health.sh complains and breaks the build. I will fix that before merging (obviously)
  • 2 commits have DO NOT MERGE in the message. These are only useful for chaos diagnosis and I'll remove them from the branch that will ultimately be merged

If you're interested in running the chaos testing stuff, I've been doing it by running this command in the metrictank repo root:

(cd docker/docker-chaos && docker-compose down) && (cd chaos && go test -v -timeout 0)

(In my workflow I combine the output of the Go tests with the docker container logs and the included grafana dashboards at http://localhost:3000.)

@@ -49,6 +51,13 @@ func Tracer(tracer opentracing.Tracer) macaron.Handler {
macCtx.MapTo(macCtx.Resp, (*http.ResponseWriter)(nil))

rw := macCtx.Resp.(*TracingResponseWriter)

headers := macCtx.Resp.Header()
spanStr := span.(*jaeger.Span).String()
Member


You can get the traceID with

span.Context().(jaeger.SpanContext).TraceId().String()

This saves you having to encode the span to a string and then split the traceId back out of the span string.
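
For context, a minimal sketch of what that suggestion could look like in the handler above, reusing the `span` and `headers` variables from the diff (the exact method name, TraceId vs TraceID, depends on the jaeger-client-go version, and the "Trace-Id" header name here is just illustrative):

```go
// Sketch only: take the trace id straight from the span context instead of
// rendering the whole span to a string and splitting the trace id back out.
if sc, ok := span.Context().(jaeger.SpanContext); ok {
	headers.Set("Trace-Id", sc.TraceId().String())
}
```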

@Dieterbe changed the title from "WIP chaos testing" to "chaos testing + memberlist clustering changes + govendor -> dep" on Nov 30, 2017
It's like docker-cluster, but with an extra shardgroup,
and runs MT in single tenant mode.
Running graphite in single tenant mode
should set the right org-id header,
but for some reason it doesn't seem to work.
Current versions like 4.6.0 have extra ssl/https checks.
Our graphite has a self-signed ssl cert, but not for the
hostname 'graphitemon' that we give the container, so grafana
would complain.
I tried to use
https://github.com/graphite-project/docker-graphite-statsd instead
(which basically just requires switching the port to 80 and updating
the config files) but I couldn't get any data into it, so I gave up on
that.
Note: `out` is copied from fakemetrics.
The reason we had to do that is that otherwise the Go compiler would
complain about MetricData from the schema package in MT's vendored dir
and the one from fakemetrics not being the same type.
Hopefully we can clean that up later.
Seeing an uninitialized response could be confusing,
so use a pointer, which we then use spew to render nicely (illustrated below).
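
A rough illustration of the point (the names are made up; the real type is chaos.response, and the dump format matches the test output earlier in this thread):

```go
package main

import (
	"fmt"

	"github.com/davecgh/go-spew/spew"
)

// response stands in for chaos.response; illustration only.
type response struct {
	code int
}

func main() {
	// With a pointer, "no unexpected response seen yet" dumps as an explicit
	// nil, e.g. (*main.response)(<nil>), instead of a zero-valued struct that
	// could be mistaken for a real, empty response.
	var firstOther *response
	fmt.Print(spew.Sdump(firstOther))
}
```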
This way we can use the latest grafana which has built-in annotations
(see next commit)
see also a7332fd

The carbon config is the same as the default, minus all the comments, with
just these changes:

-MAX_UPDATES_PER_SECOND = 500
-MAX_CREATES_PER_MINUTE = 50
+MAX_UPDATES_PER_SECOND = 5000
+MAX_CREATES_PER_MINUTE = 5000
-CACHE_WRITE_STRATEGY = sorted
+CACHE_WRITE_STRATEGY = max
These settings are equivalent to default-local but with a 40s
gossip-to-the-dead-time (sketched below).

There are two reasons for this:

* Before this, you could tell via the cluster dashboard that the cluster
was barely fast enough (and sometimes not fast enough at all) to get the
cluster into the desired state, e.g. during the isolation test. Now it is
more reliable, though sometimes slightly too aggressive (sometimes a node
that's supposed to be healthy is split off for a few seconds), but it's
hard to find a balance. Currently, mt4 sees itself as split off for only
about 10s (it should be about 30s),
whereas the others see mt4 as split off for 25s. Clearly this could use
more tweaking.

* The gossip-to-the-dead-time allows the cluster to reconverge to
the full healthy cluster when the network is healthy again.
Previously, we didn't wait long enough and the cluster would stay split
forever after the testIsolate test.
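
For reference, a minimal sketch (not metrictank's actual code) of what "default-local plus a 40s gossip-to-the-dead-time" corresponds to in terms of the underlying hashicorp/memberlist config:

```go
package main

import (
	"time"

	"github.com/hashicorp/memberlist"
)

func main() {
	// Start from the local-network preset and only raise the window during
	// which we keep gossiping to nodes considered dead, so that after a
	// partition heals the isolated node hears about the cluster again and
	// the cluster can reconverge.
	cfg := memberlist.DefaultLocalConfig()
	cfg.GossipToTheDeadTime = 40 * time.Second
	_ = cfg // would then be passed to memberlist.Create(cfg)
}
```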
Note: dep ensure adds a bunch of stuff that govendor used to filter out:
* test files
* main files
* non-Go files
* root and sub packages when we only need 1 specific package in a tree
* other dependencies of root/sub packages that we didn't need in the
first place

but I then ran `dep prune`, which cleans much of it up again.

vendor updates:

package                            used by                     comments
github.com/golang/snappy           cmd/mt-replicator-via-tsdb
github.com/google/go-querystring   MT clustering               minor
github.com/hailocab/go-hostpool    cassandra
github.com/uber/jaeger-client-go   tracing                     minor
gopkg.in/raintank/schema.v0        everything                  minor v1.5
* fix two cases where Init failed but err=nil was reported and
  processing continued erroneously
* return annotated errors instead of bare ones (see the sketch below)
* returned errors will be printed by the caller, so they don't need to be
  printed twice, and especially not as a warning when it's actually an error
* typo fix
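
A hypothetical sketch of the pattern those fixes move towards (initStore and the other names are made up; this is not metrictank's actual Init code):

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// initStore stands in for some subsystem initialization.
func initStore() error { return errors.New("connection refused") }

// Init returns an annotated error instead of logging it and continuing with
// err=nil; the caller decides how, and how often, to report it.
func Init() error {
	if err := initStore(); err != nil {
		return fmt.Errorf("failed to init store: %s", err)
	}
	return nil
}

func main() {
	if err := Init(); err != nil {
		log.Fatal(err) // printed exactly once, as an error, by the caller
	}
}
```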
@Dieterbe
Contributor Author

Dieterbe commented Dec 1, 2017

how does this look now? @woodsaj @replay. mergey mergey?

@Dieterbe Dieterbe requested a review from replay December 4, 2017 14:17
@Dieterbe
Contributor Author

@DanCech mentioned another issue: bind-address is needlessly validated when it's not set, or something like that

@Dieterbe Dieterbe merged commit c2d2647 into master Dec 11, 2017
}

var err error
swimBindAddr, err = net.ResolveTCPAddr("tcp", swimBindAddrStr)
Contributor


This is still being tested in every case, even though it's only used for the manual config; we should either only test it for the manual config or use it for every config type.
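
A hypothetical sketch of the first option (resolve only for the manual config), reusing the variables from the diff above; `swimUseConfig` and the error handling are illustrative, not the real setting or code:

```go
// Sketch only: gate the resolution on the chosen preset so the other presets
// don't fail on a bind address they never use.
if swimUseConfig == "manual" {
	var err error
	swimBindAddr, err = net.ResolveTCPAddr("tcp", swimBindAddrStr)
	if err != nil {
		log.Fatalf("cluster: invalid swim bind address %q: %s", swimBindAddrStr, err)
	}
}
```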

Contributor Author


fixed in b49c5e8

@Dieterbe Dieterbe deleted the chaos-pumba branch December 15, 2017 19:48