
chaos testing + memberlist clustering changes + govendor -> dep #760

Merged
Dieterbe merged 53 commits into master from chaos-pumba on Dec 11, 2017

Conversation

Dieterbe
Contributor

Needs a bit more work; mostly sharing for anyone who's curious what I'm up to.
See README.md and chaos_test.go for the high-level overview.

@Dieterbe
Contributor Author

Dieterbe commented Nov 23, 2017

Observed problems so far:

  1. Often series will have multiple NaNs at the beginning or end.
  • data lag? -> added lag monitoring
  • for now I'm ignoring a few points at the beginning and end, so we can focus on clustering issues where series are completely incorrect
  2. Timeouts that I have to figure out further:
=== RUN   TestClusterStartup
--- PASS: TestClusterStartup (24.94s)
=== RUN   TestClusterBaseIngestWorkload
--- PASS: TestClusterBaseIngestWorkload (14.40s)
=== RUN   TestQueryWorkload
--- PASS: TestQueryWorkload (60.01s)
=== RUN   TestIsolateOneInstance
--- FAIL: TestIsolateOneInstance (60.00s)
        chaos_test.go:143: expected mt4 to return either correct or erroring responses. got (chaos.checkResults) {
                 Mutex: (sync.Mutex) {
                  state: (int32) 0,
                  sema: (uint32) 0
                 },
                 valid: ([]int) (len=2 cap=2) {
                  (int) 664, // correct responses
                  (int) 532 // 503 responses
                 },
                 empty: (int) 0,
                 timeout: (int) 4,
                 other: (int) 0,
                 firstOther: (*chaos.response)(<nil>)
                }

The test suite detected 4 timeouts; in the mt4 logs I see 2 requests that would have timed out the client.

$ docker logs dockerchaos_metrictank4_1 2>&1 | grep -c 'Completed /render.*503 Service Unavailable in 5.[0-9]*s'
532
$ docker logs dockerchaos_metrictank4_1 2>&1 | grep Completed | rv 'µs' | rv 'ms' | rv 'Completed /render.*503 Service Unavailable in 5\.[0-9]*s' | rv '200 OK.*[0-4]\.[0-9]*s'
[Macaron] 2017-11-23 00:12:43: Completed /render?target=sum(some.id.of.a.metric.*)&format=json&from=-15s 200 OK in 5.017100176s
[Macaron] 2017-11-23 00:13:05: Completed /render?target=sum(some.id.of.a.metric.*)&format=json&from=-15s 503 Service Unavailable in 53.436061928s
[Macaron] 2017-11-23 00:13:05: Completed /render?target=sum(some.id.of.a.metric.*)&format=json&from=-15s 503 Service Unavailable in 53.186765272s

maybe the completed lines were not logged? but:

~/g/s/g/g/m/chaos ❯❯❯ docker logs dockerchaos_metrictank4_1 2>&1 | grep -c 'Started GET /render'
2200
~/g/s/g/g/m/chaos ❯❯❯ docker logs dockerchaos_metrictank4_1 2>&1 | grep -c 'Completed /render'
2200

@Dieterbe
Contributor Author

One thing I've just learned: with the current approach, where we cut off all traffic from mt4 to the others, the others will remove mt4, while mt4 keeps getting messages from the others and refutes them. And while mt4 fails instances after ping timeouts, it doesn't seem to remove them from its list, so after the partition recovers, mt4 just talks to them again and the cluster is healed.
I will need to update this to make the failure work both ways.

@Dieterbe
Contributor Author

Dieterbe commented Nov 29, 2017

I would like to merge this PR, mainly because of these generally useful commits:
0bb76bc
c80a8ed
7955734
7dd7c20

Note:

  • the vendor dir is currently a bit of a mess and scripts/vendor_health.sh complains and breaks the build. I will fix that before merging (obviously)
  • 2 commits have DO NOT MERGE in the message. These are only useful for chaos diagnosis and I'll remove them from the branch that will ultimately be merged

If you're interested in running the chaos testing stuff, I've been doing it by running this command in the metrictank repo root:

(cd docker/docker-chaos && docker-compose down) && (cd chaos && go test -v -timeout 0)

(In my workflow I combine the output of the Go tests with the docker container logs and the included grafana dashboards at http://localhost:3000.)

@@ -49,6 +51,13 @@ func Tracer(tracer opentracing.Tracer) macaron.Handler {
macCtx.MapTo(macCtx.Resp, (*http.ResponseWriter)(nil))

rw := macCtx.Resp.(*TracingResponseWriter)

headers := macCtx.Resp.Header()
spanStr := span.(*jaeger.Span).String()
Member


You can get the traceID with

span.Context().(jaeger.SpanContext).TraceId().String()

This saves you having to encode the span to a string and then split the traceId back out of the span string.
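
For context, a minimal sketch of what that suggestion could look like in the handler above, reusing the `span` and `headers` variables from the diff (the exact method name, TraceId vs TraceID, depends on the jaeger-client-go version, and the "Trace-Id" header name here is just illustrative):

```go
// Sketch only: take the trace id straight from the span context instead of
// rendering the whole span to a string and splitting the trace id back out.
if sc, ok := span.Context().(jaeger.SpanContext); ok {
	headers.Set("Trace-Id", sc.TraceId().String())
}
```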

@Dieterbe changed the title from "WIP chaos testing" to "chaos testing + memberlist clustering changes + govendor -> dep" on Nov 30, 2017
It's like docker-cluster, but with an extra shardgroup,
and runs MT in single tenant mode.
Running graphite in single tenant mode
should set the right org-id header,
but for some reason it doesn't seem to work.
Current versions like 4.6.0 have extra ssl/https checks.
Our graphite has a self-signed ssl cert, but not for the
hostname 'graphitemon' that we give the container, so grafana
would complain.
I tried to use
https://github.com/graphite-project/docker-graphite-statsd instead
(which basically just requires switching the port to 80 and updating
the config files) but I couldn't get any data into it, so I gave up on
that.
Note: `out` is copied from fakemetrics.
The reason we had to do that is that otherwise the Go compiler would
complain about MetricData from the schema package in MT's vendored dir
and the one from fakemetrics not being the same type.
Hopefully we can clean that up later.
Seeing an uninitialized response could be confusing,
so use a pointer, which we then use spew to render nicely (illustrated below).
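
A rough illustration of the point (the names are made up; the real type is chaos.response, and the dump format matches the test output earlier in this thread):

```go
package main

import (
	"fmt"

	"github.com/davecgh/go-spew/spew"
)

// response stands in for chaos.response; illustration only.
type response struct {
	code int
}

func main() {
	// With a pointer, "no unexpected response seen yet" dumps as an explicit
	// nil, e.g. (*main.response)(<nil>), instead of a zero-valued struct that
	// could be mistaken for a real, empty response.
	var firstOther *response
	fmt.Print(spew.Sdump(firstOther))
}
```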
This way we can use the latest grafana which has built-in annotations
(see next commit)
see also a7332fd

The carbon config is the same as the default, minus all the comments, with
just these changes:

-MAX_UPDATES_PER_SECOND = 500
-MAX_CREATES_PER_MINUTE = 50
+MAX_UPDATES_PER_SECOND = 5000
+MAX_CREATES_PER_MINUTE = 5000
-CACHE_WRITE_STRATEGY = sorted
+CACHE_WRITE_STRATEGY = max
These settings are equivalent to default-local but with a 40s
gossip-to-the-dead-time (sketched below).

There are two reasons for this:

* Before this, you could tell via the cluster dashboard that the cluster
was barely fast enough (and sometimes not fast enough at all) to get the
cluster into the desired state, e.g. during the isolation test. Now it is
more reliable, though sometimes slightly too aggressive (sometimes a node
that's supposed to be healthy is split off for a few seconds), but it's
hard to find a balance. Currently, mt4 sees itself as split off for only
about 10s (it should be about 30s),
whereas the others see mt4 as split off for 25s. Clearly this could use
more tweaking.

* The gossip-to-the-dead-time allows the cluster to reconverge to
the full healthy cluster when the network is healthy again.
Previously, we didn't wait long enough and the cluster would stay split
forever after the testIsolate test.
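
For reference, a minimal sketch (not metrictank's actual code) of what "default-local plus a 40s gossip-to-the-dead-time" corresponds to in terms of the underlying hashicorp/memberlist config:

```go
package main

import (
	"time"

	"github.com/hashicorp/memberlist"
)

func main() {
	// Start from the local-network preset and only raise the window during
	// which we keep gossiping to nodes considered dead, so that after a
	// partition heals the isolated node hears about the cluster again and
	// the cluster can reconverge.
	cfg := memberlist.DefaultLocalConfig()
	cfg.GossipToTheDeadTime = 40 * time.Second
	_ = cfg // would then be passed to memberlist.Create(cfg)
}
```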
Note: dep ensure adds a bunch of stuff that govendor used to filter out:
* test files
* main files
* non-Go files
* root and sub packages when we only need 1 specific package in a tree
* other dependencies of root/sub packages that we didn't need in the
first place

but I then ran `dep prune`, which cleans much of it up again.

vendor updates:

package                            used by                     comments
github.com/golang/snappy           cmd/mt-replicator-via-tsdb
github.com/google/go-querystring   MT clustering               minor
github.com/hailocab/go-hostpool    cassandra
github.com/uber/jaeger-client-go   tracing                     minor
gopkg.in/raintank/schema.v0        everything                  minor v1.5
* fix two cases where Init failed but err=nil was reported and
  processing continued erroneously
* return annotated errors instead of bare ones (see the sketch below)
* returned errors will be printed by the caller, so they don't need to be
  printed twice, and especially not as a warning when it's actually an error
* typo fix
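
A hypothetical sketch of the pattern those fixes move towards (initStore and the other names are made up; this is not metrictank's actual Init code):

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// initStore stands in for some subsystem initialization.
func initStore() error { return errors.New("connection refused") }

// Init returns an annotated error instead of logging it and continuing with
// err=nil; the caller decides how, and how often, to report it.
func Init() error {
	if err := initStore(); err != nil {
		return fmt.Errorf("failed to init store: %s", err)
	}
	return nil
}

func main() {
	if err := Init(); err != nil {
		log.Fatal(err) // printed exactly once, as an error, by the caller
	}
}
```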
@Dieterbe
Contributor Author

Dieterbe commented Dec 1, 2017

how does this look now? @woodsaj @replay. mergey mergey?

@Dieterbe Dieterbe requested a review from replay December 4, 2017 14:17
@Dieterbe
Contributor Author

@DanCech mentioned another issue: bind-address is needlessly validated when it's not set, or something like that

@Dieterbe Dieterbe merged commit c2d2647 into master Dec 11, 2017
}

var err error
swimBindAddr, err = net.ResolveTCPAddr("tcp", swimBindAddrStr)
Contributor


This is still being tested in every case, even though it's only used for the manual config; we should either only test it for the manual config or use it for every config type.
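
A hypothetical sketch of the first option (resolve only for the manual config), reusing the variables from the diff above; `swimUseConfig` and the error handling are illustrative, not the real setting or code:

```go
// Sketch only: gate the resolution on the chosen preset so the other presets
// don't fail on a bind address they never use.
if swimUseConfig == "manual" {
	var err error
	swimBindAddr, err = net.ResolveTCPAddr("tcp", swimBindAddrStr)
	if err != nil {
		log.Fatalf("cluster: invalid swim bind address %q: %s", swimBindAddrStr, err)
	}
}
```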

Contributor Author


fixed in b49c5e8

@Dieterbe Dieterbe deleted the chaos-pumba branch December 15, 2017 19:48