chaos testing + memberlist clustering changes + govendor -> dep #760
1f3a319 to a057440
observed problems so far:
the test suite detected 4 timeouts, but in the mt4 logs I see only 2 requests that would have timed out the client.
maybe the completed lines were not logged? but:
one thing I've just learned: with the current approach, where we cut off all traffic from mt4 to the others, the others will remove mt4, while mt4 keeps getting messages from the others but refutes them. And while mt4 fails instances after ping timeouts, it doesn't seem to remove them from its list. So after the partition recovers, mt4 just talks to them again and the cluster is healed.
fcf443c to 3faea9e
3faea9e to 95c8571
I would like to merge this PR, mainly because of these particular generally useful commits. Note:
If you're interested in running the chaos testing stuff, I've been doing it by running this command in the metrictank repo root:
(in my workflow I combine the output of the Go tests with the docker container logs and the included grafana dashboards at http://localhost:3000)
api/middleware/tracer.go
```
@@ -49,6 +51,13 @@ func Tracer(tracer opentracing.Tracer) macaron.Handler {
	macCtx.MapTo(macCtx.Resp, (*http.ResponseWriter)(nil))

	rw := macCtx.Resp.(*TracingResponseWriter)

	headers := macCtx.Resp.Header()
	spanStr := span.(*jaeger.Span).String()
```
You can get the traceID with `span.Context().(jaeger.SpanContext).TraceId().String()`.
This saves you having to encode the span to a string and then split the traceID back out of the span string.
it's like docker-cluster, but with an extra shardgroup, and runs MT in single tenant mode. Running graphite in single tenant mode should set the right org-id header, but for some reason it doesn't seem to work.
current versions like 4.6.0 have extra ssl/https checks. Our graphite has a self-signed ssl cert, but not for the hostname 'graphitemon' that we give the container, so grafana would complain. I tried to use https://github.com/graphite-project/docker-graphite-statsd instead (which basically just requires switching the port to 80 and updating the config files), but I couldn't get any data into it, so I gave up on that.
note: `out` is copied from fakemetrics. The reason we had to do that is that otherwise the Go compiler would complain that MetricData from the schema package in MT's vendored dir and that of fakemetrics are not the same type. Hopefully we can clean that up later.
and remove redundant checks
seeing an uninitialized response could be confusing, so use a pointer, which we use spew to show nicely
This way we can use the latest grafana, which has built-in annotations (see next commit). See also a7332fd. The carbon config is the same as the default, minus all the comments, plus these changes:

```
-MAX_UPDATES_PER_SECOND = 500
+MAX_UPDATES_PER_SECOND = 5000
-MAX_CREATES_PER_MINUTE = 50
+MAX_CREATES_PER_MINUTE = 5000
-CACHE_WRITE_STRATEGY = sorted
+CACHE_WRITE_STRATEGY = max
```
these settings are equivalent to default-local, but with a 40s gossip-to-the-dead-time. There are 2 reasons for this:

* before this, you could tell via the cluster dashboard that the cluster was barely fast enough (and sometimes not fast enough at all) to get into the desired state, e.g. during the isolation test. Now it is more reliable, though sometimes slightly too aggressive (sometimes a node that's supposed to be healthy is split off for a few seconds), but it's hard to find a balance. Currently, mt4 feels split off for only about 10s (should be about 30s), whereas the others see mt4 split off for 25s. Clearly this could use more tweaking.
* the gossip-to-the-dead-time allows the cluster to reconverge to the full healthy cluster when the network is healthy again. Previously, we didn't wait long enough and the cluster would stay split forever after the testIsolate test.
note: `dep ensure` adds a bunch of stuff that govendor used to filter out:

* test files
* main files
* non-go files
* root and sub packages when we only need 1 specific package in a tree
* other dependencies of root/sub packages that we didn't need in the first place

But I then ran `dep prune`, which cleans much of it up again. Vendor updates:

| package | used by | comments |
|---|---|---|
| vendor/github.com/golang/snappy | cmd/mt-replicator-via-tsdb | |
| google/go-querystring | MT clustering | minor |
| github.com/hailocab/go-hostpool | cassandra | |
| github.com/uber/jaeger-client-go | tracing | minor |
| gopkg.in/raintank/schema.v0 | everything | minor v1.5 |
* fix two cases where Init failed but err=nil was reported and processing continued erroneously
* return annotated errors instead of bare ones
* returned errors will be printed by the caller, so they don't need to be printed twice, especially not as a warning when it's an error
* typo fix
they are not reliable yet
even when it's no longer used....
because sometimes it's not enough
@DanCech mentioned another issue: bind-address is needlessly validated when it's not set, or something like that
because these fatal errors may happen any time after the test function has completed
the defaults of 30s are too long
```
}

	var err error
	swimBindAddr, err = net.ResolveTCPAddr("tcp", swimBindAddrStr)
```
This is still being tested in every case, though it's only used for manual config; we should either only test it for manual config, or use it for every config type.
fixed in b49c5e8
needs a bit more work; mostly sharing for anyone who's curious what I'm up to.
see README.md and chaos_test.go for the high-level overview