This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

replace statsd[aemon] metrics with built-in metrics library #384

Closed
wants to merge 36 commits into master from builtin-metrics

Conversation


@Dieterbe Dieterbe commented Nov 13, 2016

Just adding my new prototype library.
I didn't do any work on integrating it into metrictank yet, because the code base needs to stabilize first (e.g. after merging at least the http refactor, and potentially also clustering).

buf = WriteUint32(buf, l.prefix, []byte("p75"), r.P75/1000, now)
buf = WriteUint32(buf, l.prefix, []byte("p90"), r.P90/1000, now)
buf = WriteUint32(buf, l.prefix, []byte("max"), r.Max/1000, now)
buf = WriteUint32(buf, l.prefix, []byte("count"), r.Count/1000, now)
Member:

I don't think you should be dividing count by 1000.


Dieterbe commented Nov 27, 2016

Now rebased on top of clustering. I made some good progress, but there's still a bunch of things to switch over.
Also added another artisanalhistogram type for "slower" things (500ms ~ 12h), and a plain histogram type for non-timing things.

@Dieterbe Dieterbe self-assigned this Dec 15, 2016
@Dieterbe Dieterbe added this to the hosted-metrics-alpha milestone Dec 15, 2016
@Dieterbe Dieterbe force-pushed the builtin-metrics branch 7 times, most recently from 6428ed6 to a7aabea on December 22, 2016 09:21

Dieterbe commented Dec 22, 2016

Status:

  • todo: more testing, especially memory usage and under an http workload (which means I will work on the new index dump tool, because I need that to feed vegeta)
  • todo: add provision to clear the stats registry in between unit tests (see build failures)
  • todo: update metric descriptions, run metrics2docs, update ops guide
  • that said, ready for reviewing again

Some interesting notes (please also read the commit messages for more details):

  • under a workload of switching from 80k ingest to 180k (i.e. lots of index writes), as well as under a 180k steady workload using the carbon input, stats.(*Meter32).ValueUint32 shows up taking 0.5% cpu; that's the only stats function in the profile
  • I decided to put the metric type at the end of each metric. The types are named after http://metrics20.org/spec/#tag-values-mtype but also include the bit size (note gauge1 for booleans; this may be a bit too cryptic, but bool is not a metrics2.0 mtype). Contrast this with statsd, which puts the type very early in the tree. I like this approach more because it allows grouping related measurements together, makes it easier to navigate the tree (with statsd you often don't know what type a metric is and have to search through multiple subtrees), and it makes it clearer what a metric means exactly (e.g. a counter has to be derived to get the rate). See 25f9c34 for more thoughts on this.
  • right now a bunch of the values are reported as counters, which I like, but it does mean they need to be derived (using perSecond()), which is a slight extra hassle. We could consider reporting them as rates instead.
  • several of the metrics used to be in arbitrary places; now everything is more neatly organized in sections (api, cassandra, idx, input, ...). See https://gist.github.com/Dieterbe/2855e2a8d534ac7021fcb9d1d269b937 for a current dump (note tank, which covers metrics about the in-memory storage). Some more changes I'd like to make:
    • notifier -> cluster.notifier
    • cassandra -> store.cassandra (contrast to idx.cassandra)
    • longer term we can also refine this more, and add per http-path latencies under api in a more structured way.
  • the included dashboard.json has been updated and I verified all metrics in all panels; everything should be good. Later it may make more sense to remodel the dashboard a bit and do rows per "thing" (e.g. a row for global stats, a row for tank metrics, a row per input, etc). We have this already more or less, but it could be clearer and more explicit.
  • while we use basic histograms for things like chunk sizes, for latencies that we know should be within specific ranges (e.g. 1ms ~ 15s) I'm trying out a "human friendly" histogram class (see https://github.com/Dieterbe/artisanalhistogram). Once grafana can display heatmaps, this will provide class boundaries at human-friendly (even) numbers. I still need to do the http tests to see if it works, though.
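To illustrate the type-suffix naming convention described above (the prefix and metric names below are hypothetical examples, not the actual registry code), a metric path might be assembled like this, with the mtype-plus-bit-size always at the end:

```go
package main

import "fmt"

// buildPath sketches the naming scheme discussed above: the mtype
// (including bit size) goes at the *end* of the metric path, after
// the section-organized measurement name. Names are illustrative only.
func buildPath(prefix, name, mtype string) string {
	return prefix + "." + name + "." + mtype
}

func main() {
	prefix := "metrictank.stats.default.default"
	// a counter still has to be derived (e.g. with perSecond()) to get a rate
	fmt.Println(buildPath(prefix, "input.carbon.metrics_received", "counter32"))
	fmt.Println(buildPath(prefix, "api.request_handle.latency.max", "gauge32"))
}
```

Because the type is the last node, related measurements sit together in the tree and the suffix documents how the value should be interpreted.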

If you want to run it and see for yourself, it's easy:

make bin
cd docker/docker-dev-custom-cfg-kafka
docker-compose down
docker-compose up --force-recreate
(.. wait until everything is up)
../extra/populate-grafana.sh dev-custom-cfg-kafka  # this installs the datasource and dashboard
Then go to localhost:3000 in your browser, load the metrictank dashboard, and change the datasource to metrictank.

log.Fatal("invalid output")
}

// from should either a unix timestamp, or a specification that graphite/metrictank will recognize.
@replay replay Dec 22, 2016

either be a

@Dieterbe Dieterbe force-pushed the builtin-metrics branch 4 times, most recently from 46daacc to f6b9b31 on December 27, 2016 20:13
@Dieterbe

Running ./build/mt-index-cat -from=60min cass -keyspace raintank vegeta-render | vegeta attack -rate 1000 -duration 2m under a 50kHz ingest, metrictank.stats is about 0.5% of cpu.

as for memory used, all memory of the stats library (under ingest +http workload) is due to the graphite queue, which can be sized as desired.

@Dieterbe Dieterbe force-pushed the builtin-metrics branch 2 times, most recently from 8f86076 to 975c33b on December 27, 2016 22:56
@Dieterbe

@woodsaj @replay I just have to fix the end2end test; otherwise I think this is in good shape for merging. Time for reviewing :)

@Dieterbe Dieterbe changed the title from "WIP replace statsd[aemon] metrics with built-in metrics library" to "replace statsd[aemon] metrics with built-in metrics library" Dec 28, 2016
we just update it every time we add or remove metrics.
Technically we only need the value once every second,
but the previous approach would do it at a time not matching
when we send metrics. We could synchronize them via a callback,
but it would be a little too messy.

we can do this about 200M/second so we're good.

package mdata

import (
	"sync/atomic"
	"testing"
)

var V uint32

func Benchmark_Uint32(b *testing.B) {
	var v uint32

	for i := 0; i < b.N; i++ {
		atomic.StoreUint32(&v, uint32(i))
	}
	V = v
}

go test -run='^$' -bench=Uint
Benchmark_Uint32-8   	300000000	         5.14 ns/op
PASS
ok  	github.com/raintank/metrictank/mdata	2.068s
the filled areas were a bit too much; the lines are tighter.
also null=connected, because this overflows past ~4 billion very regularly
most times you import the dash (docker stack, fresh install) you want to zoom in
on recent stats. That seems more common than importing a dash for an already
existing setup.
@Dieterbe

@replay what do you think of the new graphite conn routine? I haven't subjected it to the tests outlined in #384 (comment), but will if the code looks good to you.
also what do you think of 767502e ?

c.msgsConnections.Set(s.Connections)
}
}()
return &c, err
@replay replay Dec 30, 2016

If there was an err at nsq.NewConsumer(), wouldn't it be better to return right there, without creating a goroutine that's going to fail once a second?

@Dieterbe:

yeah

func NewErrMetrics(component string) ErrMetrics {
return ErrMetrics{

// metric idx.cassandra.error.timeout is a counter of timeouts seen to the cassandra idx
Contributor:
seems outdated, no?

@Dieterbe:

We cleared this up on slack. Basically https://github.com/Dieterbe/metrics2docs is a bit simplistic: it can't parse the code very well, and there are 2 places where we dynamically construct these metrics:

mdata/store_cassandra.go
68:    errmetrics = cassandra.NewErrMetrics("store.cassandra")

idx/cassandra/cassandra.go
48:    errmetrics           = cassandra.NewErrMetrics("idx.cassandra")

We need to add two comments for each metric here, and there needs to be an empty line between them, because otherwise metrics2docs fails.

## misc ##

# instance identifier. must be unique. used in clustering messages, for naming queue consumers and emitted metrics.
instance = defaulti
Contributor:
not sure, but is that a typo?

# The default matches what the Grafana dashboard expects
# $instance will be replaced with the `instance` setting.
# note, the 3rd word describes the environment you deployed in.
prefix = metrictank.stats.defaulte.$instance
Contributor:

is defaulte another typo, or do I just totally not get what you did there?

@Dieterbe:
This was because I wanted to make sure the environment and instance showed up in the right place (default.default was a bit ambiguous). But I can undo that now and set both to default.

}

func (g *Gauge64) Dec() {
atomic.AddUint64(&g.val, ^uint64(0))
Contributor:
👍
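The snippet above relies on unsigned wraparound: sync/atomic has no SubUint64, so a decrement is expressed by adding ^uint64(0) (all bits set, i.e. 2^64-1), which wraps around to subtracting one. A minimal standalone sketch of the same trick (dec is an illustrative helper, not the Gauge64 code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// dec atomically decrements an unsigned counter by adding ^uint64(0),
// i.e. 2^64-1: unsigned addition wraps modulo 2^64, so adding
// "all ones" is the same as subtracting one.
func dec(val *uint64) {
	atomic.AddUint64(val, ^uint64(0))
}

func main() {
	var v uint64 = 5
	dec(&v)
	fmt.Println(v) // 4
}
```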

m.max = val
}
c := m.hist[bin] + 1
m.hist[bin] = c
Contributor:
couldn't this be m.hist[bin]++? c does not seem needed

for _, k := range keys {
key := uint32(k)
runningcount += m.hist[key]
runningsum += uint64(m.hist[key]) * uint64(key)
Contributor:
why does that get multiplied with uint64(key)? maybe a comment would be good

@Dieterbe:
It's the running sum, so it has to multiply each bucket's value (the key) by the number of samples that fell into that bucket (the count stored at that key).
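A minimal sketch of that accumulation (the bin layout below is a made-up example, not the real histogram): each key is a bin's value and hist[key] is the number of samples in that bin, so each bin contributes count × value to the total:

```go
package main

import "fmt"

// summarize walks the histogram bins in order and accumulates the
// total sample count and the running sum of observed values:
// each bin contributes (number of samples) * (bin value).
func summarize(hist map[uint32]uint32, keys []uint32) (count uint32, sum uint64) {
	for _, key := range keys {
		count += hist[key]
		sum += uint64(hist[key]) * uint64(key)
	}
	return count, sum
}

func main() {
	// hypothetical bins: 3 samples of value 1000, 2 samples of value 2000
	hist := map[uint32]uint32{1000: 3, 2000: 2}
	count, sum := summarize(hist, []uint32{1000, 2000})
	fmt.Println(count, sum) // 5 7000
}
```

From count and sum one can then derive e.g. the mean (sum/count) when reporting.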

for {
select {
case <-connectTicker:
conn = assureConn()
Contributor:
Sorry that I keep going on about this :) but what's the purpose of still having this connectTicker here?

  • when a new message gets read from g.toGraphite ok will be initialized to false on line 115, so the conn will be created by assureConn() on 117
  • if during writing the connection fails then err will be set to != nil on line 119, so ok never gets set to true, so it will loop on the for at 116 and reconnect at 117

Why still have the ticker then?

@Dieterbe:

Ha, you're right again, I think.
Let me do another round of simplifying :)
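A sketch of the ticker-less approach this thread converges on (conn, flakyConn, and writeWithRetry are simplified stand-ins, not metrictank's actual graphite writer): a failed write itself triggers the redial, so no periodic connect ticker is needed:

```go
package main

import (
	"errors"
	"fmt"
)

// hypothetical connection interface standing in for a TCP conn to graphite
type conn interface {
	Write(p []byte) error
}

// flakyConn fails its first N writes, then succeeds, to simulate
// a connection that drops and comes back.
type flakyConn struct{ failures int }

func (c *flakyConn) Write(p []byte) error {
	if c.failures > 0 {
		c.failures--
		return errors.New("write failed")
	}
	return nil
}

// writeWithRetry keeps (re)dialing and retrying the same buffer until
// the write succeeds: the failed write is what triggers the redial,
// so no ticker is needed. Returns the number of attempts made.
func writeWithRetry(dial func() conn, buf []byte) int {
	attempts := 0
	for {
		attempts++
		c := dial()
		if err := c.Write(buf); err == nil {
			return attempts
		}
	}
}

func main() {
	c := &flakyConn{failures: 2}
	attempts := writeWithRetry(func() conn { return c }, []byte("metric 1 2"))
	fmt.Println(attempts) // 3
}
```

A real implementation would of course also back off between redials rather than spin.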

func (r *Registry) add(name string, metric GraphiteMetric) GraphiteMetric {
r.Lock()
if _, ok := r.metrics[name]; ok {
panic(fmt.Sprintf(errFmtMetricExists, name))
Member:

I really don't think this is an unrecoverable error. We should return an error instead.

Perhaps it would be better to use an addOrGet() method that either returns an existing graphiteMetric with the specified name or creates the new one.

@Dieterbe:

I think AddOrGet is a good idea.
However, the pre-existing metric may be of a different type, e.g. trying to register a Bool when there's already a Meter32 or something. There's not much useful we can do in that scenario (we could keep running without the instrumentation, which I think is pretty dangerous, and we'd have to complicate the code to support a no-op graphite metric type). So if the pre-existing one is of a different type, I think a panic is still warranted.

if pre-existing is the same type, then we can totally just return that.

Member:
sounds good.
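A sketch of the addOrGet idea agreed on above (the Registry shape, Counter32, and the method are hypothetical illustrations, not the actual metrictank code): return the existing metric when the name and type match, and panic only on the genuine programming error of reusing a name with a different type:

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
)

// Registry is a simplified stand-in for the stats registry.
type Registry struct {
	sync.Mutex
	metrics map[string]interface{}
}

// addOrGet returns the already-registered metric if one exists under
// this name with the same type; registers and returns the new one
// otherwise. A name reused with a different type is a programming
// error, so that case still panics.
func (r *Registry) addOrGet(name string, metric interface{}) interface{} {
	r.Lock()
	defer r.Unlock()
	if existing, ok := r.metrics[name]; ok {
		if reflect.TypeOf(existing) == reflect.TypeOf(metric) {
			return existing
		}
		panic(fmt.Sprintf("metric %q already registered with a different type", name))
	}
	r.metrics[name] = metric
	return metric
}

// Counter32 is an illustrative metric type.
type Counter32 struct{ val uint32 }

func main() {
	r := &Registry{metrics: make(map[string]interface{})}
	a := r.addOrGet("input.carbon.metrics_received", &Counter32{})
	b := r.addOrGet("input.carbon.metrics_received", &Counter32{})
	fmt.Println(a == b) // true: second call returned the existing metric
}
```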


Dieterbe commented Jan 3, 2017

Made the change woodsaj requested, tested the new loop using the same experiments as #384 (comment), and it all still works as before.

merged to master from commandline.

@Dieterbe Dieterbe closed this Jan 3, 2017
@Dieterbe Dieterbe deleted the builtin-metrics branch September 18, 2018 09:03