Idx metrics and fixes #504

Dieterbe · 2017-01-26T17:01:15Z

No description provided.

Dieterbe · 2017-01-26T17:10:36Z

need to some testing and update the dashboard, but hardest work was designing the metrics tree, trying to keep things consistent across the different indexes, with their different implementations (memory with direct realtime operations vs cass with sometimes-delayed queries vs ES with its bulkindexer)

another thought I have is make it so that when you use cassIdx or EsIdx, you don't get the metrics with idx.memory prefix since there's lots of redundancy between cass/es and memory index, except for Get/Find/List, need to think about this a bit more if it makes sense and can be done elegantly.

TODO also: remove ES index.

Dieterbe · 2017-01-27T12:38:06Z

ready for review.

replay · 2017-01-27T13:26:43Z

The file docker/extra/dashboards/monitoring-stack.json seems to still have elastic search stuff

replay · 2017-01-27T13:29:19Z

docs/metrics.md

 * `idx.cassandra.delete`:  
-the duration of deletions from the cassandra idx
+the duration of a delete of one or metrics from the cassandra idx, including the delete from the in-memory index and the delete query


one or more

replay · 2017-01-27T13:30:26Z

docs/metrics.md

 * `idx.memory.add`:  
-the duration of (successfull) memory idx additions
+the duration of a (successfull) add of a metric to the memory idx


successful with one l

replay · 2017-01-27T13:31:16Z

docs/metrics.md

+* `idx.memory.prune`:  
+the duration of successful memory idx prunes
+* `idx.memory.update`:  
+the duration of (successfull) update of a metric to the memory idx


successful with one l

replay · 2017-01-27T13:34:55Z

idx/cassandra/cassandra.go

+	// metric idx.cassandra.query-delete.exec is time spent executing deletes (possibly repeatedly until success)
+	statQueryDeleteExecDuration = stats.NewLatencyHistogram15s32("idx.cassandra.query-delete.exec")
+
+	// metric idx.cassandra.add is the duration of an add of one metric to the cassandra idx, including the add to the in-memory index, excluding the insert query


that seems a little confusing: it is the duration of an add of a metric to the cassandra idx, excluding the insert query? what is it then, just the add to the in-memory index?

you're right. it is confusing. the "problem" is the index uses an async worker that executes the queries in the background. so this metric covers add to-in-memory as well as generating the metricdefinition and putting it on the write queue.

in general these durations all cover the "foreground" stuff, and exclude the "background" stuff. perhaps in docs/metadata.md we should clarify what is foreground and background, instead of trying to stuff everything in these docs.

replay · 2017-01-27T13:41:44Z

idx/cassandra/cassandra.go

+					c.writeQueue <- writeReq{recvTime: time.Now(), def: def}
+				}
+				statUpdateDuration.Value(time.Since(pre))
+				return err
 			}
 			return nil
 		} else {


this else block seems unnecessary because the previous if block always returns.

replay · 2017-01-27T14:07:03Z

idx/memory/memory.go

-	// metric idx.memory.get is the duration of memory idx gets
-	idxGetDuration = stats.NewLatencyHistogram15s32("idx.memory.get")
+	statAddFail = stats.NewCounter32("idx.memory.add.fail")
+	// metric idx.memory.add is the duration of a (successfull) add of a metric to the memory idx


successful with one l

replay · 2017-01-27T14:07:13Z

idx/memory/memory.go

+	statAddFail = stats.NewCounter32("idx.memory.add.fail")
+	// metric idx.memory.add is the duration of a (successfull) add of a metric to the memory idx
+	statAddDuration = stats.NewLatencyHistogram15s32("idx.memory.add")
+	// metric idx.memory.update is the duration of (successfull) update of a metric to the memory idx


replay · 2017-01-27T14:08:22Z

idx/memory/memory.go

+	// metric idx.memory.prune is the duration of successful memory idx prunes
+	statPruneDuration = stats.NewLatencyHistogram15s32("idx.memory.prune")
+
+	// metric idx.metrics_active is the amount of currently known metrics in the index


that might be picky: i think the number of is more appropriate than the amount of because metrics is countable and not a mass of something

replay · 2017-01-27T14:13:20Z

idx/memory/memory.go

 	for i := range defs {
 		def := defs[i]
 		pre = time.Now()
 		if _, ok := m.DefById[def.Id]; ok {
 			continue
 		}
-		m.add(&def)
-		idxAddDuration.Value(time.Since(pre))
+		err := m.add(&def)


i'm not sure about if that's the same in Go, but in C thinking I'd say:
this seems like on line 131 you copy the def by value and here you pass a reference to that copy. wouldn't it safe one copying if on line 131 you had def := &defs[1]?

yes, but then you'd have an extra pointer dereference / memory lookup. not sure which is best.
we could probably optimize this but that's out of scope of this PR and i don't have much time now to give this the attention to detail it needs, it's also rather low prio.

woodsaj · 2017-01-31T10:44:20Z

Other then @replay 's comments, this looks good to me

append() will allocate when needed

+ less work for Sprintf

1) accurately report AddOrUpdate error 2) only update store when memory update succeeded (only cassandra does this. not ES. we're not changing that here. see #500)

Note this requires an API change: If we restrict our index implementations to ones that are built around the memory index (which seems to be the case for the foreseeable future) then we don't do Get operations that can fail (result in an error condition) due to io operations failing or something. So Get cannot run into error conditions, so we can simplify the Get signature to return a 'found' bool instead of an error. Since AddOrUpdate first calls Get, and then tries to add to the index, this lets us call AddOrUpdate, knowing that errors will only happen if something's wrong with the metric and it can't be added to the index, and no longer due to the Get() step. Which in turn means we can confidently say there's no point adding the data in that case, as the error cannot be due to a transient condition, but rather something fundamentally wrong why we can't accept it. Once the metric can be added to the index, we save the data.

* split up updates vs adds. correctly categorize them * clarify and assure that all ES/cass operation (add, update, delete) metrics also include the memory part (which was already the case for most, but not all operations. and was not documented well) * operations on ES index (add/delete/update) turn into backend operations that get mingled into larger bulk operations, clarify the ambiguity. don't pretend it was an add, make it clear we can't tell for sure what kind of operation it was. * similar for Cassandra: the backend executes inserts (on behalf of adds and updates) and deletes (on behalf of updates and deletes) Even if we tied back query results to the original command, e.g. we could increment an update success counter when an insert succeeds, if we track that insert was due to an update, the update might still not be successfull if the delete fails. So the ok/fail counters should really be for queries. * also time prune operations * specify whether durations concern one metric, or several (e.g. a pattern delete) fix #304

* add metrics-active panel on top. showing tank + idx's respective views * update the row with index stats, show metrics-active again, add the new durations

* it doesn't perform well * we don't use it and discourage it * it is not well maintained, and has some known issues fix #500 fix #387 closes #297

Dieterbe requested review from replay and woodsaj January 27, 2017 12:38

replay reviewed Jan 27, 2017

View reviewed changes

woodsaj mentioned this pull request Jan 30, 2017

remove branch vs leaf node restriction in index #490

Merged

Dieterbe force-pushed the idx-metrics-and-fixes branch from 1d82ecc to 4f503d4 Compare January 31, 2017 19:28

Dieterbe added 13 commits January 31, 2017 20:31

track number of entries in index. fix #318

7565fbf

fix: properly report any errors encountered during index load

cf2f3f9

avoid some unnessecary allocations

51ee688

append() will allocate when needed

simplify Node.String()

27c3dba

+ less work for Sprintf

clarify add vs update part 1: assure correct method names

f9a272c

AddOrUpdate fixes

40c2f15

1) accurately report AddOrUpdate error 2) only update store when memory update succeeded (only cassandra does this. not ES. we're not changing that here. see #500)

cassandra: track queue wait separetely from query execution. fix #349

0baedda

simpler, more consistent idx metric names

7319db2

update dashboard

afadce2

* add metrics-active panel on top. showing tank + idx's respective views * update the row with index stats, show metrics-active again, add the new durations

deprecate ES index

3a43353

* it doesn't perform well * we don't use it and discourage it * it is not well maintained, and has some known issues fix #500 fix #387 closes #297

metrics2docs

1a87742

Dieterbe added 4 commits January 31, 2017 20:31

remove template variable values that don't make sense for others

8d5936d

use "number of" for countable things

d18d7fd

typo

e53bf3e

simplify

fd66afc

Dieterbe force-pushed the idx-metrics-and-fixes branch from 4f503d4 to fd66afc Compare January 31, 2017 19:34

Dieterbe merged commit 99ad125 into master Jan 31, 2017

This was referenced Feb 9, 2017

smarter bulkindex throttling #215

Closed

bulk indexing metrics #214

Closed

Dieterbe mentioned this pull request Mar 8, 2017

add metricdefs to cassandra if memory idx DID succeed #560

Merged

Dieterbe deleted the idx-metrics-and-fixes branch September 18, 2018 09:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Idx metrics and fixes #504

Idx metrics and fixes #504

Dieterbe commented Jan 26, 2017

Dieterbe commented Jan 26, 2017 •

edited

Loading

Dieterbe commented Jan 27, 2017

replay commented Jan 27, 2017

replay Jan 27, 2017

replay Jan 27, 2017

replay Jan 27, 2017

replay Jan 27, 2017

Dieterbe Jan 27, 2017

replay Jan 27, 2017 •

edited

Loading

replay Jan 27, 2017

replay Jan 27, 2017

replay Jan 27, 2017

replay Jan 27, 2017

Dieterbe Jan 31, 2017

woodsaj commented Jan 31, 2017

Idx metrics and fixes #504

Idx metrics and fixes #504

Conversation

Dieterbe commented Jan 26, 2017

Dieterbe commented Jan 26, 2017 • edited Loading

Dieterbe commented Jan 27, 2017

replay commented Jan 27, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

replay Jan 27, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

woodsaj commented Jan 31, 2017

Dieterbe commented Jan 26, 2017 •

edited

Loading

replay Jan 27, 2017 •

edited

Loading