-
Notifications
You must be signed in to change notification settings - Fork 105
Conversation
need to some testing and update the dashboard, but hardest work was designing the metrics tree, trying to keep things consistent across the different indexes, with their different implementations (memory with direct realtime operations vs cass with sometimes-delayed queries vs ES with its bulkindexer) another thought I have is make it so that when you use cassIdx or EsIdx, you don't get the metrics with TODO also: remove ES index. |
ready for review. |
The file |
docs/metrics.md
Outdated
* `idx.cassandra.delete`: | ||
the duration of deletions from the cassandra idx | ||
the duration of a delete of one or metrics from the cassandra idx, including the delete from the in-memory index and the delete query |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one or more
docs/metrics.md
Outdated
* `idx.memory.add`: | ||
the duration of (successfull) memory idx additions | ||
the duration of a (successfull) add of a metric to the memory idx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
successful with one l
docs/metrics.md
Outdated
* `idx.memory.prune`: | ||
the duration of successful memory idx prunes | ||
* `idx.memory.update`: | ||
the duration of (successfull) update of a metric to the memory idx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
successful with one l
// metric idx.cassandra.query-delete.exec is time spent executing deletes (possibly repeatedly until success) | ||
statQueryDeleteExecDuration = stats.NewLatencyHistogram15s32("idx.cassandra.query-delete.exec") | ||
|
||
// metric idx.cassandra.add is the duration of an add of one metric to the cassandra idx, including the add to the in-memory index, excluding the insert query |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that seems a little confusing: it is the duration of an add of a metric to the cassandra idx, excluding the insert query? what is it then, just the add to the in-memory index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right. it is confusing. the "problem" is the index uses an async worker that executes the queries in the background. so this metric covers add to-in-memory as well as generating the metricdefinition and putting it on the write queue.
in general these durations all cover the "foreground" stuff, and exclude the "background" stuff. perhaps in docs/metadata.md we should clarify what is foreground and background, instead of trying to stuff everything in these docs.
idx/cassandra/cassandra.go
Outdated
c.writeQueue <- writeReq{recvTime: time.Now(), def: def} | ||
} | ||
statUpdateDuration.Value(time.Since(pre)) | ||
return err | ||
} | ||
return nil | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this else
block seems unnecessary because the previous if
block always returns.
idx/memory/memory.go
Outdated
// metric idx.memory.get is the duration of memory idx gets | ||
idxGetDuration = stats.NewLatencyHistogram15s32("idx.memory.get") | ||
statAddFail = stats.NewCounter32("idx.memory.add.fail") | ||
// metric idx.memory.add is the duration of a (successfull) add of a metric to the memory idx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
successful with one l
idx/memory/memory.go
Outdated
statAddFail = stats.NewCounter32("idx.memory.add.fail") | ||
// metric idx.memory.add is the duration of a (successfull) add of a metric to the memory idx | ||
statAddDuration = stats.NewLatencyHistogram15s32("idx.memory.add") | ||
// metric idx.memory.update is the duration of (successfull) update of a metric to the memory idx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^^
idx/memory/memory.go
Outdated
// metric idx.memory.prune is the duration of successful memory idx prunes | ||
statPruneDuration = stats.NewLatencyHistogram15s32("idx.memory.prune") | ||
|
||
// metric idx.metrics_active is the amount of currently known metrics in the index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that might be picky: i think the number of
is more appropriate than the amount of
because metrics
is countable and not a mass of something
for i := range defs { | ||
def := defs[i] | ||
pre = time.Now() | ||
if _, ok := m.DefById[def.Id]; ok { | ||
continue | ||
} | ||
m.add(&def) | ||
idxAddDuration.Value(time.Since(pre)) | ||
err := m.add(&def) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not sure about if that's the same in Go, but in C thinking I'd say:
this seems like on line 131
you copy the def
by value and here you pass a reference to that copy. wouldn't it safe one copying if on line 131
you had def := &defs[1]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but then you'd have an extra pointer dereference / memory lookup. not sure which is best.
we could probably optimize this but that's out of scope of this PR and i don't have much time now to give this the attention to detail it needs, it's also rather low prio.
Other then @replay 's comments, this looks good to me |
1d82ecc
to
4f503d4
Compare
append() will allocate when needed
+ less work for Sprintf
1) accurately report AddOrUpdate error 2) only update store when memory update succeeded (only cassandra does this. not ES. we're not changing that here. see #500)
Note this requires an API change: If we restrict our index implementations to ones that are built around the memory index (which seems to be the case for the foreseeable future) then we don't do Get operations that can fail (result in an error condition) due to io operations failing or something. So Get cannot run into error conditions, so we can simplify the Get signature to return a 'found' bool instead of an error. Since AddOrUpdate first calls Get, and then tries to add to the index, this lets us call AddOrUpdate, knowing that errors will only happen if something's wrong with the metric and it can't be added to the index, and no longer due to the Get() step. Which in turn means we can confidently say there's no point adding the data in that case, as the error cannot be due to a transient condition, but rather something fundamentally wrong why we can't accept it. Once the metric can be added to the index, we save the data.
* split up updates vs adds. correctly categorize them * clarify and assure that all ES/cass operation (add, update, delete) metrics also include the memory part (which was already the case for most, but not all operations. and was not documented well) * operations on ES index (add/delete/update) turn into backend operations that get mingled into larger bulk operations, clarify the ambiguity. don't pretend it was an add, make it clear we can't tell for sure what kind of operation it was. * similar for Cassandra: the backend executes inserts (on behalf of adds and updates) and deletes (on behalf of updates and deletes) Even if we tied back query results to the original command, e.g. we could increment an update success counter when an insert succeeds, if we track that insert was due to an update, the update might still not be successfull if the delete fails. So the ok/fail counters should really be for queries. * also time prune operations * specify whether durations concern one metric, or several (e.g. a pattern delete) fix #304
* add metrics-active panel on top. showing tank + idx's respective views * update the row with index stats, show metrics-active again, add the new durations
4f503d4
to
fd66afc
Compare
No description provided.