Skip to content
This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

Idx metrics and fixes #504

Merged
merged 17 commits into from
Jan 31, 2017
Merged

Idx metrics and fixes #504

merged 17 commits into from
Jan 31, 2017

Conversation

Dieterbe
Copy link
Contributor

No description provided.

@Dieterbe
Copy link
Contributor Author

Dieterbe commented Jan 26, 2017

need to some testing and update the dashboard, but hardest work was designing the metrics tree, trying to keep things consistent across the different indexes, with their different implementations (memory with direct realtime operations vs cass with sometimes-delayed queries vs ES with its bulkindexer)

another thought I have is make it so that when you use cassIdx or EsIdx, you don't get the metrics with idx.memory prefix since there's lots of redundancy between cass/es and memory index, except for Get/Find/List, need to think about this a bit more if it makes sense and can be done elegantly.

TODO also: remove ES index.

@Dieterbe
Copy link
Contributor Author

ready for review.

@replay
Copy link
Contributor

replay commented Jan 27, 2017

The file docker/extra/dashboards/monitoring-stack.json seems to still have elastic search stuff

docs/metrics.md Outdated
* `idx.cassandra.delete`:
the duration of deletions from the cassandra idx
the duration of a delete of one or metrics from the cassandra idx, including the delete from the in-memory index and the delete query
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one or more

docs/metrics.md Outdated
* `idx.memory.add`:
the duration of (successfull) memory idx additions
the duration of a (successfull) add of a metric to the memory idx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

successful with one l

docs/metrics.md Outdated
* `idx.memory.prune`:
the duration of successful memory idx prunes
* `idx.memory.update`:
the duration of (successfull) update of a metric to the memory idx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

successful with one l

// metric idx.cassandra.query-delete.exec is time spent executing deletes (possibly repeatedly until success)
statQueryDeleteExecDuration = stats.NewLatencyHistogram15s32("idx.cassandra.query-delete.exec")

// metric idx.cassandra.add is the duration of an add of one metric to the cassandra idx, including the add to the in-memory index, excluding the insert query
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems a little confusing: it is the duration of an add of a metric to the cassandra idx, excluding the insert query? what is it then, just the add to the in-memory index?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right. it is confusing. the "problem" is the index uses an async worker that executes the queries in the background. so this metric covers add to-in-memory as well as generating the metricdefinition and putting it on the write queue.

in general these durations all cover the "foreground" stuff, and exclude the "background" stuff. perhaps in docs/metadata.md we should clarify what is foreground and background, instead of trying to stuff everything in these docs.

c.writeQueue <- writeReq{recvTime: time.Now(), def: def}
}
statUpdateDuration.Value(time.Since(pre))
return err
}
return nil
} else {
Copy link
Contributor

@replay replay Jan 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this else block seems unnecessary because the previous if block always returns.

// metric idx.memory.get is the duration of memory idx gets
idxGetDuration = stats.NewLatencyHistogram15s32("idx.memory.get")
statAddFail = stats.NewCounter32("idx.memory.add.fail")
// metric idx.memory.add is the duration of a (successfull) add of a metric to the memory idx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

successful with one l

statAddFail = stats.NewCounter32("idx.memory.add.fail")
// metric idx.memory.add is the duration of a (successfull) add of a metric to the memory idx
statAddDuration = stats.NewLatencyHistogram15s32("idx.memory.add")
// metric idx.memory.update is the duration of (successfull) update of a metric to the memory idx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^^

// metric idx.memory.prune is the duration of successful memory idx prunes
statPruneDuration = stats.NewLatencyHistogram15s32("idx.memory.prune")

// metric idx.metrics_active is the amount of currently known metrics in the index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that might be picky: i think the number of is more appropriate than the amount of because metrics is countable and not a mass of something

for i := range defs {
def := defs[i]
pre = time.Now()
if _, ok := m.DefById[def.Id]; ok {
continue
}
m.add(&def)
idxAddDuration.Value(time.Since(pre))
err := m.add(&def)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure about if that's the same in Go, but in C thinking I'd say:
this seems like on line 131 you copy the def by value and here you pass a reference to that copy. wouldn't it safe one copying if on line 131 you had def := &defs[1]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but then you'd have an extra pointer dereference / memory lookup. not sure which is best.
we could probably optimize this but that's out of scope of this PR and i don't have much time now to give this the attention to detail it needs, it's also rather low prio.

@woodsaj
Copy link
Member

woodsaj commented Jan 31, 2017

Other then @replay 's comments, this looks good to me

append() will allocate when needed
+ less work for Sprintf
1) accurately report AddOrUpdate error
2) only update store when memory update succeeded
   (only cassandra does this. not ES. we're not changing that here.
   see #500)
Note this requires an API change:
If we restrict our index implementations to ones that are built around the memory index (which seems to be the case for the foreseeable future)
then we don't do Get operations that can fail (result in an error condition) due to io operations failing or something.
So Get cannot run into error conditions, so we can simplify the Get signature to return a 'found' bool instead of an error.
Since AddOrUpdate first calls Get, and then tries to add to the index,
this lets us call AddOrUpdate, knowing that errors will only happen if something's wrong with the metric and it can't be added to the index,
and no longer due to the Get() step.
Which in turn means we can confidently say there's no point adding the data in that case, as the error cannot be due to a transient condition,
but rather something fundamentally wrong why we can't accept it.

Once the metric can be added to the index, we save the data.
* split up updates vs adds. correctly categorize them
* clarify and assure that all ES/cass operation (add, update, delete) metrics also include the memory part
  (which was already the case for most, but not all operations. and was not documented well)
* operations on ES index (add/delete/update) turn into backend operations that
  get mingled into larger bulk operations, clarify the ambiguity.
  don't pretend it was an add, make it clear we can't tell for sure what
  kind of operation it was.
* similar for Cassandra: the backend executes inserts (on behalf of adds
  and updates) and deletes (on behalf of updates and deletes)
  Even if we tied back query results to the original command, e.g. we
  could increment an update success counter when an insert succeeds, if
  we track that insert was due to an update, the update might still not
  be successfull if the delete fails. So the ok/fail counters should
  really be for queries.
* also time prune operations
* specify whether durations concern one metric, or several (e.g. a pattern delete)

fix #304
* add metrics-active panel on top. showing tank + idx's respective views
* update the row with index stats, show metrics-active again, add the
  new durations
* it doesn't perform well
* we don't use it and discourage it
* it is not well maintained, and has some known issues

fix #500
fix #387
closes #297
@Dieterbe Dieterbe merged commit 99ad125 into master Jan 31, 2017
This was referenced Feb 9, 2017
@Dieterbe Dieterbe deleted the idx-metrics-and-fixes branch September 18, 2018 09:00
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants