Skip to content
This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

update memory index prune to hold write lock as briefly as possible #787

Merged
merged 1 commit into from
Dec 21, 2017

Conversation

DanCech
Copy link
Contributor

@DanCech DanCech commented Dec 15, 2017

Compiles and passes all tests, I haven't done any real-world testing though.

@DanCech DanCech self-assigned this Dec 15, 2017
@Dieterbe
Copy link
Contributor

we need to add a benchmark case so we can properly compare before and after this change. i can do this


log.Debug("memory-idx: series %s for orgId:%d is stale. pruning it.", n.Path, org)
defs := m.delete(org, n, true)
statMetricsActive.Dec()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated to this PR, but this looks incorrect. seems like m.delete may delete several paths (children), this assumes 1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the if !n.Leaf() { check on line 912 mean that there should never be children? We could add that same check here in case of a race condition where a child is added to a series that's about to be pruned, but that seems like a highly unlikely scenario.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the if !n.Leaf() { check on line 912 mean that there should never be children?

the only thing that branch does is assure that nodes without metricdefs (data corresponding to that path) are not subject to pruning.
nodes with metricdefs (leaf nodes) are subject to pruning. I think the confusion here is because
in our implementation, a path can be both a leaf and have children.
e.g. foo.bar can have data/metricdef (making it a leaf), while foo.bar.baz also exists (making foo.bar a branch)
In graphite this doesn't work, but some of our customers wanted this behavior, and it seemed fairly easy to support, so that's why we do.

the m.delete call is recursive: it deletes the path, and any sub-paths (I think i actually just discovered a related bug here), so what we should do is track the actual amount of paths that got deleted by the delete call, or I think better: the pruning process shouldn't recursively delete, but rather only delete leaves (and clean up stale branch nodes that no longer have children)
But this warrants a new ticket, so i opened #797

@Dieterbe
Copy link
Contributor

2 comments here:

  1. metrics are no longer being pruned atomically. this may look a bit weird when looking at old data (or executing an alert query) right when a prune is ongoing, but this seems only a minor issue. not a stopper
  2. perhaps we should process toPrune in batches, to amortize lock overhead and get higher prune throughput at the expense of a slightest (should be neglible) increase in latency. e.g. pruning 100 items at once or so. a benchmark could also help with this.

@DanCech
Copy link
Contributor Author

DanCech commented Dec 20, 2017

  1. I'm not sure that I follow how this would be an issue?
  2. How many series do we actually see being pruned at any given time? If it's very high then it might be worthwhile, but I don't have any context there. I doubt it's going to be worth the added complexity.

@Dieterbe
Copy link
Contributor

  1. maybe i should have called it a minor inconvenience, not a minor issue
  2. let's not worry about it for now

@Dieterbe
Copy link
Contributor

Dieterbe commented Dec 21, 2017

we need to add a benchmark case so we can properly compare before and after this change. i can do this

i changed my mind, don't have enough time and doing a bench is not that useful here as one can obviously tell that the total duration/bandwidth should be in line (possibly a bit worse), but the big gain is obviously the much more granular index locking, which is very hard/unfeasible to convey via a benchmark, and is what we really need.

benchmark would be more useful once we make other tweaks (like shortcutting the expensive log calls), but I think the best bang for our buck now is to just merge and deploy this and we will probably not have to look at this in quite a while.

@Dieterbe Dieterbe merged commit 4cb8dac into grafana:master Dec 21, 2017
@DanCech DanCech deleted the prune-locking branch December 21, 2017 14:11
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants