
[metrics] add lru cache age metric #536

Closed
wants to merge 2 commits into from

Conversation

joeljeske
Contributor

@joeljeske joeljeske commented Mar 30, 2022

This PR introduces a new metric, bazel_remote_disk_cache_longest_item_idle_time_seconds, which tracks the age of the next-to-evict item using the filesystem atime. This allows a cache monitor to observe the relative lifespan of cache entries, which is useful when tuning the size of the cache.

This only works for items owned by the disk cache, not items proxied to another cache or S3, but I believe that matches how the rest of these metrics currently behave.

@joeljeske joeljeske force-pushed the lru-atime-metric branch 2 times, most recently from a7bcf9b to 0c4d9ee Compare March 30, 2022 20:18
Collaborator

@mostynb mostynb left a comment


Hi, thanks for the PR. This sounds like something worth measuring, but I have a couple of suggestions...

@@ -156,8 +157,15 @@ func New(dir string, maxSizeBytes int64, opts ...Option) (Cache, error) {

f := filepath.Join(dir, c.FileLocation(kind, value.legacy, hash, value.size, value.random))

ts, err := atime.Stat(f)
Collaborator

Could you try moving this into the removeFile function? It runs in another goroutine without holding the LRU index lock, and has a semaphore to limit the number of OS threads we spawn when there are eviction spikes.

Comment on lines 72 to 74
gaugeCacheLastEvictAtime: prometheus.NewGauge(prometheus.GaugeOpts{
Name: "bazel_remote_disk_cache_last_evicted_atime",
Help: "The access time of the most recently evicted item from the LRU cache.",
Collaborator

I wonder if a better metric would be the amount of time since the last evicted item was last touched? ie "now - atime of evicted item" - that way a graph of the metric over time would be a flat line when things are fairly consistent, instead of a line with a constant positive gradient.

@ulrfa
Contributor

ulrfa commented Mar 31, 2022

I assume that most bazel-remote deployments use the ‘relatime’ mount option for performance reasons. That is also the default for linux. In other words, the filesystem atime is often updated only once per day. See mount(8) man page. I think the metric ‘now – oldest atime’ is still useful and a good idea! I think it should be reported as number of seconds, because second is the base unit for Prometheus, but with a comment referring to ‘relatime’ and explaining that the accuracy is typically days.

However I would like to avoid the extra overhead of a file system stat operation whenever a file is evicted, since file system accesses are one of the main bottlenecks in my deployments. And I think it is overkill considering the typical accuracy of number of days. What do you think about having a periodical task scheduled once per minute, that could stat the oldest entry in the lru and update a prometheus gauge?

@joeljeske
Contributor Author

Thanks @ulrfa and @mostynb, those are great and helpful notes.

Based on the concern about putting the atime in the critical path, I re-worked the gauge to fetch the atime of the next-to-evict item on demand during a metric scrape, so it is always up to date without stat'ing files on every eviction.

Since this metric is always up to date and never stale, I was able to report now() - atime directly, which makes for an easier metric. (I had been using time() - metric in my charting, but this is better, I agree.)

@mostynb
Collaborator

mostynb commented Mar 31, 2022

What if instead of now() - atime we tracked now() - ctime ? We'd be tracking how long items have been in the cache rather than how long they've been unused while in the cache. When the cache is undersized they'd give the similar results IIUC, and we wouldn't need to worry about noatime or relatime settings. When the cache is appropriately sized you might still see the occasional low value when unused items pass through the cache, as long as the value isn't low very frequently you wouldn't need to worry.

Also, how frequently are prometheus gauges sampled?

@joeljeske
Contributor Author

ctime is certainly a valuable metric as a fallback to atime.

If atime is available, I think it is a much more valuable metric. Since the LRU should ideally be sorted exactly by atime, it gives a good representation of the age of the cache.

Ctime would not be nearly as valuable, as it might often show a much older cache time and be quite a bit jumpier. It could easily have large swings in value.

I've been considering the primary use case of a cache that is at capacity and therefore actively evicting as new items come in. I think this is the use case that makes the most sense to optimize for.

Would you like to perhaps add both ctime and atime metrics, or maybe configure it behind a flag?

The Prometheus gauge func is only called during an active scrape, so it depends on the frequency of the scraper, which is outside the scope of the client providing the metric. The default scrape interval is 30s, I believe.

@ulrfa
Contributor

ulrfa commented Apr 1, 2022

I think the mu mutex in disk.go must be locked while accessing the lru data structure. And not be released until after the stat system call finish, in order to prevent the file from being deleted between the lru data access and the stat call.

Then we also need to avoid any risk for deadlock between our mu mutex and other potential locks in the prometheus library. Is the prometheus library making any explicit guarantee that it is not holding any internal locks while invoking our gauge callback function? Even if prometheus would claim such guarantee today, I would sleep better if we do not rely on it. I think the easiest way to avoid risk for deadlock is to have our own periodical task scheduled once per minute, instead of callback from prometheus.

@ulrfa
Contributor

ulrfa commented Apr 1, 2022

Ctime would not be nearly as valuable, as it might often show a much older cache time and be quite a bit jumpier. It could easily have large swings in value.

I also prefer atime instead of ctime, for that reason.

Would you like to perhaps add both ctime and atime metrics, or maybe configure it behind a flag?

With relatime mount option, the atime is always updated at least when ctime is updated, and then I don't see a use case for ctime for this metric.

I would not recommend deploying bazel-remote with noatime mount option, because bazel-remote use the atime when reloading lru after restart, but if someone is doing that and also are interested in this metric, then perhaps that could be a future improvement outside of this pull request?

What do you think @mostynb?

@joeljeske
Contributor Author

Great, thanks for the direction.

I updated the flow back to a standard metric that updates once a minute, and I ensured we lock the lru mu during that process. (Given this constraint, it made more sense to me to push the metric into disk itself instead of lru.)

I think this metric will be very valuable in helping users discover the proper sizing for their cache.

Please let me know if there is anything else you would like to see changed :-)

@joeljeske joeljeske changed the title [metrics] add lru evict atime metric [metrics] add lru cache age metric Apr 1, 2022
@ulrfa
Contributor

ulrfa commented Apr 3, 2022

Thanks @joeljeske, implementation looks good to me, and I think this metric is useful.

Let's see what @mostynb says.

@joeljeske joeljeske force-pushed the lru-atime-metric branch 2 times, most recently from 0531a74 to 37b3143 Compare April 4, 2022 15:41
@joeljeske
Contributor Author

@mostynb should be ready for review now. This last push was to green up the tests; I had missed the "empty cache" case, which caused an error on initial startup, but everything is green now.

Thanks!

@mostynb
Collaborator

mostynb commented Apr 4, 2022

Sorry it took me a while to get back to this PR.

Would you like to perhaps add both ctime and atime metrics, or maybe configure it behind a flag?

If possible, I would prefer to settle on a single metric and avoid adding another flag.

If atime is available, I think it is a much more valuable metric. Since the LRU should ideally be sorted exactly by atime, it gives a good representation of the age of the cache.

If we have an accurate atime, then that would give a nice accurate metric. But my concern is that "now - atime" is useless if using noatime, and possibly too coarse (1 day) if using relatime with a high-churn cache. I can imagine that some long-running bazel-remote instances might be using the noatime mount option, because we currently only use the atime on startup, not during normal operation of the cache.

If we used "now - ctime" we'd still have a meaningful and fairly accurate value when using either noatime or relatime mount options, at the cost of a more confusing alert condition on the metric: you'd need to watch for situations where the oldest cache item is frequently not very old (and how do you define "frequently"?).

Alternatively, if we stored the last access time in the LRU index, it would always be accurate regardless of the filesystem mount options, except immediately after cache startup, since it is initialised from atime. I have one cache with about 10 million items; if we stored int64 unix times in the index, that would add about 76 MiB (ignoring potential struct-packing overhead). That's not an unreasonable amount of extra memory usage to add. I'm unsure if I would prefer this over the other options.

I will add some comments to the code, but I'd like to think through the options mentioned above a bit longer before merging this (both of your opinions are welcome on this).

@@ -280,3 +280,13 @@ func (c *SizedLRU) removeElement(e *list.Element) {
func roundUp4k(n int64) int64 {
return (n + BlockSize - 1) & -BlockSize
}

// Get the back item of the LRU cache.
func (c *SizedLRU) PeakOldestCacheItem() (Key, lruItem) {
Collaborator

This doesn't seem to be used outside the disk package, so it doesn't need to be exported. How about calling it getTailItem, since the least recently used item might not be the oldest item in the index (and we already used lruItem for the element type so getLRUItem might be confusing) ?

Comment on lines 145 to 146
Name: "bazel_remote_disk_cache_age_seconds",
Help: "The file `atime` of oldest item in the LRU cache. Depending on filemount opts (e.g. relatime), the resolution may be meausured in 'days' and not accurate to the second",
Collaborator

I don't think "age" is the right word to describe "now - atime" of the least recently used item. If we do end up using atime, what do you think of the term "longest item idle time"?

go c.pollCacheAge()
}

// Update metric every minute with the current age of the cache
Collaborator

nitpick: "Update metric every minute with the idle time of the least recently used item in the cache" (assuming we stick with "now - atime")

}
}

// Get the back item in the cache, and store its atime in a metric
Collaborator

nitpick: "Get the idle time of the least-recently used item in the cache, and store the value in a metric" (assuming we stick with "now - atime")

ts, err := atime.Stat(f)

if err != nil {
log.Printf("ERROR: failed to determine cache age: %v, unable to stat %s", err, f)
Collaborator

If this is the case, we'll set the metric to 0.0 below, which might trigger alarms. I guess we should set a bool here, and only update the metric below if the bool is true.

Contributor Author

That's fine with me, to only update it on success. I was thinking that if the cache were consistently unable to stat the file, we would not want to keep broadcasting a potentially very old previous value. I expect most users would chart an average of this metric over time (30m?), so occasional and rare 0 values would be naturally filtered out.

@joeljeske
Contributor Author

Thank you for the feedback and thoughtful responses.

atime vs ctime

Let me explain my hesitations of ctime

In a bazel-managed repo & cache, some actions and cache entries are rather static, whereas others change frequently. Consider rules_go, where the Go compiler wrapper itself is compiled and cached as an action within bazel. This action result should be accessed in every build and thus never be evicted. In a right-sized cache, it will only be evicted when rules_go is upgraded, or perhaps when bazel itself is upgraded; at that point it will eventually be evicted because it is no longer accessed. When it becomes the next-to-evict item, it will have a very old ctime but a recently-accessed atime. In a given bazel-managed repo, there are likely many actions that fall along the spectrum from "frequently invalidated" to "never invalidated". Even consider a leaf package in this repo such as //utils/zstdpool: it was last changed in Dec 2020. The next time it is touched, its cache item will eventually be evicted and thus report an unusually large "age".

Due to this, I think that measuring ctime would make it very difficult to get an accurate view of the age of the least-recently-accessed item.

FS atime or user-land atime

A user-land atime could be a fine solution; however, I was concerned about the overhead for folks who do not care about this metric. I don't have a good view of the average consumer's cache size or number of items in cache, so this could be costly, and I'll defer that decision to you. I had seen some recent issues regarding memory size and did not want to contribute to that :)

With FS-managed atime, I would first argue that relatime does not produce data that is too coarse. With Bazel, a right-sized cache likely holds on the order of one week of content; dropping much lower than that would likely result in active items being evicted prematurely. If you are only retaining one day of content, there is likely a sizable ratio of items being re-computed to re-populate previously evicted cache entries. Additionally, when items are populated and evicted in less than a day, the atime reflects the ctime and is still a valuable datapoint. This all varies by installation and depends on the age gap between source-control branches, but I would still reason that one day of coarseness is an acceptable trade-off.

Lastly, since this is an additional metric and not a piece of core functionality, it could be documented that noatime is not compatible with this metric, and that the metric reports 0 when noatime is set. Requiring at minimum relatime in order to use this metric seems like a proportionate trade-off.

@joeljeske
Contributor Author

I'll wait to update with review comments until a strategy forward is decided 👍

@ulrfa
Contributor

ulrfa commented Apr 5, 2022

confusing alert condition on the metric: you'd need to watch for situations where the oldest cache item is frequently not very old (and how do you define "frequently"?).

Yes, that would be confusing and hard to define "frequently". For me it feels hard to draw conclusions based on ctime.

Due to this, I would think that if measuring ctime it would be very difficult to get an accurate view of age of the least-recently-accessed item.

I agree.

Alternatively, if we stored the last access time in the LRU index, it would always be accurate regardless of the filesystem mount options except immediately after cache startup, since it's initialised with atime.

I guess the overhead (request latency and memory consumption) would be insignificant for most scenarios, even for FindMissingBlobs where a single request could affect lots of entries. But I don't know for sure and would prefer keeping it as lean as possible without additional information in the LRU index.

It could also be confusing if this metric suddenly changes by just restarting bazel-remote. (Trying to store LRU index atime persistently would cause unnecessary complexity in my opinion).

it could be documented that noatime is not compatible with this metric and that this metric reports 0 when noatime is set. It seems like a proportionate trade-off to require at minimum relatime in order to leverage this metric.

Sounds reasonable to me.

and possibly too coarse (1 day) if using relatime with a high-churn cache.

For my use cases it is enough with 1 day accuracy.

I now checked two of my cache instances, the oldest atime was about 4 days ago.

@joeljeske
Contributor Author

It could also be confusing if this metric suddenly changes by just restarting bazel-remote. (Trying to store LRU index atime persistently would cause unnecessary complexity in my opinion).

This is a very good point, and valid enough for me to feel strongly against user-land atime

What are your thoughts @mostynb?

@mostynb
Collaborator

mostynb commented Apr 7, 2022

What are your thoughts @mostynb?

OK, let's go with "now - atime" and document the limitations.

@joeljeske
Contributor Author

Thanks for the guidance here @mostynb and @ulrfa.

The requested updates have been made and ready for review. Thanks!

Collaborator

@mostynb mostynb left a comment

Looks good, this can land once the metric description has been fixed.

@mostynb
Collaborator

mostynb commented Apr 8, 2022

Thanks for persevering with this- merged with the documentation update + more detailed commit message.

@mostynb mostynb closed this Apr 8, 2022
@joeljeske
Contributor Author

Thanks so much! I hope it's useful. What is your typical release cadence, so we can start using this metric?

@mostynb
Collaborator

mostynb commented Apr 8, 2022

Development isn't active enough at the moment to have a regular release frequency, so it depends a bit on how important the updates are. I'll try to take a look at #539 over the weekend (I didn't manage to get to it this week) and see if it's worth trying to land that before making the next release.

@ulrfa
Contributor

ulrfa commented Apr 12, 2022

Thanks @joeljeske and @mostynb!

@joeljeske
Contributor Author

@mostynb do you think we could cut a new release soon? 😃

@mostynb
Collaborator

mostynb commented Apr 24, 2022

Released v2.3.7 with this change. Sorry for the delay. Next up: #542 and #539.
