
[metrics] add lru cache age metric #536

Closed
wants to merge 2 commits into from

Conversation

joeljeske
Contributor

@joeljeske joeljeske commented Mar 30, 2022

This PR introduces a new metric, bazel_remote_disk_cache_longest_item_idle_time_seconds, which tracks the age of the next-to-evict item using the filesystem atime. This allows a cache monitor to observe the relative lifespan of cache entries, which is useful when tuning the size of the cache.

This only works for items owned by the disk cache, not items proxied to another cache or S3, but I believe that matches how the rest of these metrics currently behave.

@joeljeske joeljeske force-pushed the lru-atime-metric branch 2 times, most recently from a7bcf9b to 0c4d9ee Compare March 30, 2022 20:18
Collaborator

@mostynb mostynb left a comment


Hi, thanks for the PR. This sounds like something worth measuring, but I have a couple of suggestions...

@@ -156,8 +157,15 @@ func New(dir string, maxSizeBytes int64, opts ...Option) (Cache, error) {

f := filepath.Join(dir, c.FileLocation(kind, value.legacy, hash, value.size, value.random))

ts, err := atime.Stat(f)
Collaborator

Could you try moving this into the removeFile function? It runs in another goroutine without holding the LRU index lock, and has a semaphore to limit the number of OS threads we spawn when there are eviction spikes.

Comment on lines 72 to 74
gaugeCacheLastEvictAtime: prometheus.NewGauge(prometheus.GaugeOpts{
Name: "bazel_remote_disk_cache_last_evicted_atime",
Help: "The access time of the most recently evicted item from the LRU cache.",
Collaborator

I wonder if a better metric would be the amount of time since the last evicted item was last touched? ie "now - atime of evicted item" - that way a graph of the metric over time would be a flat line when things are fairly consistent, instead of a line with a constant positive gradient.

@ulrfa
Contributor

ulrfa commented Mar 31, 2022

I assume that most bazel-remote deployments use the ‘relatime’ mount option for performance reasons. That is also the default for linux. In other words, the filesystem atime is often updated only once per day. See mount(8) man page. I think the metric ‘now – oldest atime’ is still useful and a good idea! I think it should be reported as number of seconds, because second is the base unit for Prometheus, but with a comment referring to ‘relatime’ and explaining that the accuracy is typically days.

However I would like to avoid the extra overhead of a file system stat operation whenever a file is evicted, since file system accesses are one of the main bottlenecks in my deployments. And I think it is overkill considering the typical accuracy of number of days. What do you think about having a periodical task scheduled once per minute, that could stat the oldest entry in the lru and update a prometheus gauge?

@joeljeske
Contributor Author

Thanks @ulrfa and @mostynb, those are great and helpful notes.

Based on the concern about putting the atime in the critical path, I re-worked the gauge to fetch the atime of the next-to-evict item on demand during a metric scrape, so it is always up to date without stat'ing files on every eviction.

Since this metric is always up to date and never stale, I was able to report now() - atime directly, which makes for an easier metric. (I had been using time() - metric in my charting, but this is better, I agree.)

@mostynb
Collaborator

mostynb commented Mar 31, 2022

What if instead of now() - atime we tracked now() - ctime ? We'd be tracking how long items have been in the cache rather than how long they've been unused while in the cache. When the cache is undersized they'd give the similar results IIUC, and we wouldn't need to worry about noatime or relatime settings. When the cache is appropriately sized you might still see the occasional low value when unused items pass through the cache, as long as the value isn't low very frequently you wouldn't need to worry.

Also, how frequently are prometheus gauges sampled?

@joeljeske
Contributor Author

ctime is certainly a valuable metric as a fallback to atime.

If atime is available, I think it is a much more valuable metric. Since the LRU should ideally be sorted exactly by atime, it gives a good representation of the age of the cache.

Ctime would not be nearly as valuable, as it might often show a much older cache time and be quite a bit jumpier. It could easily have large swings in value.

I've been considering the primary use case of a cache that is at capacity and therefore actively evicting as new items come in. I think this is the use case that makes the most sense to optimize for.

Would you like to perhaps add both ctime and atime metrics, or maybe configure it behind a flag?

The Prometheus gauge func is only called during an active scrape, so it depends on the frequency of the scraper, which is outside the scope of the client providing the metric. The default scrape interval is 30s, I believe.

@ulrfa
Contributor

ulrfa commented Apr 1, 2022

I think the mu mutex in disk.go must be locked while accessing the lru data structure. And not be released until after the stat system call finish, in order to prevent the file from being deleted between the lru data access and the stat call.

Then we also need to avoid any risk for deadlock between our mu mutex and other potential locks in the prometheus library. Is the prometheus library making any explicit guarantee that it is not holding any internal locks while invoking our gauge callback function? Even if prometheus would claim such guarantee today, I would sleep better if we do not rely on it. I think the easiest way to avoid risk for deadlock is to have our own periodical task scheduled once per minute, instead of callback from prometheus.

@ulrfa
Contributor

ulrfa commented Apr 1, 2022

Ctime would not be nearly as valuable, as it might often show a much older cache time and be quite a bit jumpier. It could easily have large swings in value.

I also prefer atime instead of ctime, for that reason.

Would you like to perhaps add both ctime and atime metrics, or maybe configure it behind a flag?

With relatime mount option, the atime is always updated at least when ctime is updated, and then I don't see a use case for ctime for this metric.

I would not recommend deploying bazel-remote with noatime mount option, because bazel-remote use the atime when reloading lru after restart, but if someone is doing that and also are interested in this metric, then perhaps that could be a future improvement outside of this pull request?

What do you think @mostynb?

@joeljeske
Contributor Author

Great, thanks for the direction.

I updated the flow back to a standard metric that updates once a minute, and I ensured we lock the lru mu during that process. (Given this constraint, it made more sense to me to push the metric into disk itself instead of lru.)

I think this metric will be very valuable in helping users discover the proper sizing for their cache.

Please let me know if there is anything else you would like to see changed :-)

@joeljeske joeljeske changed the title [metrics] add lru evict atime metric [metrics] add lru cache age metric Apr 1, 2022
@ulrfa
Contributor

ulrfa commented Apr 3, 2022

Thanks @joeljeske, implementation looks good to me, and I think this metric is useful.

Let's see what @mostynb says.

@joeljeske joeljeske force-pushed the lru-atime-metric branch 2 times, most recently from 0531a74 to 37b3143 Compare April 4, 2022 15:41
@joeljeske
Contributor Author

@mostynb should be ready for review now. This last push was to green up the tests; I had missed the "empty cache" case, which caused an error on initial startup, but everything is green now.

Thanks!

@mostynb
Collaborator

mostynb commented Apr 4, 2022

Sorry it took me a while to get back to this PR.

Would you like to perhaps add both ctime and atime metrics, or maybe configure it behind a flag?

If possible, I would prefer to settle on a single metric and avoid adding another flag.

If atime is available, I think it is a much more valuable metric. Since the LRU should ideally be sorted exactly by atime, it gives a good representation of the age of the cache.

If we have an accurate atime, then that would give a nice accurate metric. But my concern is that "now - atime" is useless if using noatime, and possibly too coarse (1 day) if using relatime with a high-churn cache. I can imagine that some long-running bazel-remote instances might be using the noatime mount option, because we currently only use the atime on startup, not during normal operation of the cache.

If we used "now - ctime" we'd still have a meaningful and fairly accurate value when using either noatime or relatime mount options, at the cost of a more confusing alert condition on the metric: you'd need to watch for situations where the oldest cache item is frequently not very old (and how do you define "frequently"?).

Alternatively, if we stored the last access time in the LRU index, it would always be accurate regardless of the filesystem mount options, except immediately after cache startup, since it is initialised from atime. I have one cache with about 10 million items; if we stored int64 unix times in the index, that would add about 76 MiB (ignoring potential struct-packing overhead). That's not an unreasonable amount of extra memory usage to add. I'm unsure if I would prefer this over the other options.

I will add some comments to the code, but I'd like to think through the options mentioned above a bit longer before merging this (both of your opinions are welcome on this).

@@ -280,3 +280,13 @@ func (c *SizedLRU) removeElement(e *list.Element) {
func roundUp4k(n int64) int64 {
return (n + BlockSize - 1) & -BlockSize
}

// Get the back item of the LRU cache.
func (c *SizedLRU) PeakOldestCacheItem() (Key, lruItem) {
Collaborator

This doesn't seem to be used outside the disk package, so it doesn't need to be exported. How about calling it getTailItem, since the least recently used item might not be the oldest item in the index (and we already used lruItem for the element type so getLRUItem might be confusing) ?

Comment on lines 145 to 146
Name: "bazel_remote_disk_cache_age_seconds",
Help: "The file `atime` of oldest item in the LRU cache. Depending on filemount opts (e.g. relatime), the resolution may be meausured in 'days' and not accurate to the second",
Collaborator

I don't think "age" is the right word to describe "now - atime" of the least recently used item. If we do end up using atime, what do you think of the term "longest item idle time"?

go c.pollCacheAge()
}

// Update metric every minute with the current age of the cache
Collaborator

nitpick: "Update metric every minute with the idle time of the least recently used item in the cache" (assuming we stick with "now - atime")

}
}

// Get the back item in the cache, and store its atime in a metric
Collaborator

nitpick: "Get the idle time of the least-recently used item in the cache, and store the value in a metric" (assuming we stick with "now - atime")

ts, err := atime.Stat(f)

if err != nil {
log.Printf("ERROR: failed to determine cache age: %v, unable to stat %s", err, f)
Collaborator

If this is the case, we'll set the metric to 0.0 below, which might trigger alarms. I guess we should set a bool here, and only update the metric below if the bool is true.

Contributor Author

That's fine with me, to only update it on success. I was thinking that if the cache were consistently unable to stat the file, we would not want to keep broadcasting a potentially very old previous value. I expect most users would chart an average of this metric over time (30m?), so occasional and rare 0 values would be naturally filtered out.

@joeljeske
Contributor Author

Thank you for the feedback and thoughtful responses.

atime vs ctime

Let me explain my hesitations of ctime

In a bazel-managed repo & cache, some actions and cache entries are rather static, whereas others change frequently. Consider rules_go, where the Go compiler wrapper itself is compiled and cached as an action within bazel. This action result should be accessed in every build and thus never be evicted. In a right-sized cache, it will only be evicted when rules_go is upgraded, or perhaps when bazel itself is upgraded; at that point it will eventually be evicted because it is no longer accessed. When it becomes the next-to-evict item, it will have a very old ctime but a recently-accessed atime. In a given bazel-managed repo, there are likely many actions that fall along the spectrum from "frequently invalidated" to "never invalidated". Even consider a leaf package in this repo such as //utils/zstdpool: it was last changed in Dec 2020. The next time it is touched, its cache item will eventually be evicted and thus report an unusually large "age".

Due to this, I think that measuring ctime would make it very difficult to get an accurate view of the age of the least-recently-accessed item.

FS atime or user-land atime

A user-land atime could be a fine solution; however, I was concerned about the overhead for folks who do not care about this metric. I don't have a good view of the average consumer's cache size or number of items in cache, so this could be costly, and I'll defer that decision to you. I had seen some recent issues regarding memory size and did not want to contribute to that :)

With FS-managed atime, I would first argue that relatime does not produce data that is too coarse. With Bazel, a right-sized cache likely holds on the order of one week of content; dropping much lower than that would likely result in active items being evicted prematurely. If you are only retaining one day of content, there is likely a sizable ratio of items being re-computed to re-populate previously evicted cache entries. Additionally, when items are populated and evicted in less than a day, the atime reflects the ctime and is still a valuable datapoint. This all varies by installation and depends on the age gap between source-control branches, but I would still reason that one day of coarseness is an acceptable trade-off.

Lastly, since this is an additional metric and not a piece of core functionality, it could be documented that noatime is not compatible with this metric, and that the metric reports 0 when noatime is set. Requiring at minimum relatime in order to use this metric seems like a proportionate trade-off.

@joeljeske
Contributor Author

I'll wait to update with review comments until a strategy forward is decided 👍

@ulrfa
Contributor

ulrfa commented Apr 5, 2022

confusing alert condition on the metric: you'd need to watch for situations where the oldest cache item is frequently not very old (and how do you define "frequently"?).

Yes, that would be confusing and hard to define "frequently". For me it feels hard to draw conclusions based on ctime.

Due to this, I would think that if measuring ctime it would be very difficult to get an accurate view of age of the least-recently-accessed item.

I agree.

Alternatively, if we stored the last access time in the LRU index, it would always be accurate regardless of the filesystem mount options except immediately after cache startup, since it's initialised with atime.

I guess the overhead (request latency and memory consumption) would be insignificant for most scenarios, even for FindMissingBlobs where a single request could affect lots of entries. But I don't know for sure and would prefer keeping it as lean as possible without additional information in the LRU index.

It could also be confusing if this metric suddenly changes by just restarting bazel-remote. (Trying to store LRU index atime persistently would cause unnecessary complexity in my opinion).

it could be documented that noatime is not compatible with this metric and that this metric reports 0 when noatime is set. It seems like a proportionate trade-off to require at minimum relatime in order to leverage this metric.

Sounds reasonable to me.

and possibly too coarse (1 day) if using relatime with a high-churn cache.

For my use cases it is enough with 1 day accuracy.

I now checked two of my cache instances, the oldest atime was about 4 days ago.

@joeljeske
Contributor Author

It could also be confusing if this metric suddenly changes by just restarting bazel-remote. (Trying to store LRU index atime persistently would cause unnecessary complexity in my opinion).

This is a very good point, and valid enough for me to feel strongly against user-land atime

What are your thoughts @mostynb?

@mostynb
Collaborator

mostynb commented Apr 7, 2022

What are your thoughts @mostynb?

OK, let's go with "now - atime" and document the limitations.

@joeljeske
Contributor Author

Thanks for the guidance here @mostynb and @ulrfa.

The requested updates have been made and ready for review. Thanks!

Collaborator

@mostynb mostynb left a comment

Looks good, this can land once the metric description has been fixed.

@mostynb
Collaborator

mostynb commented Apr 8, 2022

Thanks for persevering with this- merged with the documentation update + more detailed commit message.

@mostynb mostynb closed this Apr 8, 2022
@joeljeske
Contributor Author

Thanks so much! I hope it's useful. What is your typical release cadence, so we can start using this metric?

@mostynb
Collaborator

mostynb commented Apr 8, 2022

Development isn't active enough at the moment to have a regular release frequency, so it depends a bit on how important the updates are. I'll try to take a look at #539 over the weekend (I didn't manage to get to it this week) and see if it's worth trying to land that before making the next release.

@ulrfa
Contributor

ulrfa commented Apr 12, 2022

Thanks @joeljeske and @mostynb!

@joeljeske
Contributor Author

@mostynb do you think we could cut a new release soon? 😃

@mostynb
Collaborator

mostynb commented Apr 24, 2022

Released v2.3.7 with this change. Sorry for the delay. Next up: #542 and #539.
