-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ConcurrentLfu time-based expiry #516
Conversation
fwiw, I implemented expireAfterAccess and expireAfterWrite prior to having the timing wheel. Otherwise I’d likely not have bothered and made them convenience builder methods that set the expireAfter policy. Instead I didn’t redo working code and left both approaches. Just fyi so you don’t think it was a performance need or something, I simply hadn’t known how to do the variable policy originally in a way that fit my design goals. |
Thanks for the insight, it makes total sense. I'm figuring out how to fit this together - from reading your code in Caffeine you generate a Node class with the required backing fields for each policy to minimize storage overhead. For expireAfterAccess, since each of window, probation and protected are already in access order it seems like it will be easy to walk the existing lists from the LRU position until a non-expired item is found (I think you had suggested this last year so I had it in my mind this part is much easier to implement). This requires a node type with 1 additional field for the expiry time (so the overhead is prev+next+time). For expireAfterWrite and expireAfter another linked list is required for time order, so I planned to define a node type with an extra prev/next ptr (so the overhead is prev+next+time+timeprev+timenext) - last night I figured out these can both use the same node class but I hadn't made the mental leap that both could use the timing wheel which would indeed be simpler. It seems like it's still worth implementing expireAfterAccess separately to piggyback on the existing access order lists and compact the node size. |
A caveat for using the access order lists is that the read buffer is allowed to drop events. That could delay the eviction of the expired entry if the head is pinned to a non-expired item, as this user experienced. This shouldn't typically be a problem if the size is bounded or the workload varies, but if the user unintentionally pins the entry then there is unbounded growth. Of course the lossiness of the read buffer is crucial for concurrent read performance, whereas writes are not lossy for correctness so In that user's scenario the solution was to switch to the timing wheel based approach. This corrected the problem because the wheels turn independently and, if it encounters a misplaced entry, it will reschedule the entry rather than halt the expiration early. That makes it a more robust of an algorithm for malicious workloads. I think that the vast majority of expiration should use |
I hadn't anticipated that. I had thought the opposite might occur - the dropped reads would fail to update the access order+expiry time of a Node, and it would be erroneously expired due to having a stale time stamp. Most likely I'm thinking of doing something that won't actually work in practice, and I didn't figure it out yet. Thanks for all these gems of wisdom. I'm still getting my head around the timer wheel algorithm and before I get too deep into that I need to do a few rounds of changes so I can add this cleanly. |
I probably updated the timestamp on the request since the entry would be buffered for some unknown amount of time and figured that fuzziness of the replay order wouldn't skew it too badly. The timer wheel is neat but confusing, and I'm not sure how much of that is inherent or my fault. I didn't have a code reference and, since I wanted to make it as fast as I could for fun, I avoid expensive division / modulus for power-of-two time spans which makes the code harder to follow. I later found other implementations and most seem to use a chain-of-responsibility by having wheel objects where items flow downward. Those seemed to be more generic / flexible, but scatter the algorithm across multiple files and perhaps do extra work. I guess at the cost of being brittle and confusing, mine is crazy fast and perfect for what I needed. I'll be interested to see if you'd approach it differently once you get comfortable with it all. |
It's good to start out with the fast version and I expect I will learn a lot from it. Your frequency sketch code translated to C# really well and that was also a fun exercise. I will try to get deeper into it this week, there are many questions in my mind about how to get this working, and I first need to retrofit the generic node type into ConcurrentLfu without making a mess. I need some focus time to analyze TimerWheel, it looks very interesting and I also want to read the paper you linked to in your source code for background. |
if ((node.GetTimestamp() - time) < 0) | ||
{ | ||
cache.Evict(node); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fyi, prior to the loop the linked list is detached so I rescheduled the node if not expired. Presumably that could be due to a concurrent update so the reordering had not happened yet or similar scenarios. If you don't reschedule then detaching is probably unnecessary. If you do then it avoids an accidental loop, e.g. the overflow bucket (1+ weeks) might scan over items that expire far into the distant future so it would reschedule back into it. Or maybe if it reschedules at the head instead of tail that would resolves it. Just something to review and consider how you prefer handling these scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! You probably just saved me several hours of debugging - so far I had only tested superficially by replicating some of your unit tests and single stepping in the debugger to understand how it works. Now you pointed this out, it seems that I must call schedule here, otherwise nodes would always stay in the same bucket and never cascade down through the wheels. I had distracted myself trying to understand under what conditions a resurrection could occur and missed it.
I'm not sure if this is very useful to you at this point, but I don't think you have a unit test that would detect my mistake (at least in TimerWheelTest.java), since the only test you have calling advance
more than once is advance_backwards
.
I think I have mapped out how to fit everything together, but I need to do a lot more testing. You made a great point in that all time-based policies can be implemented via TimerWheel, so I will start out with that and add the more specialist after access policy later as an optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point! I'll look into adding an explicit unit test for this. I hope that it is implicitly covered indirectly by other tests due to the data integrity checks that are performed after a cache test passes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspected you had it covered - that's effectively what I would like to do next, I will make both a higher-level functional test suite and something that is more like concurrent/stress with constant expiry and check integrity - so good to have that link for inspiration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ported your test case, works great! 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ported your test case, works great! 😄
It was more or less a copy paste of your other test. It wasn't correct for me - since I am not using nano seconds (Duration can use different time sources with different units to get better perf on different platforms), I had to adjust it a bit to get nodes into each wheel.
I have now added a way to assert on the position and print out the wheel within the tests, next job is the schedule tests. Likely it is completely working with what I have, but I am still poking around to understand it.
Baseline
With time-based expiry changes:
|
Last piece missing is to defend against frequent updates saturating the write queue. E.g. if the item is updated within 1 second or whatever, don't enqueue/reschedule. |
{ | ||
var currentExpiry = node.TimeToExpire - current; | ||
node.TimeToExpire = current + expiryCalculator.GetExpireAfterRead(node.Key, node.Value, currentExpiry); | ||
wheel.Reschedule(node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a guard here to avoid rescheduling the same node into the same bucket if it was read twice or more, or if it was very recently read.
{ | ||
var currentExpiry = node.TimeToExpire - current; | ||
node.TimeToExpire = current + expiryCalculator.GetExpireAfterUpdate(node.Key, node.Value, currentExpiry); | ||
wheel.Reschedule(node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a guard here to avoid rescheduling the same node into the same bucket if it was updated twice or more, or if it was very recently changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simplest cheap thing is to compare currentExpiry to node.TimeToExpire, and only reschedule if expiry has changed. But we can still execute the expiryCalculator many times for the same item, in the same maintenance cycle.
One strategy is to store the time scheduled in each node (either count each time maintenance runs, or store the timestamp), and only re-schedule when the time has changed (or has changed by more than some delta like 15ms, the worst-case clock resolution).
Both AccessOrderNode and TimeOrderNode have 2 bytes of padding when the key and value are a reference type, so using this padding we could count 65,535 update cycles without increasing node memory size. Storing an additional long would give more accuracy but will increase size by 8 bytes.
If a node is not rescheduled (assuming approximate time using the method above and a node is not touched for 65,535 update cycles) and has expired, it will remain in the timer wheel until it is accessed or the bucket it is in is expired. That is only a problem if the expiry calculator can generate an expiry far in the future, then a much shorter expiry for the same item. Not the common case, but something someone might do.
There is scope to combine the bools wasAccessed, wasDeleted and the int Position enum (1 byte + 1 byte + 4 bytes) - this could be packed into to single byte. Position should most likely be switched - this will eliminate the padding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fyi, I might be misunderstanding.
If the entry was close to expiration and its lifetime was then extended (e.g. by a write), how would the approaches behave? If the call to the expiryCalculator happens during a maintenance cycle then there is a delay, so would the entry expire (and be visibly absent) or would it be resurrected? The latter could be confusing, especially if a load was triggered that could now be abandoned or would it replace the present value?
The naive but less confusing approach is to perform the expiryCalculator during the caller's operation to set the timestamp when the node's expires. Then on every maintenance cycle the accessed nodes would be reordered, possibly unnecessarily. That's hopefully ignorable since the cost is a cheap O(1) reordering.
Like you mentioned, there is a flaw of a missed reordering to shrink the expiry time. I think that could happen either in either approach. It would only happen to due GetExpireAfterRead
shrinking the duration because those events can be discarded by the buffer, whereas the GetExpireAfterUpdate
won't as writes cannot be lossy. I think its an okay scenario to ignore and rely on size eviction if a problem. A fix could be to promote the read event to instead submit into the write buffer if the duration shrinks by a significant amount. Then it would rarely take the penalty of a write buffer promotion and handle this edge case, so hopefully users wouldn't encounter it often enough to notice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got back to this at last :) But now it's not fresh in my mind.
I have probably diverged somewhat from your code - this is a summary:
- I deferred calling expiryCalculator so that it happens during maintenance. That way lookups only need to compare expiry time to current time, in addition to writing to the buffer. My thinking was to favor optimizing lookups at all costs (I realize this is into the hair-splitting zone). This also means that only 1 thread is ever updating the expiry time, because maintenance is already protected by a lock.
- Each time maintenance runs, the current time is evaluated only once, and all items in the read and write buffers will reflect the timestamp for that maintenance call. This assumes that the delay between reading/writing an item and maintenance running is very small.
- Now, the read/write buffer could contain a hundred instances of the same node, with the access timestamp. Assuming the expiry calculator produces an identical expiry for the same inputs, this would mean needlessly doing the same computation hundreds of times.
In summary, I was thinking about how to defend against the read buffer having a high number of duplicates and optimizing this busy work. Just as you wrote your reply I was studying your code to see if you had a similar guard - I guessed it would be in the onAccess method here.
Another simple thing I considered is just storing a pointer to the last node I processed during maintenance, so if you get a bunch in a row it would collapse them and do expiry calc + reschedule fewer times.
Good point that the shrinking expiry time is anyway possible for dropped reads, I was too deep into the impact of this hypothetical optimization to realize it already occurs. I feel like it's a benign case because it falls back to a misprediction/cache pollution, rather than something bad like unbounded growth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fairly rusty on this all too and happy to learn when you diverge.
I didn't optimize for duplicates and decided that the cost was too small to be worth optimizing. There is some hashing, pointer swaps, etc. which we tried to make very cheap. Then if I had to coalesce duplicates into a single sketch increment, for example, that likely means some extra hashing for building an intermediate multiset. Since each read buffer has a max of 16 elements * NCPUs, that's 256 read buffer entries on a 16-core machine per maintenance run. I think that I tried a naive version and didn't see any difference in a benchmark, which made sense since lossy buffers avoid a client-side penalty, so I couldn't convince myself it would be beneficial and dropped the idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a fix here #603, my update code path can have a similar problem where we might not write the updated node to the write buffer, but that can only occur under load.
The overhead of the time APIs in .NET is inconsistent across operating systems - I currently have user space calls on Windows/Linux and a system call on MacOS (because unfortunately all the built in .NET time APIs are mapped to system calls on Mac).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got some focus time to look at this properly, now I fully understand the consequences of updating expiry during maintenance:
- Any entry dropped when the buffer is full (either a read or update) will have a stale expiry time.
- Maintenance must run after read & update without being deferred. For write heavy workloads, this doesn't change much. For read heavy workloads with infrequent cache activity, maintenance will run on every read increasing overhead.
The fix I did yesterday only addresses part of 2. Handling 1 & 2 with this design will increase complexity. With hindsight updating expiry as part of the lookup as you have done is the better tradeoff, I will fix this completely next weekend.
I hesitated with this feature back in December because I had some gaps, then came back recently and did stress tests where this wasn't apparent. Thanks for pointing me in the right direction, I will get it fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your other option is to honor the entry as expiring early, so it’s still linearizable on queries by not reappearing. That might mean a slightly lower hit rate for a slightly faster read. I’d likely prefer the hit rate but it seems rare enough to be personal preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - I would prefer better hit rate at the expense of a tiny per read overhead. Perf aside, fixing this also makes it much more intuitive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now fixed in v2.5.1. Thanks again for the heads up.
Support all time-based expiry modes for
ConcurrentLfu
.TimerWheel
andIExpiryCalculator