Evaluate block replacement algorithms that receive attention for improving hit ratio #200
I attempted to implement an LIRS cache and conducted performance testing.
Reference: |
Lately, we have encountered unexpected code generation overhead from MIR. As an experiment, @qwe661234 is testing clang as the backend for JIT code generation instead of MIR. Please switch to the git branch |
In addition to the typical benchmark suite, which includes CoreMark and Dhrystone, we also make use of additional programs such as |
It seems that the code on the "wip/research_jit" branch is unable to pass the "make check" test with jit enabled.
|
Nice!
I've noticed that when running the scimark2 benchmark on the 'wip/jit' branch (with MIR as the JIT backend) together with my LIRS cache implementation, I encounter a use-after-free error. There might be an issue in my current LIRS implementation, or there could be a problem in the JIT-related code. I will examine the relevant code more thoroughly to pinpoint the exact cause of the error.
Actually, the function |
After conducting further testing, I have observed that regardless of the cache replacement algorithm used, the cache hit ratio consistently remains above 99.99%. Therefore, I believe that the primary factor influencing performance may be the overhead of cache_get() operations, despite their theoretical O(1) time complexity. Based on the current test results, I would recommend continuing to use LFU as the default cache replacement algorithm. Here is the testing data:
|
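As a note on methodology, here is a minimal sketch of the kind of counters that can be wrapped around every lookup and insertion to obtain hit-ratio and call-count figures like the ones above. The hook names and the toy workload in main() are assumptions for illustration, not the emulator's actual cache API:

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative counters only; the hooks below are assumptions, not the
 * project's actual cache_get()/cache_put() signatures. */
typedef struct {
    uint64_t lookups; /* total lookups (e.g. one per cache_get call)   */
    uint64_t hits;    /* lookups that returned an already-cached block */
    uint64_t inserts; /* insertions on a miss (e.g. one per cache_put) */
} cache_stats_t;

static cache_stats_t stats;

/* Record one lookup; pass the pointer (or NULL) the cache returned. */
static inline void stats_record_lookup(const void *found)
{
    stats.lookups++;
    if (found)
        stats.hits++;
}

static inline void stats_record_insert(void)
{
    stats.inserts++;
}

static void stats_report(void)
{
    double ratio = stats.lookups
                       ? 100.0 * (double) stats.hits / (double) stats.lookups
                       : 0.0;
    printf("lookups=%" PRIu64 " inserts=%" PRIu64 " hit ratio=%.4f%%\n",
           stats.lookups, stats.inserts, ratio);
}

int main(void)
{
    /* Toy workload: 1 in 10000 lookups misses, so the report shows 99.99%.
     * In a real run the hooks would sit inside the cache's get/put paths. */
    for (int i = 0; i < 100000; i++) {
        bool hit = (i % 10000) != 0;
        stats_record_lookup(hit ? &stats : NULL); /* any non-NULL = "found" */
        if (!hit)
            stats_record_insert();
    }
    stats_report();
    return 0;
}
```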
I am concerned that benchmark programs like CoreMark and Dhrystone may not accurately represent real-world workloads, especially when applying this RISC-V instruction emulation within a system emulation context, similar to what semu achieves. Can you suggest empirical approaches for utilizing cache replacement policies in system emulation?
I acknowledge the possibility that CoreMark or Dhrystone may differ significantly from real-world workloads. Perhaps it would be prudent for us to explore alternative benchmarks that more accurately represent actual usage scenarios. I will make an effort to identify more suitable benchmarks for our testing purposes.
If you are exploring semu, please note that there is a fork called semu-c64 which includes additional features like |
libCacheSim is
|
I noticed that the JIT compiler's code cache in the JVM doesn't utilize any cache replacement algorithm. Instead, when the code cache is full, it either disables the JIT compiler or flushes the code cache (depending on the configuration). This surprised me, and I haven't delved into the reasons behind this approach.
My speculation is that in real workloads, the number of cache_get operations is typically much larger than the number of cache_put operations, and it is only during cache_put that a victim may need to be selected for eviction. Therefore, trading a higher cache_get overhead for a lower miss ratio (even though that overhead is usually small) might not be cost-effective. That said, this assumption requires further experimentation and validation.
@qwe661234, can you clarify this?
Sure, I believe the reason the JVM flushes the code cache when it's full is similar to the implementation in QEMU. The runtime-generated machine code is stored in contiguous memory, and its size varies. The concept is akin to OS memory management: if we don't divide the machine code into fixed sizes like memory pages, problems arise. For instance, if the new machine code is larger than the block being replaced, it might cause an overflow; conversely, if it is smaller, it leads to fragmentation. Therefore, flushing the entire code cache when it is full is a simpler way to address these issues.
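To make that concrete, here is a toy sketch of the flush-on-full scheme being described; the names and capacity are made up for the example, not QEMU's or the JVM's actual code. Variable-sized translated blocks are carved out of one contiguous buffer with a bump pointer, and when a new block no longer fits, the whole buffer is reset instead of evicting an individual victim:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CODE_CACHE_SIZE (4u * 1024 * 1024) /* illustrative capacity */

typedef struct {
    uint8_t *base; /* start of the contiguous code buffer        */
    size_t used;   /* bump-pointer offset of the next free byte  */
} code_cache_t;

static void code_cache_init(code_cache_t *cc)
{
    /* A real JIT would obtain this buffer via mmap() with executable
     * permissions; plain malloc() keeps the sketch simple. */
    cc->base = malloc(CODE_CACHE_SIZE);
    cc->used = 0;
}

/* Flushing discards every translated block at once; the caller must also
 * invalidate whatever table maps guest PCs to generated code. */
static void code_cache_flush(code_cache_t *cc)
{
    cc->used = 0;
}

/* Reserve space for one variable-sized block of generated machine code.
 * Because blocks differ in size, there is no in-place replacement: when a
 * new block does not fit, the whole cache is flushed, which sidesteps the
 * overflow and fragmentation problems described above. */
static uint8_t *code_cache_alloc(code_cache_t *cc, size_t size)
{
    if (size > CODE_CACHE_SIZE)
        return NULL; /* block can never fit */
    if (cc->used + size > CODE_CACHE_SIZE)
        code_cache_flush(cc);
    uint8_t *p = cc->base + cc->used;
    cc->used += size;
    return p;
}

int main(void)
{
    code_cache_t cc;
    code_cache_init(&cc);
    if (!cc.base)
        return 1;
    /* Toy usage: allocate a few blocks of different sizes. */
    for (size_t sz = 100; sz <= 300; sz += 100) {
        uint8_t *block = code_cache_alloc(&cc, sz);
        printf("block of %zu bytes at offset %zu\n", sz,
               (size_t) (block - cc.base));
    }
    free(cc.base);
    return 0;
}
```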
I attempted to run |
I can confirm that the problem does not arise from this. We haven't designed any mechanism to flush the cache yet, so if the code cache fills up, the program simply aborts. I will design this later.
For scenarios that require caching, a good cache eviction strategy can improve the cache hit rate. The most commonly used cache eviction strategies are LFU (Least Frequently Used) and LRU (Least Recently Used). LFU needs to maintain the frequency of each element, and for accuracy it has to track the global frequency of elements, which can incur significant space costs. Moreover, once an element has built up a high frequency, it becomes difficult to evict, even though in reality access frequency can change over time. W-TinyLFU, on the other hand, is an excellent cache eviction strategy that takes into account many of the issues encountered in real scenarios; it has the best hit rate and strong adaptability, but it is relatively complex to implement. If such a high hit rate is not required, SLRU could also be considered, as it also achieves a high hit rate (see the sketch below). Reference: |
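For what it's worth, here is a minimal SLRU sketch; the segment sizes, names, and sample trace are made up for illustration. A small probationary segment admits new keys, and a key that is hit again while on probation is promoted into a larger protected segment; when the protected segment overflows, its LRU entry is demoted back to probation rather than evicted outright. Linear scans keep the example short, whereas a real implementation would pair the segments with a hash map and doubly linked lists for O(1) operations:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SEG_MAX 16 /* upper bound on either segment's capacity (illustrative) */

typedef struct {
    uint32_t keys[SEG_MAX]; /* index 0 = MRU ... index len-1 = LRU */
    int len;
    int cap;
} segment_t;

static int seg_find(const segment_t *s, uint32_t key)
{
    for (int i = 0; i < s->len; i++)
        if (s->keys[i] == key)
            return i;
    return -1;
}

static void seg_remove_at(segment_t *s, int idx)
{
    memmove(&s->keys[idx], &s->keys[idx + 1],
            (size_t) (s->len - idx - 1) * sizeof(uint32_t));
    s->len--;
}

/* Insert at the MRU position; if the segment is full, drop its LRU entry
 * first and report it through *evicted. */
static bool seg_push_mru(segment_t *s, uint32_t key, uint32_t *evicted)
{
    bool evict = false;
    if (s->len == s->cap) {
        *evicted = s->keys[s->len - 1];
        s->len--;
        evict = true;
    }
    memmove(&s->keys[1], &s->keys[0], (size_t) s->len * sizeof(uint32_t));
    s->keys[0] = key;
    s->len++;
    return evict;
}

typedef struct {
    segment_t probation;  /* keys seen once since admission          */
    segment_t protected_; /* keys hit at least once while cached     */
} slru_t;

static void slru_init(slru_t *c, int probation_cap, int protected_cap)
{
    memset(c, 0, sizeof(*c));
    c->probation.cap = probation_cap;
    c->protected_.cap = protected_cap;
}

/* Returns true on a hit; on a miss the key is admitted into probation. */
static bool slru_access(slru_t *c, uint32_t key)
{
    uint32_t victim;
    int idx = seg_find(&c->protected_, key);
    if (idx >= 0) { /* hit in protected: refresh recency */
        seg_remove_at(&c->protected_, idx);
        seg_push_mru(&c->protected_, key, &victim);
        return true;
    }
    idx = seg_find(&c->probation, key);
    if (idx >= 0) { /* second hit: promote to protected */
        seg_remove_at(&c->probation, idx);
        if (seg_push_mru(&c->protected_, key, &victim))
            /* protected was full: demote its LRU into the slot just freed */
            seg_push_mru(&c->probation, victim, &victim);
        return true;
    }
    /* miss: admit into probation, evicting its LRU entry if necessary */
    seg_push_mru(&c->probation, key, &victim);
    return false;
}

int main(void)
{
    slru_t c;
    slru_init(&c, 4, 12);
    const uint32_t trace[] = {1, 2, 3, 1, 2, 4, 5, 1};
    for (size_t i = 0; i < sizeof(trace) / sizeof(trace[0]); i++)
        printf("key %u -> %s\n", trace[i],
               slru_access(&c, trace[i]) ? "hit" : "miss");
    return 0;
}
```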
I'd recommend capturing workload traces and running them through a simulator like Caffeine's. That is extremely helpful for seeing how a policy actually performs and for making a choice that balances versatility, simplicity, efficiency, etc.

As a general-purpose library, Caffeine can't make many assumptions, as our users expect us to excel in all cases. That means databases and analytics (MRU-biased), event-driven messaging (LRU-biased), or the typical in-between (mixed LRU/LFU). Some will want concurrency with high throughput, low latency, and linearizable behavior (like computations). Then add O(1) operations, a low memory footprint, features, and an easy-to-use API. The adaptivity allows us to have superior or competitive hit rates in all workloads, and I'll test on traces from new papers in case there are any weaknesses to investigate.

LIRS (v1 and v2) is also excellent in my experience and makes a similar attempt at being a strong general-purpose algorithm. Recently a CMU paper claimed to have significantly better results, but the author didn't show their data. When I probed them and ran their traces, I found that the difference was 2% in a single cherry-picked case, while their policy did miserably across a variety of other real workloads. That's in fact perfectly fine and not meant to discredit their work. It's quite nice to have a variety of choices that are vetted for specialized workloads so that you can make the right tradeoffs. I would only say it is important to validate, as some research papers are more marketing than substance. I'd also run the author's simulators to avoid bias, as I have observed both less-than-honest work and innocent bugs.

If you can narrow down a set of tradeoffs (what you want and don't care about), then I can point you at likely candidates. But if you don't have a strong inclination now, I'd recommend choosing something simple to maintain that is good enough, so that you can swap it out later once you know more, if necessary (aka YAGNI).
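As a starting point for that kind of trace-driven evaluation, here is a small sketch; the one-key-per-line trace format, the policy hook, and the placeholder direct-mapped policy are assumptions for illustration, not an existing tool. The idea is to log the keys the emulator passes to its cache lookups, replay the log against each candidate policy, and compare hit ratios:

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Replays a trace of one numeric key per line against a pluggable policy and
 * prints the resulting hit ratio. The policy callback returns true on a hit
 * and is responsible for admitting/evicting on a miss, so LRU, LFU, SLRU, or
 * LIRS simulators can all be plugged in behind the same interface. */
typedef bool (*policy_access_fn)(void *policy_state, uint32_t key);

static int simulate(FILE *trace, void *policy_state, policy_access_fn access)
{
    uint64_t lookups = 0, hits = 0;
    uint32_t key;
    while (fscanf(trace, "%" SCNu32, &key) == 1) {
        lookups++;
        if (access(policy_state, key))
            hits++;
    }
    if (!lookups)
        return -1;
    printf("lookups=%" PRIu64 " hits=%" PRIu64 " hit ratio=%.4f%%\n",
           lookups, hits, 100.0 * (double) hits / (double) lookups);
    return 0;
}

/* Trivial stand-in policy so the sketch runs on its own: a direct-mapped
 * table of SLOTS entries. Replace it with the policy under evaluation. */
#define SLOTS 1024
typedef struct {
    uint32_t tag[SLOTS];
    bool valid[SLOTS];
} direct_mapped_t;

static bool direct_mapped_access(void *state, uint32_t key)
{
    direct_mapped_t *dm = state;
    uint32_t slot = key % SLOTS;
    if (dm->valid[slot] && dm->tag[slot] == key)
        return true;
    dm->valid[slot] = true; /* miss: install the key in its slot */
    dm->tag[slot] = key;
    return false;
}

int main(void)
{
    static direct_mapped_t dm; /* zero-initialized */
    return simulate(stdin, &dm, direct_mapped_access) == 0 ? 0 : 1;
}
```

Feeding the same trace to each candidate, or to an existing simulator such as Caffeine's or libCacheSim, makes the comparison concrete for this emulator's actual workloads.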
A block replacement algorithm continues to receive attention for the improvement of its hit ratio. Numerous replacement algorithms have been proposed, among which LIRS stands out with its consistently higher hit ratio across various workloads, while incurring low time and space overheads. However, there are still access patterns where LIRS yields a sub-optimal hit ratio and possesses potential for further improvement.
Recently, LIRS2 has been introduced, incorporating a new measure into the LIRS algorithm aimed at reducing its miss ratio. Through extensive experiments on traces from various sources, LIRS2 has consistently demonstrated an enhanced cache miss ratio with minimal overhead.
Quoted from LIRS2: An Improved LIRS Replacement Algorithm:
Reference: