The throughput is inconsistent with the experimental results of the paper #3
"According to my observation, the main performance bottleneck lies in the index update" |
Hi Whiteleaf3er,

Thanks for sharing your finding! I am able to reproduce your results with the current prototype, and I want to share some of my thoughts.

I suppose that you also ran the program with the sample configuration (with which I can reproduce your result), where "fakeIO" is set to one. That setting means no real I/O is issued to any physical storage; it is there for testing purposes. In the paper, we used a SATA SSD as the cache device, which is probably much slower than what we have now (i.e., NVMe devices). Also, be aware of the OS page cache when performing a performance test (we use direct I/O to bypass it).

As for the index update in the DLRU design, my current suspicion is that the overhead is reasonable. The OPS (operations per second) is 163840 / (190536 / 1000000.0) ~ 860K (or 5G/0.19 ~ 26GB/s, since 163840 requests of 32 KiB each amount to 5 GiB). In the current prototype, we have a std::map for both the LBA index and the FP index, along with their LRU lists. I ran a test with a std::map<uint64_t, FP>, inserting 163840 entries, on an i5-7267U CPU @ 3.10GHz; with "-O3", it already takes 39657.8 microseconds (around 1/5 of your test). Given that we have multiple such structures and several memcpy calls around, the results look reasonable to me, though there is surely room for optimization.

Nevertheless, I admit that the complexity of the Austere Cache design, e.g., the re-arrangement of all the items in one single bucket of the sketch, can lead to software inefficiency. That inefficiency can make it inferior to existing designs on platforms with faster devices; it seems now is the time to re-examine the design : ). One possible optimization is to use vectorized instructions, so that the memory movement (within a whole bucket) can be accomplished in one single instruction and could potentially even be pipelined.

Should you have new findings or any questions, please feel free to reach me.

Thanks,
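For reference, below is a minimal sketch of the kind of microbenchmark described above: timing 163840 insertions into a std::map<uint64_t, FP>. The FP struct here is an assumption (a 20-byte, SHA-1-sized fingerprint); the actual fingerprint type in the prototype may differ. Compile with -O3 to match the quoted setting.

```cpp
// Minimal sketch of the std::map insertion microbenchmark described above.
// Assumption: FP is a 20-byte (SHA-1-sized) fingerprint; the prototype's
// actual fingerprint type may differ.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <map>

struct FP {
  uint8_t data[20];  // hypothetical fingerprint payload
};

int main() {
  constexpr uint64_t kEntries = 163840;  // entry count quoted in the thread
  std::map<uint64_t, FP> index;          // stands in for the LBA index

  auto begin = std::chrono::steady_clock::now();
  for (uint64_t lba = 0; lba < kEntries; ++lba) {
    index[lba] = FP{};  // one insertion per synthetic LBA key
  }
  auto end = std::chrono::steady_clock::now();

  auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();
  std::printf("inserted %llu entries in %lld us\n",
              (unsigned long long)kEntries, (long long)us);
}
```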
Hi Qiuping, thank you for your reply! Yes, I used the sample configuration for testing, in which fakeIO=1. Best wishes,
I configured it as described in the paper; the trace is the 7:3 write/read ratio sample provided in this GitHub repository, with a 64 MiB cache size and a 128 MiB WSS.
For hit ratio, AustereCache can indeed perform better.
However, the throughput of AC-D (800 MiB/s) is only about half that of CD-LRU-D (1900 MiB/s), whereas the paper shows AC-D with the better throughput (and only 40 MiB/s there, which is not of the same order of magnitude as my results).
According to my observation, the main performance bottleneck lies in the index update, which may be related to the bucket structure you mentioned.