
The throughput is inconsistent with the experimental results of the paper #3

Open
Whiteleaf3er opened this issue Apr 25, 2023 · 3 comments

Comments

@Whiteleaf3er

I configured it as described in the paper. The trace is the 7:3 write/read ratio sample provided in this GitHub repository, with a 64 MiB cache size and a 128 MiB WSS.

For hit ratio, Austere does indeed perform better.

However, the throughput of AC-D (800 MiB/s) is only about half that of CD-LRU-D (1900 MiB/s), whereas in the paper AC-D performs better (and reaches only about 40 MiB/s there, which is not even the same order of magnitude as my results).

According to my observation, the main performance bottleneck lies in the index update, which may be related to the bucket structure you mentioned.

@Whiteleaf3er
Author

"According to my observation, the main performance bottleneck lies in the index update"
Time Elapsed:
Time elapsed for compression: 0
Time elapsed for decompression: 0
Time elapsed for computeFingerprint: 36133
Time elapsed for dedup: 27030
Time elapsed for lookup: 18817
Time elapsed for update_index: 190536
Time elapsed for io_ssd: 5681
Time elapsed for io_hdd: 5024
Time elapsed for debug: 0
The above is the result for CacheDedup CD-LRU. The time elapsed for update_index is still very high, which is confusing, since index updates usually do not take much time.

@fallfish
Owner

fallfish commented Apr 30, 2023

Hi Whiteleaf3er,

Thanks for sharing with us your finding! I'm able to reproduce your results with the current prototype. I want to share some of my thoughts with you.

I suppose that you also ran the program with the sample configuration (with which I can reproduce your result), where I set "fakeIO" to 1. That setting means no real I/O is issued to any physical storage device - it is there for testing purposes. In the paper, we used a SATA SSD (probably also much slower than what we have now, i.e., NVMe devices) as the cache device. Also, be aware of the OS page cache when performing a performance test (we use direct I/O to bypass it).

As for the index update in the DLRU design, my current suspicion is that the overhead is reasonable. The OPS (operations per second) is 163840 / (190536 / 1000000.0) ~ 860K (or 5G / 0.19 ~ 26 GB/s). In the current prototype, we use std::map for both the LBA index and the FP index, along with their LRU lists. I ran a test with std::map<uint64_t, FP> and 163840 entries inserted on an i5-7267U CPU @ 3.10GHz. With "-O3", the time consumed is already 39657.8 microseconds (around 1/5 of your test). Given that we have multiple such structures and several memcpy calls around, the results look reasonable to me, though there is surely room for optimization.

Nevertheless, I admit that the complexity of the Austere Cache design, e.g., the re-arrangement of all the items in a single bucket of the sketch, can lead to software inefficiency. That inefficiency can make it inferior to existing designs on platforms with faster devices; it seems now is the time to re-examine the design : ). One possible optimization is to use vectorized instructions, so that the memory movement (of a whole bucket) can be accomplished in a single instruction and potentially even pipelined.

Should you have new findings or any questions, please feel free to reach me.

Thanks,
Qiuping

@Whiteleaf3er
Copy link
Author

Hi Qiuping,

Thank you for your reply! Yes, I used the sample configuration for testing, in which fakeIO=1.
Whether for CD-LRU or Austere, the main throughput bottleneck lies in update_index; for Austere especially, the time overhead of update_index is high. I will try to optimize it. Thank you very much for open-sourcing your work.

best wishes,
Whiteleaf3er
