For Algorithm 1, does increasing the number of threads improve or hurt performance? Use data to back up your answer.
Core Count | Average Time (ms) |
---|---|
1 | 0.287470 |
2 | 0.354853 |
4 | 0.325001 |
8 | 0.320909 |
Answer:
Algorithm 1 performed best with 1 thread (0.287 ms) and worst with 2 threads (0.355 ms). Every multithreaded run was slower than the single-threaded one, so increasing the thread count generally hurt performance.
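To make the failure mode concrete, here is a minimal, hypothetical sketch of a naive parallel reduction in this style (the assignment's actual Algorithm 1 may differ): every thread accumulates into an adjacent slot of a shared results array, so several slots fall on a single cache line and every update contends for it.

```c
#include <pthread.h>
#include <stdio.h>

#define N (1 << 20)
#define MAX_THREADS 16

static double data[N];
static double results[MAX_THREADS]; /* adjacent doubles: 8 share one 64-byte line */
static int nthreads = 8;

/* Each thread accumulates straight into its results slot on every
 * iteration, so neighbouring threads keep dirtying the same cache line. */
static void *naive_worker(void *arg) {
    long id = (long)arg;
    for (long i = id; i < N; i += nthreads) /* interleaved partition */
        results[id] += data[i];
    return NULL;
}

int main(void) {
    pthread_t tid[MAX_THREADS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    for (long t = 0; t < nthreads; t++)
        pthread_create(&tid[t], NULL, naive_worker, (void *)t);
    double total = 0.0;
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);
        total += results[t];
    }
    printf("sum = %.0f\n", total);
    return 0;
}
```

Under this layout, adding threads only multiplies the writers fighting over one cache line, which is consistent with the single-threaded run being fastest.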
(a) For Algorithm 6, does increasing the number of threads improve or hurt performance? Use data to back up your answer.
(b) What is the speedup when using 2, 4, 8, and 16 threads?
Core Count | Average Time (ms) |
---|---|
1 | 0.262452 |
2 | 0.165475 |
4 | 0.156006 |
8 | 0.269185 |
(a) Answer:
Algorithm 6 performed best with 4 threads (0.156 ms) and worst with 8 threads (0.269 ms): performance improved up to 4 threads and then degraded.
(b) Speedup:
Taking 1 core as a base:
- 2 cores: 1.59
- 4 cores: 1.68
- 8 cores: 0.97
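For reference, each speedup value is the one-core time divided by the n-core time, speedup(n) = T(1) / T(n); for example, at 2 cores: 0.262452 / 0.165475 ≈ 1.59.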
(a) Using the data for all 6 algorithms, what is the most important optimization: chunking the array, using different result addresses, or padding between result addresses?
(b) Speculate how the hardware implementation causes this result.
Refer to /data/cache-coherence/performance-gem5-mesi:
(a) Answer:
The best performance was seen in Algorithms 5 and 6, the variants that pad between result addresses, indicating that padding between result addresses is the most important optimization.
(b) Answer:
Cache-line contention appears to be the main bottleneck: when several cores write into the same cache line, the coherence hardware forces repeated invalidations, so optimizations that reduce these conflicts significantly improve performance.
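A minimal sketch of the padding fix, assuming a 64-byte cache line and that "padding between result addresses" means giving each thread's result slot its own line (names and sizes here are illustrative, not the assignment's actual code):

```c
#define CACHE_LINE 64 /* typical x86 line size; verify for your target */

/* Pad each thread's partial sum out to a full cache line so a write by
 * one core never invalidates the line holding another core's result. */
struct padded_result {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
};

static struct padded_result padded_results[16]; /* one line per thread */
```

With one slot per line, each core's result line can stay resident in its own L1 cache instead of ping-ponging between cores.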
(a) What is the speedup of Algorithm 1 and Algorithm 6 on 16 cores as estimated by gem5?
(b) How does this compare to real system performance?
Algorithm 1 (Naive), gem5:
Core Count | Average Time (ms) |
---|---|
1 | 0.838 |
2 | 0.779 |
4 | 0.818 |
8 | 0.881 |
16 | 0.917 |
Algorithm 6 (All), gem5:
Core Count | Average Time (ms) |
---|---|
1 | 0.839 |
2 | 0.438 |
4 | 0.236 |
8 | 0.142 |
16 | 0.111 |
(a) Speedup:
- Algorithm 1 (Naive): 0.914
- Algorithm 6 (All): 7.56
(b) Comparison:
For Algorithm 1, gem5's estimated 16-core speedup (0.914) is similar to the real system's speedup at its highest measured thread count (0.897 at 8 threads). For Algorithm 6, gem5's estimate (7.56) is far higher than the real system's 0.975 at 8 threads.
Which optimization has the biggest impact on the hit ratio? Show the data and discuss which algorithms you compared.
Algorithm | L1 Cache Hit Rate (%) | L2 Cache Hit Rate (%) |
---|---|---|
All | 99.45 | 71.70 |
Block | 96.64 | 89.65 |
Chunk | 96.29 | 71.38 |
Chunk-race | 96.72 | 32.36 |
Naive | 93.90 | 90.01 |
Race | 93.91 | 73.40 |
Answer:
While the Naive baseline's L1 hit rate was already high (93.90%), chunking alone raised it to 96.29% (Chunk) and blocking to 96.64% (Block), whereas using different result addresses alone (Race, 93.91%) barely moved it. The "All" algorithm, combining all strategies, showed the highest improvement at 99.45%.
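For reference, the rates above follow the standard definition, hit rate = hits / (hits + misses) × 100%, presumably computed per cache level from the gem5 statistics output.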
Which optimization has the biggest impact on read sharing? Show the data and discuss which algorithms you compared.
Algorithm | Total Read Share |
---|---|
All | 626 |
Block | 2404 |
Chunk | 599 |
Chunk-race | 610 |
Naive | 2380 |
Race | 2385 |
Answer:
Chunking had the most significant impact: every chunked variant (Chunk 599, Chunk-race 610, All 626) shows roughly a 4x reduction in read sharing compared with the non-chunked variants (Naive 2380, Race 2385, Block 2404).
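A sketch of the chunked partition, reusing `data`, `results`, `N`, and `nthreads` from the naive sketch above (the assignment's real chunking may differ): each thread reads one contiguous slice of the input, so a given cache line of `data` is read by at most one core instead of being interleaved across all of them.

```c
/* Contiguous chunk per thread: thread `id` touches only its own slice
 * of `data`, so input cache lines are rarely read by more than one core. */
static void *chunk_worker(void *arg) {
    long id = (long)arg;
    long chunk = N / nthreads;
    long begin = id * chunk;
    long end = (id == nthreads - 1) ? N : begin + chunk; /* last thread takes the remainder */
    for (long i = begin; i < end; i++)
        results[id] += data[i];
    return NULL;
}
```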
Which optimization has the biggest impact on write sharing? Show the data and discuss which algorithms you compared.
Algorithm | Total Write Share |
---|---|
All | 230 |
Block | 192 |
Chunk | 32863 |
Chunk-race | 32776 |
Naive | 32906 |
Race | 32814 |
Answer:
Padding between result addresses had the largest impact: write sharing dropped from roughly 32,800-32,900 in the Chunk, Chunk-race, Naive, and Race variants to 192 (Block) and 230 (All), the two padded variants.
(a) Out of the three characteristics we have looked at—L1 hit ratio, read sharing, or write sharing—which is most important for determining performance? Use average memory latency and overall performance to address this question.
(b) Using data from the gem5 simulations, what hardware characteristic causes the most important optimization to be the most important?
(a) Answer:
The write sharing characteristic has the most significant impact on performance. This is evident from the drastic reduction in execution time when padding between result addresses is applied, as seen in the "All" and "Block" algorithms. While L1 hit ratios were already high and improved only marginally, reducing write sharing significantly reduced average memory latency.
Write Sharing (Total Write Share):
- Chunk/Chunk-race: ~32,000 (high contention)
- All/Block: ~230 (low contention, better performance)
Performance Comparison:
- Naive (16 cores): 0.917 ms (high write sharing, poor scaling)
- All (16 cores): 0.111 ms (low write sharing, excellent scaling)
The drastic difference in execution time correlates with reduced write sharing, indicating that write sharing is the bottleneck.
(b) Answer:
The main hardware characteristic driving this optimization is the cache coherence protocol. When multiple cores write to nearby memory addresses, frequent invalidations and updates generate coherence traffic that increases latency. Padding reduces false sharing by ensuring each core operates on a separate cache line, minimizing coherence overhead (a concrete MESI walk-through follows after the list below).
- False sharing causes frequent invalidations, reducing effective cache utilization.
- Padding aligns data to prevent multiple cores from accessing the same cache line, improving overall efficiency.
This hardware limitation makes minimizing write sharing crucial for multi-threaded performance.
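As a concrete MESI walk-through (the protocol used in these simulations, per the performance-gem5-mesi data): core 0 writes its result, taking the line to Modified and invalidating core 1's copy; when core 1 then writes its own, logically independent result on the same line, it must fetch the line back and invalidate core 0 in turn. Every update ping-pongs the line between caches even though no value is actually shared.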
How does increasing cache-to-cache latency (1, 10, 25 cycles) affect the importance of optimizations?
Answer:
Core Count | Simulated Time (ms) |
---|---|
1 | 0.833 |
2 | 0.427 |
4 | 0.226 |
8 | 0.134 |
16 | 0.102 |
Core Count | Simulated Time (ms) |
---|---|
1 | 0.847 |
2 | 0.456 |
4 | 0.251 |
8 | 0.156 |
16 | 0.127 |
Core Count | Simulated Time (ms) |
---|---|
1 | 0.833 |
2 | 0.573 |
4 | 0.347 |
8 | 0.338 |
16 | 0.349 |
Core Count | Simulated Time (ms) |
---|---|
1 | 0.847 |
2 | 0.101 |
4 | 0.958 |
8 | 0.988 |
16 | 0.176 |
Increasing cache-to-cache latency hurts the Naive algorithm more than Algorithm 6. At 1-cycle latency, simulated time falls as core count rises for both algorithms, though Algorithm 6 improves more consistently while Naive shows a small drop-off. At 25-cycle latency, Naive's degradation is far more pronounced: simulated time jumps at 2, 4, and 8 cores and is highest at 16 cores, while Algorithm 6 suffers much smaller increases.
Impact on Optimizations:
Algorithm 6 benefits most from optimizations such as chunking and padding at higher latencies (e.g., 25 cycles), where every avoided cache-to-cache transfer saves more cycles, so the gains from good cache behavior grow. Naive, which has none of these optimizations, has nothing to soften the latency increase, so cache-friendly optimizations become increasingly crucial as latency rises.