Releases: UTSASRG/Scaler
v0.2.4
v0.2.3
v0.2.3 implemented thread attribution Approach 1 as described in #86. Approach 1 is logical and easier to prove correct than Approach 2. The implementation does not introduce data races while still maintaining low overhead.
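The approaches themselves are only described in #86, but as a rough, hypothetical sketch of how per-thread attribution can stay race-free on the hot path, each thread can accumulate time into its own thread-local slot so no synchronization is needed (the `attribute` helper and all constants below are illustrative, not Scaler's actual code):

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Per-thread accumulator: no locking is needed because each thread
 * only ever touches its own copy. */
static _Thread_local uint64_t tl_api_time_ns;

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical hook tail: attribute elapsed time to the calling thread. */
static void attribute(uint64_t start_ns) {
    tl_api_time_ns += now_ns() - start_ns;
}

static void *worker(void *arg) {
    (void)arg;
    uint64_t t0 = now_ns();
    struct timespec d = {0, 5 * 1000 * 1000};  /* stands in for real API work */
    nanosleep(&d, NULL);
    attribute(t0);
    printf("thread attributed %llu ns\n", (unsigned long long)tl_api_time_ns);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return 0;
}
```

Per-thread totals collected this way can be aggregated after the threads finish, which keeps the hot path free of shared writes.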
Proved effectiveness through simple examples.
Added thread imbalance examples to the paper.
Improved performance and made Scaler faster than all other tools.
Improved the benchmark suites to include kernel memory measurement and made results more stable by adding delays between benchmarks.
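As a minimal sketch of what this involves, assuming kernel-side memory is sampled from `/proc/meminfo` fields such as `Slab` and `KernelStack` and that a fixed settle delay is inserted between runs (the real suite's mechanism and delay value are not documented here):

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read one field (in kB) from /proc/meminfo, e.g. "Slab" or "KernelStack";
 * returns -1 if the field is not found. */
static long meminfo_kb(const char *field) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) return -1;
    char line[256];
    long val = -1;
    size_t len = strlen(field);
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, field, len) == 0 && line[len] == ':') {
            sscanf(line + len + 1, "%ld", &val);
            break;
        }
    }
    fclose(f);
    return val;
}

int main(void) {
    const unsigned settle_seconds = 30;  /* assumed delay, not the suite's real value */
    for (int run = 0; run < 3; run++) {
        /* run_one_benchmark();  <- placeholder for launching one benchmark */
        printf("run %d: Slab=%ld kB, KernelStack=%ld kB\n",
               run, meminfo_kb("Slab"), meminfo_kb("KernelStack"));
        sleep(settle_seconds);  /* let the system settle before the next run */
    }
    return 0;
}
```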
v0.2.2
v0.2.2 implemented thread attribution Approach 2 as described in #86. We decided not to add outlier removal, as thread attribution already helps make performance bugs more apparent in Scaler's output.
With this implementation, we are able to present time attribution (thread attribution, Join/Wait time attribution) as a major contribution and make the effectiveness experiment results more explainable.
We also identified 4 new examples to prove effectiveness.
v0.2.1
v0.2.1 has a series of improvements to help us understand Scaler's data.
v0.2.1 contains a newly implemented benchmark toolkit. The toolkit supports automated experiments, automated artifact collection, benchmarking across multiple machines, and file integrity checks, and it provides a unified, easily expandable interface to run PARSEC and real applications together. Currently, the benchmark lets us test PARSEC, httpd, nginx, memcached, redis, mysql, and postgresql. Scaler reports performance results on all of those applications except postgresql, which seems to have issues caused by its multi-process design. We also identified that the benchmark machine has CPU and disk errors; the postgresql problem is probably not Scaler's problem.
v0.2.1 also has a series of new Python scripts to help interpret the benchmark results.
v0.2.1 removed support for the previous Fine-Grained-Dynamic-Sampling (FGDS) method. The main problem is that we cannot justify that FGDS will not affect the correctness of the results. Some details can be seen in #85.
v0.2.0
v0.2.0 greatly improved Scaler's performance. A new counting method was introduced, and the pre-hook overhead was significantly reduced.
v0.2.0 also supports adaptive timing with customizable strategies; this greatly reduces the overall overhead.
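The notes do not spell out the strategy, but a minimal sketch of the count-everything, time-a-sample idea (assuming a fixed sampling period and a plain C hot path; the real counting path is in assembly and the real strategy is configurable) could look like this:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define SAMPLE_EVERY 64   /* assumed sampling period, not Scaler's actual value */

typedef struct {
    uint64_t calls;        /* incremented on every invocation (cheap path) */
    uint64_t sampled_ns;   /* time measured on sampled invocations only */
    uint64_t samples;
} api_stat_t;

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Pre/post hooks for one intercepted API symbol. */
static void pre_hook(api_stat_t *s, uint64_t *start_ns) {
    s->calls++;
    *start_ns = (s->calls % SAMPLE_EVERY == 0) ? now_ns() : 0;
}

static void post_hook(api_stat_t *s, uint64_t start_ns) {
    if (start_ns) {                    /* only sampled calls pay for timing */
        s->sampled_ns += now_ns() - start_ns;
        s->samples++;
    }
}

/* Estimated total time = average sampled cost * total call count. */
static double estimated_total_ns(const api_stat_t *s) {
    return s->samples ? (double)s->sampled_ns / s->samples * s->calls : 0.0;
}

int main(void) {
    api_stat_t s = {0};
    for (int i = 0; i < 10000; i++) {
        uint64_t t0;
        pre_hook(&s, &t0);
        /* ...intercepted API body would run here... */
        post_hook(&s, t0);
    }
    printf("calls=%lu estimated total=%.0f ns\n",
           (unsigned long)s.calls, estimated_total_ns(&s));
    return 0;
}
```

Keeping the per-call path to a counter increment while timing only a sample is what lets the counts stay exact while the timing overhead shrinks.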
Overhead tested on PARSEC benchmark:
- ASM Counting
- Runtime: 1.5%
- Memory: 2.5%
- ASM Counting + C pre-hook (invoked on every API call; this is the maximum possible overhead for the C pre-hook)
- Runtime: 6.39%
- Memory: 3%
- ASM Counting + C pre-hook + C post-hook (invoked on every API call; this is the maximum possible overhead for Scaler)
- Runtime: 23.1%
- Memory: 2.6%
- ASM Counting + C pre-hook + C post-hook + Adaptive counting (the overhead reported in our paper; this may change based on experiment results, but we can generally keep it under 5%)
- Runtime: 1.94%
- Memory: 1.18%
PARSEC runtime and memory overhead charts.
v0.1.9
v0.1.8
v0.1.7
v0.1.7 is an extension of v0.1.2 with performance improvements and jmp handling.
Improvements include:
- Runtime:
- Reduced function calls.
- Instruction optimization.
- Branch prediction optimization.
- Removed dynamic compilation.
- Memory:
- Reduced unnecessary structures.
- Stability:
- Handled the jmp problem discovered in v0.1.6.
v0.1.7's commits also include an experimental way to map an address to an id in O(1) time; that version also reduced the number of jmp instructions. Unfortunately, the trade-off between runtime and memory was hard to balance, so that version was abandoned. The new v0.1.7 is similar to v0.1.2 but with significantly less overhead.
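The experimental scheme itself is not documented here, but one common way to get O(1) address-to-id lookup is to direct-index a fixed-stride table, paying memory for every possible slot, which matches the runtime-versus-memory tension mentioned above. The sketch below is purely hypothetical; the stride, field names, and addresses are made up:

```c
#include <stdint.h>
#include <stdio.h>

#define PLT_ENTRY_STRIDE 16   /* assumed size of one PLT entry in bytes */

typedef struct {
    uintptr_t plt_base;       /* start address of the PLT section */
    size_t    entry_count;    /* number of entries covered by the table */
} plt_index_t;

/* Return the symbol id for an address inside the PLT, or -1 if it is
 * outside the covered range: a subtraction and a division, no search. */
static long addr_to_id(const plt_index_t *idx, uintptr_t addr) {
    if (addr < idx->plt_base) return -1;
    size_t id = (size_t)(addr - idx->plt_base) / PLT_ENTRY_STRIDE;
    return (id < idx->entry_count) ? (long)id : -1;
}

int main(void) {
    plt_index_t idx = { .plt_base = 0x401020, .entry_count = 128 };
    printf("id of 0x401060 = %ld\n", addr_to_id(&idx, 0x401060));  /* 0x40 / 16 = 4 */
    return 0;
}
```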
Evaluation on parsec:
v0.1.6
v0.1.6 mainly focuses on reducing memory overhead.
Optimizations were performed in two aspects:
- Optimized non-hook code to make it more memory efficient.
- Implemented a more efficient hook method.
This new method removes the need for dynamic compilation and reduces the memory consumption of the hook part. However, during benchmark testing I found that some user libraries also use jmp to reach PLT entries. Although the majority of functions use the standard pattern (call xxx@PLT), I cannot detect the non-standard ones, and when jmp is used the program crashes. I had to revert to the original method. The problem is illustrated in #45.
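As a hypothetical illustration of the two patterns (assuming the hook depends on the normal call/return path through the PLT, which these notes do not spell out; `api_call` and both wrapper names are made up), compiling the following with `gcc -O2 -fPIC -S` typically shows `call api_call@PLT` in the first function and a tail-call `jmp api_call@PLT` in the second:

```c
/* api_call is assumed to live in a shared library, so both functions
 * go through the PLT. */
extern int api_call(int x);

int wrapper_call(int x) {
    /* The result is modified, so the compiler must keep call + ret:
     *     call api_call@PLT
     *     add  $1, %eax
     *     ret
     * Control returns here after the library call. */
    return api_call(x) + 1;
}

int wrapper_tail(int x) {
    /* Plain forwarding is usually emitted as a tail call:
     *     jmp api_call@PLT
     * Control never comes back to wrapper_tail; this is the
     * non-standard pattern the note above refers to. */
    return api_call(x);
}
```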
The memory overhead is reduced significantly compared to v0.1.2.
I tested on swaptions (the program that used the most memory in previous tests), and the memory overhead dropped from 4.6x to 1.08x.
In v0.1.7, I will revert the hook part to the original version.