Reduce raised IPIs #806
Benchmark from http://natsys-lab.blogspot.ru/2018/03/http-requests-proxying.html for Tempesta FW and a 3-byte index.html gives the following perf top:
I didn't do any deep analysis so far, but it seems IPIs hurt performance.
It makes sense to start by introducing an IPI counter in perfstat to analyze how many IPIs we have now and after the fix. It's also necessary to clearly understand how expensive an IPI is, especially in the context of softirq pipelining. E.g. it makes sense to write a simple ping-pong benchmark comparing IPIs with the CMPXCHG instruction.
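For illustration, a minimal sketch of such a micro-benchmark as a throwaway kernel module (not part of Tempesta): it times a synchronous cross-CPU function call, which is delivered via an IPI, against a plain atomic CMPXCHG on a shared word. The target CPU number and the iteration count are arbitrary assumptions.

```c
/*
 * Sketch only: compare the cost of a synchronous cross-CPU IPI
 * (smp_call_function_single()) with a plain atomic cmpxchg.
 */
#include <linux/module.h>
#include <linux/smp.h>
#include <linux/ktime.h>
#include <linux/atomic.h>

static atomic_t shared_word = ATOMIC_INIT(0);

static void
ipi_noop(void *arg)
{
	/* Executed on the remote CPU in IPI context; does nothing. */
}

static int __init
ipi_bench_init(void)
{
	const int target_cpu = 1;	/* assumption: CPU1 is online */
	const int iters = 100000;
	ktime_t t0, t1;
	int i;

	t0 = ktime_get();
	for (i = 0; i < iters; i++)
		smp_call_function_single(target_cpu, ipi_noop, NULL, 1);
	t1 = ktime_get();
	pr_info("IPI:     %lld ns/iter\n",
		ktime_to_ns(ktime_sub(t1, t0)) / iters);

	t0 = ktime_get();
	for (i = 0; i < iters; i++)
		atomic_cmpxchg(&shared_word, i & 1, (i + 1) & 1);
	t1 = ktime_get();
	pr_info("CMPXCHG: %lld ns/iter\n",
		ktime_to_ns(ktime_sub(t1, t0)) / iters);

	return 0;
}

static void __exit
ipi_bench_exit(void)
{
}

module_init(ipi_bench_init);
module_exit(ipi_bench_exit);
MODULE_LICENSE("GPL");
```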
I've got similar results for IPI processing on my virtual machine for the benchmark from http://natsys-lab.blogspot.ru/2018/03/http-requests-proxying.html. About 10% of the time is spent in IPI processing, so the possible reason is excessive IPI generation: almost 80% of the IPIs are generated by two functions. So, it seems that RPS processing does not make a significant contribution to the IPI generation volume. Other interesting observations:
The overhead percentage is relative in the next two outputs.
Balance of SoftIRQ processing between the CPUs: it seems that CPU0 tends to process SoftIRQs in a different context than the other CPUs. The per-CPU counts of IPIs with the corresponding vector were collected as well.
Control benchmarks on VM after the fix gave the following results:
i.e. the time of the IPI handler decreased by a factor of two, but it remains at the top. The counts of IPIs with the same vector were collected again.
A detailed picture for the above counts shows how the IPI handler invocations break down by source; RPS accounts for 34K of them.
Results of the same benchmark test on a bare metal server:
After patch 5577d2a:
In both cases the result with
More tests on the same benchmark were made after the fix, on the VM and on bare metal servers, to compare the picture across individual CPUs. For the tests, dynamic probes were set in the following places of the Tempesta FW code (a generic sketch of attaching such a probe is given after this list):
* the point at Line 1382 in c39e8d7, where we leave the work_queue and regenerate the SoftIRQ while IPIs stay disabled;
* tfw_wq_pop_ticket (tempesta/tempesta_fw/work_queue.c, Line 171 in c39e8d7), where the work_queue is not empty and we get the next message from it.
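A generic sketch of how such a dynamic probe can be attached from a tiny kernel module via the kprobes API (the same counters can also be collected with perf probe). The plain counter and the assumption that the symbol is visible to kprobes are simplifications for illustration only.

```c
/* Sketch only: count entries into tfw_wq_pop_ticket() with a kprobe. */
#include <linux/module.h>
#include <linux/kprobes.h>

static unsigned long hits;	/* plain counter for brevity; per-CPU would be more accurate */

static int
wq_pop_pre(struct kprobe *p, struct pt_regs *regs)
{
	hits++;
	return 0;
}

static struct kprobe kp = {
	.symbol_name	= "tfw_wq_pop_ticket",
	.pre_handler	= wq_pop_pre,
};

static int __init
probe_init(void)
{
	return register_kprobe(&kp);
}

static void __exit
probe_exit(void)
{
	unregister_kprobe(&kp);
	pr_info("tfw_wq_pop_ticket hit %lu times\n", hits);
}

module_init(probe_init);
module_exit(probe_exit);
MODULE_LICENSE("GPL");
```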
In general, it is clear that under load (after the fix), the more time we leave IPIs disabled (i.e. the more time we spend in cycles of SoftIRQ regeneration on the target CPU), the fewer IPIs we get.
I.e. it can be seen that the number of exits from the SoftIRQ regeneration loop (the first probe), taken together with the number of popped messages (the second probe), gives an estimate of the work_queue size per pass; as a result, the estimated per-CPU queue size on the VM is small.
The picture is almost the same on the bare metal server, except that the workload is more evenly distributed between the CPUs (RSS is enabled with 8 queues for 8 CPUs on the incoming Mellanox NIC, and RPS is disabled): on average, for each CPU, the estimated queue size is 2–3. Thus, the above tests showed a low work_queue load, and the individual CPUs take the following share of IPI handler work out of their general workload (depicted above):
So, it seems that the cause of the remaining IPI count is that the queue is underloaded. To further reduce IPIs, it is hypothetically possible to require a minimum number of iterations in the work_queue processing loop before IPIs are re-enabled; a rough sketch of this idea is given below.
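The sketch below only shows the shape of such a loop; tfw_wq_pop(), tfw_handle_msg() and tfw_wq_ipi_enable() are hypothetical stand-ins for the real work_queue API, and WQ_MIN_ITER is an arbitrary tuning value.

```c
#include <linux/processor.h>	/* cpu_relax() */

struct tfw_wq;						/* placeholder queue type */
struct tfw_msg;						/* placeholder message type */
struct tfw_msg *tfw_wq_pop(struct tfw_wq *wq);		/* placeholder pop */
void tfw_handle_msg(struct tfw_msg *msg);		/* placeholder handler */
void tfw_wq_ipi_enable(struct tfw_wq *wq);		/* placeholder IPI re-enable */

#define WQ_MIN_ITER	64

/*
 * Hypothetical sketch: keep draining the (possibly underloaded) work
 * queue for at least WQ_MIN_ITER passes before re-enabling IPIs, so
 * items arriving shortly after the queue looks empty don't cost a
 * fresh IPI.
 */
static void
tfw_wq_drain(struct tfw_wq *wq)
{
	struct tfw_msg *msg;
	int i;

	for (i = 0; i < WQ_MIN_ITER; ++i) {
		/* Pop and handle everything currently queued. */
		while ((msg = tfw_wq_pop(wq)))
			tfw_handle_msg(msg);
		cpu_relax();
	}
	/* Nothing new for a while: let remote CPUs IPI us again. */
	tfw_wq_ipi_enable(wq);
}
```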
Fix #806: Reduce count of IPI during work_queue processing.
Although the above benchmark results on the VM (after the fix) demonstrated a significant decrease of IPI handler time (2x), it still remains rather high (4–5% of all samples during the benchmark). Additional research has revealed that the cause of such behavior lies in the virtualization components (in this particular case the QEMU-KVM hypervisor and the Intel VMX hardware extension) and how they interact with each other.
Also, some tests were made to investigate the influence of VCPU-to-CPU pinning on guest system performance.
Since VCPUs are by default scheduled freely across all host CPUs, situations where several VCPUs end up being processed on one host CPU can often occur. To avoid this default behavior, all of the guest's VCPUs can be pinned to dedicated host CPUs; a sketch of how such pinning works at the system-call level is given below.
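For reference, a QEMU VCPU is just a host thread, so pinning it comes down to setting that thread's CPU affinity. The small user-space sketch below (the TID and CPU number are command-line placeholders) does roughly what tools like taskset or virsh vcpupin do under the hood.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	pid_t tid = argc > 1 ? atoi(argv[1]) : 0;	/* VCPU thread TID; 0 = calling thread */
	int cpu   = argc > 2 ? atoi(argv[2]) : 0;	/* host CPU to pin to */
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(tid, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}
	printf("pinned TID %d to host CPU %d\n", tid, cpu);
	return 0;
}
```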
Also, a series of benchmarks was made with Tempesta inside a VM, on a server whose CPU supports APIC virtualization (APICv). Besides, the improved interrupt virtualization is confirmed by the VM-exit statistics collected with APICv disabled:
and with APICv enabled:
The reports show that the number of VM-exits caused by interrupt delivery drops noticeably when APICv is enabled.
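As a side note, whether the host's kvm_intel module actually has APICv enabled can be read from its module parameter in sysfs; a trivial user-space check could look like this.

```c
/* Read the standard kvm_intel module parameter that reports APICv state. */
#include <stdio.h>

int
main(void)
{
	const char *path = "/sys/module/kvm_intel/parameters/enable_apicv";
	char buf[8] = { 0 };
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("enable_apicv = %s", buf);	/* prints Y or N */
	fclose(f);
	return 0;
}
```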
Currently each task put into a work queue for a different CPU raises an IPI to wake up ksoftirqd on that CPU. It seems IPIs are relatively expensive (?) and it makes sense to raise an IPI only if ksoftirqd on the designated CPU is inactive. It seems we can just wake up the appropriate ksoftirqd the same way wakeup_softirqd() does it (a rough sketch is given below). Also consider integration with Flow Director: probably we can get some benefits from it for our intra-CPU communication. However, the main point for the intra-CPU transport is that we have to communicate via a pair of sockets, client and server, and the pairs aren't stable. (Are they?)
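A rough sketch of that idea, not the actual fix: the per-CPU ksoftirqd task pointer is declared in <linux/interrupt.h>, so a cross-CPU variant of wakeup_softirqd() could look roughly like this (tfw_wq_wake_remote() is a hypothetical helper name).

```c
#include <linux/interrupt.h>	/* DECLARE_PER_CPU(struct task_struct *, ksoftirqd) */
#include <linux/percpu.h>
#include <linux/sched.h>

/*
 * Hypothetical helper: instead of unconditionally raising an IPI for
 * work queued to a remote CPU, wake that CPU's ksoftirqd the same way
 * wakeup_softirqd() does for the local CPU.
 */
static void
tfw_wq_wake_remote(int cpu)
{
	struct task_struct *tsk = per_cpu(ksoftirqd, cpu);

	if (tsk)
		wake_up_process(tsk);
}
```

wake_up_process() may itself trigger a reschedule IPI if the remote CPU is idle, but in the loaded case, which is the one that matters here, ksoftirqd is typically already running and the wakeup returns without sending anything.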