
Reduce raised IPIs #806

Closed
krizhanovsky opened this issue Aug 17, 2017 · 8 comments
Assignees
Labels
performance, question (Questions and support tasks)

Comments

@krizhanovsky
Contributor

krizhanovsky commented Aug 17, 2017

Currently, each task put into a work queue for a different CPU raises an IPI to wake up ksoftirqd on that CPU. IPIs seem to be relatively expensive (?), so it makes sense to raise an IPI only if ksoftirqd on the designated CPU is inactive. It seems we could simply wake up the appropriate ksoftirqd, just like wakeup_softirqd() does.
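
A minimal sketch of the idea (hypothetical names, not the actual Tempesta FW code): the producer raises an IPI only when softirq processing on the target CPU has to be (re)started; while the consumer is draining its queue, the per-CPU flag stays cleared and no IPIs are sent.

```c
/*
 * Sketch only: wq_push(), wq_ipi_needed and wq_irq_work are hypothetical
 * names used for illustration, not the actual Tempesta FW API.
 */
#include <linux/atomic.h>
#include <linux/irq_work.h>
#include <linux/percpu.h>

void wq_push(void *item, int cpu);	/* hypothetical: enqueue to @cpu's queue */

/* 1 means "the consumer is idle, an IPI is needed to wake it up". */
static DEFINE_PER_CPU(atomic_t, wq_ipi_needed) = ATOMIC_INIT(1);
/* Assumed to be initialized elsewhere with the softirq-raising callback. */
static DEFINE_PER_CPU(struct irq_work, wq_irq_work);

static void wq_push_and_kick(void *item, int cpu)
{
	wq_push(item, cpu);

	/* Raise an IPI only if the consumer on @cpu is not already running. */
	if (atomic_cmpxchg(per_cpu_ptr(&wq_ipi_needed, cpu), 1, 0))
		irq_work_queue_on(per_cpu_ptr(&wq_irq_work, cpu), cpu);
}

/* The consumer sets the flag back just before it goes idle. */
static void wq_consumer_going_idle(void)
{
	atomic_set(this_cpu_ptr(&wq_ipi_needed), 1);
}
```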

Also, consider integration with Flow Director - probably we can get some benefit from it for our intra-CPU communication in general. However, the main point about the intra-CPU transport is that we have to communicate via a pair of sockets, client and server, and the pairs aren't stable. (Are they?)

@krizhanovsky krizhanovsky added this to the 1.0 WebOS milestone Aug 17, 2017
@krizhanovsky krizhanovsky changed the title from "Redice raised IPIs" to "Reduce raised IPIs" Aug 17, 2017
@krizhanovsky krizhanovsky added the question Questions and support tasks label Feb 18, 2018
@krizhanovsky krizhanovsky self-assigned this Mar 22, 2018
@krizhanovsky krizhanovsky modified the milestones: backlog, 0.6 KTLS Mar 22, 2018
@krizhanovsky
Contributor Author

krizhanovsky commented Apr 18, 2018

The benchmark from http://natsys-lab.blogspot.ru/2018/03/http-requests-proxying.html for Tempesta FW with a 3-byte index.html gives the following perf top:

     8.99%  [kernel]       [k] call_function_single_interrupt
     2.25%  [kernel]       [k] __default_send_IPI_dest_field
     1.74%  [kernel]       [k] queued_spin_lock_slowpath
     1.40%  [kernel]       [k] _raw_spin_lock_irqsave
     1.33%  [kernel]       [k] skb_release_data
     1.22%  [kernel]       [k] __inet_lookup_established
     1.03%  [kernel]       [k] tcp_ack
     1.01%  [kernel]       [k] syscall_return_via_sysret
     0.78%  [kernel]       [k] tcp_transmit_skb
     0.75%  [kernel]       [k] _raw_spin_lock
     0.72%  [kernel]       [k] native_apic_mem_write

I haven't done any deep analysis so far, but it seems IPIs hurt performance.

@krizhanovsky
Contributor Author

krizhanovsky commented Jul 17, 2018

It makes sense to start by introducing an IPI counter in perfstat to analyze how many IPIs we have now and after the fix. Also, it's necessary to clearly understand how expensive an IPI is, especially in the context of softirq pipelining. E.g., it makes sense to write a simple ping-pong benchmark comparing IPIs with the CMPXCHG instruction.
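
A minimal sketch of such a benchmark could be a standalone test module like the one below (hypothetical code, not part of Tempesta FW): it measures the round trip of a synchronous function-call IPI to another CPU and, as a reference point, the cost of a CMPXCHG-based handshake on a shared atomic.

```c
#include <linux/module.h>
#include <linux/smp.h>
#include <linux/ktime.h>
#include <linux/atomic.h>
#include <linux/cpumask.h>

#define PP_ITERS	100000

static atomic_t pp_flag = ATOMIC_INIT(0);

static void pp_ipi_fn(void *unused)
{
	atomic_set(&pp_flag, 0);	/* remote side just acknowledges */
}

static int __init pp_init(void)
{
	int i, cpu;
	u64 t0;

	cpu = cpumask_next(get_cpu(), cpu_online_mask);
	put_cpu();
	if (cpu >= nr_cpu_ids)
		return -ENODEV;

	/* 1. IPI round trip: synchronous smp_call_function_single(). */
	t0 = ktime_get_ns();
	for (i = 0; i < PP_ITERS; i++) {
		atomic_set(&pp_flag, 1);
		smp_call_function_single(cpu, pp_ipi_fn, NULL, 1);
	}
	pr_info("IPI round trip: %llu ns/iter\n",
		(unsigned long long)((ktime_get_ns() - t0) / PP_ITERS));

	/*
	 * 2. CMPXCHG handshake on the same atomic. Without a peer thread
	 * spinning on @cpu this only shows the local lower bound; a full
	 * ping-pong would bounce the cache line between two CPUs.
	 */
	t0 = ktime_get_ns();
	for (i = 0; i < PP_ITERS; i++) {
		atomic_cmpxchg(&pp_flag, 0, 1);
		atomic_cmpxchg(&pp_flag, 1, 0);
	}
	pr_info("cmpxchg pair: %llu ns/iter\n",
		(unsigned long long)((ktime_get_ns() - t0) / PP_ITERS));

	return 0;
}

static void __exit pp_exit(void) { }

module_init(pp_init);
module_exit(pp_exit);
MODULE_LICENSE("GPL");
```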

@aleksostapenko
Contributor

aleksostapenko commented Sep 18, 2018

I've got similar results for IPI processing on my virtual machine; perf record/report output:

+    10.37%      [k] call_function_single_interrupt     [kernel.vmlinux]
+    2.64%       [k] update_stack_state                 [kernel.vmlinux]
+    1.62%       [k] unwind_next_frame.part.5           [kernel.vmlinux]
+    1.60%       [k] tfw_http_parse_resp                [tempesta_fw]
+    1.55%       [k] memcmp                             [kernel.vmlinux]
+    1.39%       [k] tcp_ack                            [kernel.vmlinux]
+    1.34%       [k] check_memory_region                [kernel.vmlinux]
+    1.25%       [k] skb_release_data                   [kernel.vmlinux]
+    1.20%       [k] __module_address                   [kernel.vmlinux]
+    1.17%       [k] memset_erms                        [kernel.vmlinux]
+    1.04%       [k] __inet_lookup_established          [kernel.vmlinux]
+    0.83%       [k] tcp_transmit_skb                   [kernel.vmlinux]
+    0.82%       [k] __skb_fragment                     [tempesta_fw]
+    0.80%       [k] tcp_write_xmit                     [kernel.vmlinux]

for the benchmark from http://natsys-lab.blogspot.ru/2018/03/http-requests-proxying.html.

We spend about 10% of the time in IPI processing, so the possible cause is excessive IPI generation. It is also worth mentioning that RPS was enabled during the test, so part of the IPIs were generated to distribute packets among SoftIRQs of different CPUs as part of the Packet Steering functionality.

Almost 80% of the IPI generation (the x2apic_send_IPI() function) is produced by the irq_work_queue_on() (Tempesta FW) and net_rps_send_ipi() (RPS functionality) interfaces:

-    0.11%  [k] x2apic_send_IPI  [kernel.vmlinux]
   - x2apic_send_IPI
      - 79.55% native_send_call_func_single_ipi
         + 71.45% irq_work_queue_on
         - 28.55% generic_exec_single
              smp_call_function_single_async
            + net_rps_send_ipi

Both of these functions generate a CALL_FUNCTION_SINGLE_VECTOR IPI, processed in the
call_function_single_interrupt() handler (which is at the top of the perf output). We can compare
the relative contributions to the IPI generation volume from Tempesta FW and from the RPS
functionality (the overhead percentages are relative, not absolute, in the following perf output):

-   88.91%  [k] irq_work_queue_on  [kernel.vmlinux]
   - irq_work_queue_on
      + 90.26% ss_wq_push
      + 9.74% ss_send
-   11.09%  [k] net_rps_send_ipi   [kernel.vmlinux]
   - net_rps_send_ipi
      + 52.11% net_rx_action
      + 31.96% __do_softirq
      + 15.92% process_backlog

So, it seems that RPS processing does not make a significant contribution to the IPI generation volume.

Other interesting perf outputs:
CPU balance (the NIC's hard IRQ is bound to CPU0):

  Overhead  CPU
+   26.49%  003
+   26.10%  001
+   24.75%  000
+   22.67%  002

The overhead percentages are relative in the next two outputs.
Balance of ksoftirqd:

  Overhead  Command
+   74.39%  ksoftirqd/0
+    9.19%  ksoftirqd/2
+    8.32%  ksoftirqd/3
+    8.09%  ksoftirqd/1

Balance of SoftIRQ processing (via sampling of __do_softirq()):

  Overhead  CPU
-   31.62%  001
   - __do_softirq
      + 60.86% irq_exit
      + 17.60% do_softirq_own_stack
      + 9.93% run_ksoftirqd
      + 7.31% smp_call_function_single_interrupt
      + 4.30% __GI___setsockopt
-   27.26%  002
   - __do_softirq
      + 65.60% irq_exit
      + 17.18% do_softirq_own_stack
      + 9.20% run_ksoftirqd
      + 3.62% smp_call_function_single_interrupt
      + 3.19% smpboot_thread_fn
      + 1.21% __GI___setsockopt
-   26.54%  003
   - __do_softirq
      + 54.90% irq_exit
      + 24.70% do_softirq_own_stack
      + 11.77% run_ksoftirqd
      + 7.61% smp_call_function_single_interrupt
      + 1.02% __GI___setsockopt
-   14.57%  000
   - __do_softirq
      + 67.11% run_ksoftirqd
      + 21.14% irq_exit
      + 6.95% do_softirq_own_stack
      + 4.81% smpboot_thread_fn

It seems that CPU0 tends to process SoftIRQs in the context of ksoftirqd (the run_ksoftirqd() function),
while the other CPUs spend more time processing SoftIRQs in IRQ context (the irq_exit() function in
the sampling).

The counts of IPIs with the CALL_FUNCTION_SINGLE_VECTOR vector for the benchmark test specified above
(with 31K RPS) are:

  • generated: 1366K;
  • handled: 1347K;

(collected via perf stat for the irq_vectors:* static tracepoints and a dynamic kprobe on the x2apic_send_IPI() function with a filter on the vector variable for CALL_FUNCTION_SINGLE_VECTOR).

@aleksostapenko
Contributor

aleksostapenko commented Oct 2, 2018

Control benchmarks on the VM after the fix gave the following results:
perf record/report:

4.65%  [k] call_function_single_interrupt
3.07%  [k] update_stack_state
2.03%  [k] pvclock_clocksource_read
1.92%  [k] unwind_next_frame.part.5
1.90%  [k] memcmp
1.75%  [k] tfw_http_parse_resp
1.56%  [k] check_memory_region
1.52%  [k] tcp_ack
1.52%  [k] __skb_fragment
1.26%  [k] __module_address
1.23%  [k] skb_release_data
1.09%  [k] __inet_lookup_established
1.04%  [k] memset_erms
0.90%  [k] tcp_transmit_skb
0.85%  [k] syscall_return_via_sysret
0.84%  [k] __raw_callee_save___pv_queued_spin_unlock
0.80%  [k] e1000_xmit_frame

i.e., the time of the IPI handler decreased by a factor of 2, but it remains at the top.

The counts of IPIs with vector CALL_FUNCTION_SINGLE_VECTOR are:

  • generated: 783K;
  • handled: 764K;

A detailed picture for the counts above (via perf stat):

. . .
764103      irq_vectors:call_function_single_entry
764103      irq_vectors:call_function_single_exit
. . .
782685      probe:x2apic_send_IPI
. . .
481285      probe:irq_work_queue_on
301322      probe:net_rps_send_ipi

i.e., the count of IPI handler (call_function_single_entry) calls decreased by a factor of 1.8, and the same holds for IPI generation (x2apic_send_IPI), which consists of two components:

  1. net_rps_send_ipi - IPIs generated by the Packet Steering functionality - as expected, it has hardly changed;
  2. irq_work_queue_on - our IPIs, generated by Tempesta FW (via irq_work); the count decreased by a factor of 2.4.

RPS: 34K.

@aleksostapenko
Contributor

Results of the same benchmark test on a bare-metal server.
Before patch 5577d2a:
perf top:

2.42%  [tempesta_fw]        [k] tfw_http_parse_resp
1.81%  [kernel]             [k] skb_release_data
1.26%  [kernel]             [k] tcp_ack
1.23%  [mlx4_en]            [k] mlx4_en_process_rx_cq
1.18%  [kernel]             [k] _raw_spin_lock
1.14%  [tempesta_fw]        [k] tfw_http_parse_req
1.11%  [kernel]             [k] __inet_lookup_established
. . .
0.17%  [kernel]             [k] call_function_single_interrupt

perf stat with tracepoints:

17141378      irq_vectors:call_function_single_entry
. . . 
11729040      probe:irq_work_queue_on
 5469709      probe:net_rps_send_ipi

After patch 5577d2a:
perf top:

2.53%  [tempesta_fw]       [k] tfw_http_parse_resp
1.76%  [kernel]            [k] skb_release_data
1.30%  [mlx4_en]           [k] mlx4_en_process_rx_cq
1.28%  [kernel]            [k] tcp_ack
1.25%  [kernel]            [k] _raw_spin_lock
1.12%  [kernel]            [k] __inet_lookup_established
1.12%  [kernel]            [.] syscall_return_via_sysret
.  .  .
0.15%  [kernel]            [k] call_function_single_interrupt

perf stat with tracepoints:

13477653      irq_vectors:call_function_single_entry
. . .
 7371283      probe:irq_work_queue_on
 6162090      probe:net_rps_send_ipi

In both cases a result of 230K - 240K RPS was observed.
It seems that on the bare-metal server there is no problem with too much time spent in the call_function_single_interrupt handler.

@aleksostapenko
Contributor

aleksostapenko commented Oct 4, 2018

More tests on the same benchmark were made after the fix, on VM and bare-metal servers, to analyze the picture per CPU. For the tests, dynamic probes were set at the following places in the Tempesta FW code:
* ss_ipi (entry);
* ss_tx_action (entry);
* ss_tx_action_1 - at
  raise_softirq(NET_TX_SOFTIRQ);
  to count the exits where messages remain in the work_queue and we regenerate the SoftIRQ while IPIs stay disabled;
* tfw_wq_pop_ticket - at
  memcpy(buf, &q->array[tail & QMASK], WQ_ITEM_SZ);
  to count the cases when the work_queue is not empty and we get the next message from it.

In general, it is clear that under load (after the fix), the more time we leave IPIs disabled (i.e., the more time we spend in cycles of SoftIRQ regeneration on the target CPU), the fewer IPIs we get.
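
For reference, a rough sketch of the drain loop that the probes above hook into (hypothetical helper names, not the actual ss_tx_action() code):

```c
#include <linux/interrupt.h>

/* Hypothetical helpers, standing in for the real work queue API: */
struct wq;				/* per-CPU lock-free work queue     */
int  wq_pop(struct wq *q, void *buf);	/* 0 on success, non-zero when empty */
bool wq_empty(struct wq *q);
void wq_ipi_enable(struct wq *q);	/* allow producers to send IPIs     */
void handle_work_item(void *buf);

#define WQ_ITEM_SZ 64			/* assumed item size, illustration only */

static void tx_action_sketch(struct wq *q)
{
	char buf[WQ_ITEM_SZ];

	/* Drain the queue; probe:tfw_wq_pop_ticket fires on each pop. */
	while (!wq_pop(q, buf))
		handle_work_item(buf);

	if (!wq_empty(q)) {
		/*
		 * Raced with a producer: keep IPIs disabled and let the
		 * SoftIRQ run again (this is where probe:ss_tx_action_1
		 * would fire).
		 */
		raise_softirq(NET_TX_SOFTIRQ);
		return;
	}

	/* Queue drained: re-enable IPIs so producers can wake us up again. */
	wq_ipi_enable(q);
}
```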
The results of tests on VM, collected by Perf, are the following:
General:

310K probe:tfw_wq_pop_ticket
 10K  probe:ss_tx_action_1
 54K  probe:ss_tx_action
 45K  probe:ss_ipi

I.e., it can be seen that the number of exits from ss_tx_action with IPIs disabled is less than 20% (10K/54K) of the total number of ss_tx_action calls; in the remaining 80% of cases IPIs were re-enabled, since the work_queue was empty.

probe:ss_tx_action:

Overhead       Samples  CPU  Symbol            Shared Object
 26.25%         14191  003  [k] ss_tx_action  [tempesta_fw]  
 25.70%         13892  001  [k] ss_tx_action  [tempesta_fw]
 24.33%         13150  000  [k] ss_tx_action  [tempesta_fw]
 23.72%         12821  002  [k] ss_tx_action  [tempesta_fw]

probe:ss_tx_action_1

Overhead       Samples  CPU  Symbol          Shared Object
54.73%          5770  000  [k] ss_tx_action  [tempesta_fw]
17.23%          1816  003  [k] ss_tx_action  [tempesta_fw]
15.07%          1589  001  [k] ss_tx_action  [tempesta_fw]
12.97%          1367  002  [k] ss_tx_action  [tempesta_fw]

The ss_tx_action and ss_tx_action_1 tracing per CPU shows that the situation is consistently not identical across CPUs: on CPU0 this figure is 44%, while on the others it is only 10-13%.

probe:ss_ipi:

Overhead       Samples  CPU  Symbol    Shared Object
29.09%         13175  001  [k] ss_ipi  [tempesta_fw]
28.77%         13028  003  [k] ss_ipi  [tempesta_fw]
27.09%         12269  002  [k] ss_ipi  [tempesta_fw]
15.06%          6819  000  [k] ss_ipi  [tempesta_fw]

As a result, the ss_ipi tracing per CPU also shows that the count of processed IPIs (i.e., received IPIs, thus generated on other CPUs for CPU0) is 2 times lower on CPU0 than on the other CPUs.
Such behavior of CPU0 is explained by its higher workload: the RX IRQ is assigned to it (since RSS is absent); also, the enabled RPS forces CPU0 to spend time distributing incoming packets to the other CPUs - this is confirmed by samples of the RPS subsystem calls (not shown here).

probe:tfw_wq_pop_ticket:

Overhead       Samples  CPU  Symbol               Shared Object
40.98%        127195  000  [k] tfw_wq_pop_ticket  [tempesta_fw]
21.34%         66229  003  [k] tfw_wq_pop_ticket  [tempesta_fw]
20.04%         62205  001  [k] tfw_wq_pop_ticket  [tempesta_fw]
17.64%         54738  002  [k] tfw_wq_pop_ticket  [tempesta_fw]

The tfw_wq_pop_ticket tracing per CPU allows a rough estimate of the average work_queue size for each CPU as the ratio of the number of accesses to a non-empty queue in tfw_wq_pop_ticket to the number of ss_tx_action calls: for CPU0 it is about 10 (127195 / 13150), while for the other CPUs it is 4-5.

The picture is almost the same on the bare-metal server, except that the workload is more evenly distributed between CPUs (RSS is enabled with 8 queues for 8 CPUs on the incoming Mellanox NIC, and RPS is disabled): on average, for each CPU, the estimated queue size is 2-3.

Thus, the above tests show a low work_queue load: 2-5 messages retrieved from the queue on each ss_tx_action call, and IPIs re-enabled in 80% of ss_tx_action calls. In this context, the situation with CPU0 on the VM is especially interesting: due to the additional load from the IRQ, it has less time to service its work_queue and drains the queue more slowly. Its average queue size is 10 and IPIs are re-enabled in only about 50% of ss_tx_action calls; as a result, the count of IPIs generated for CPU0 drops by a factor of 2-3, and so does the load of call_function_single_interrupt:

Overhead     Samples  CPU                   Symbol                          Shared Object
 1.54%          5159  002  [k] call_function_single_interrupt              [kernel.vmlinux]
 1.44%          4945  003  [k] call_function_single_interrupt              [kernel.vmlinux]
 1.41%          4826  001  [k] call_function_single_interrupt              [kernel.vmlinux]
. . .
 0.38%          1455  000  [k] call_function_single_interrupt              [kernel.vmlinux]

This is how the IPI handler workload is distributed across the separate CPUs, out of the total from the general profile shown above:

4.65%         [k] call_function_single_interrupt

So, it seems that the cause of the remaining IPIs is that the queue is underloaded. To further reduce IPIs, one could hypothetically enforce a minimum number of iterations in ss_tx_action, or of NET_TX_SOFTIRQ regenerations, but these would be idle iterations: they would not do useful work (due to the empty queues) while consuming CPU resources.
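
For illustration only, a hypothetical variant of the earlier drain-loop sketch with such a minimum-iteration budget (not a proposal, just to show why the extra passes would be idle work):

```c
/*
 * Even when the queue is empty, keep IPIs disabled and re-run the SoftIRQ
 * a few more times, hoping a producer pushes new work before the budget
 * runs out. A real version would keep the counter per CPU.
 */
#define TX_IDLE_BUDGET 4	/* assumed number of extra idle passes */

static unsigned int tx_idle_passes;

static void tx_action_budget_sketch(struct wq *q)
{
	char buf[WQ_ITEM_SZ];

	while (!wq_pop(q, buf)) {
		tx_idle_passes = 0;	/* useful work resets the budget */
		handle_work_item(buf);
	}

	if (!wq_empty(q) || ++tx_idle_passes < TX_IDLE_BUDGET) {
		/* Keep IPIs disabled and poll the queue again. */
		raise_softirq(NET_TX_SOFTIRQ);
		return;
	}

	tx_idle_passes = 0;
	wq_ipi_enable(q);	/* give up; let producers send IPIs again */
}
```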

aleksostapenko added a commit that referenced this issue Oct 15, 2018
 Fix #806: Reduce count of IPI during work_queue processing.
@aleksostapenko
Contributor

aleksostapenko commented Oct 29, 2018

Although the above benchmark results on the VM (after the fix) demonstrated a significant decrease of the IPI handler time (2x), it still remains rather high (4-5% of all samples during the benchmark). Additional research has revealed that the cause of this behavior lies in the virtualization components (in this particular case the QEMU-KVM hypervisor and the Intel VMX hardware extension) and in how they interact with each other.
Hardware support for virtualization in Intel processors is provided by a set of operations called VMX operations. There are VMX root operation (which the hypervisor runs in) and VMX non-root operation (for guest software). A VM-entry transition is VMX root => VMX non-root; a VM-exit transition is VMX non-root => VMX root. In VMX non-root mode certain instructions and events cause VM exits to the hypervisor. VMX non-root operation and VMX transitions are controlled by a data structure called the virtual-machine control structure (VMCS). A hypervisor may use a different VMCS for each virtual machine (VM) it supports. For VMs with multiple processors (VCPUs), some hypervisors (e.g. QEMU-KVM) use a different VMCS for each VCPU.
On Intel processors with x2APIC support (used in the benchmark), IPI generation is simply a write to a model-specific register (MSR), which (under the QEMU-KVM hypervisor) causes a VM-exit on the IPI source CPU. And on Intel processors without APICv support (also the case in the benchmark), receiving an IPI under QEMU-KVM causes a VM-exit as well.
At the same time, KVM uses a domain model for PMC virtualization, which implies saving and restoring the relevant PMC registers only when execution switches between different guests, not on every VM-exit/VM-entry within the current guest. So, when an IPI is received on the target CPU, we also collect PMC samples during hypervisor operations (between VM-exit and VM-entry), not only for the guest's own operations. This accounts not only for IPI-handler processing in the guest, but also for IPI-handler processing on the host (on behalf of the guest IPI) on the target CPU - including manipulation of VCPU control structures, virtual IPI injection, etc. This is confirmed by profiling the guest via the perf kvm record/report command, which collects guest operating system statistics from the host in a system-wide manner:

        Overhead       Samples  Shared Object                 Symbol
           2,54%          9481  [guest.kernel.kallsyms.8600]  [g] update_stack_state
           2,26%          8311  [guest.kernel.kallsyms.8600]  [g] pvclock_clocksource_read
           1,96%          7150  [guest.kernel.kallsyms.8600]  [g] tfw_http_parse_resp
           1,82%          6728  [guest.kernel.kallsyms.8600]  [g] check_memory_region
           1,69%          6254  [guest.kernel.kallsyms.8600]  [g] tcp_ack
           1,47%          5437  [guest.kernel.kallsyms.8600]  [g] skb_release_data
           1,45%          5429  [guest.kernel.kallsyms.8600]  [g] unwind_next_frame.part.5
           1,40%          5305  [guest.kernel.kallsyms.8600]  [g] swiotlb_tbl_unmap_single
           1,31%          4885  [guest.kernel.kallsyms.8600]  [g] memcmp
           1,23%          4565  [guest.kernel.kallsyms.8600]  [g] __inet_lookup_established
           1,22%          4480  [guest.kernel.kallsyms.8600]  [g] memset_erms
           .   .   .
           0,34%          1221  [guest.kernel.kallsyms.8600]  [g] call_function_single_interrupt

Also, some tests were made to investigate the influence of VCPU-to-CPU pinning on guest system performance.
Under the QEMU-KVM hypervisor each VM is a separate QEMU process on the host. Each VCPU (in VMX/KVM) is a thread of the corresponding QEMU process. These threads are distributed across the host CPUs by the host Linux scheduler like any other thread/process in the OS.
Given the details of IPI and PMC virtualization in VMX/KVM described above, pinning VCPUs to separate CPUs does not help to reduce the measured time of IPI-handler processing in the VM, since (as mentioned above) this measured time accounts not only for IPI-handler processing in the guest but also for IPI-handler processing on the host (on behalf of the guest IPI), and real host IPIs will still be generated between CPUs in this case. To avoid the need for such host IPIs (which are intended to virtualize the guest's IPIs), we could pin all VCPUs to one CPU - however, as expected, the overall effectiveness of the guest system decreases dramatically in that case.
Regarding overall guest system performance: in the default configuration the VCPUs of a virtual machine are not pinned to any CPU and can be distributed among all of them (regardless of the VCPU count specified in the -smp QEMU option). E.g., if we define two VCPUs for a guest VM on a host with four CPUs, then either of these two VCPUs can be executed on any of the host CPUs. Below is a per-CPU perf output for a particular VM with four VCPUs executed on a host with four CPUs (the Pid column contains LWP (thread) IDs):

# perf report -i perf-cpu.data -n --no-children --sort=cpu,pid --call-graph=callee,fractal --pid=25241

                   Overhead   Samples  CPU    Pid:Command
                 +    3,98%         11754   000  25244:qemu-system-x86
                 +    3,57%         10839   001  25244:qemu-system-x86
                 +    3,32%         10661   001  25246:qemu-system-x86
                 +    3,32%          9874   003  25244:qemu-system-x86
                 +    3,29%          9897   002  25244:qemu-system-x86
                 +    3,26%         10182   000  25245:qemu-system-x86
                 +    3,21%         10203   001  25245:qemu-system-x86
                 +    3,18%         10310   003  25246:qemu-system-x86
                 +    3,07%          9765   003  25245:qemu-system-x86
                 +    3,04%          9951   002  25247:qemu-system-x86
                 +    3,03%          9894   000  25247:qemu-system-x86
                 +    2,98%          9420   000  25246:qemu-system-x86
                 +    2,93%          9513   001  25247:qemu-system-x86
                 +    2,89%          9177   002  25245:qemu-system-x86
                 +    2,80%          8993   002  25246:qemu-system-x86
                 +    2,78%          9083   003  25247:qemu-system-x86
                 +    1,57%          5013   001  25241:qemu-system-x86
                 +    1,54%          4935   000  25241:qemu-system-x86
                 +    1,52%          4887   002  25241:qemu-system-x86
                 +    1,48%          4790   003  25241:qemu-system-x86

Due to the free distribution among all host CPUs, situations where several VCPUs are processed on one CPU can often occur:

virsh # vcpuinfo tfw-debian9

                 VCPU:           0
                 CPU:            2
                 State:          running
                 CPU time:       335,7s
                 CPU Affinity:   yyyy

                 VCPU:           1
                 CPU:            1
                 State:          running
                 CPU time:       302,6s
                 CPU Affinity:   yyyy

                 VCPU:           2
                 CPU:            1
                 State:          running
                 CPU time:       279,4s
                 CPU Affinity:   yyyy

                 VCPU:           3
                 CPU:            2
                 State:          running
                 CPU time:       285,5s
                 CPU Affinity:   yyyy

To avoid this default behavior, one option is to pin all of the guest's VCPUs to the host's CPUs (e.g. via libvirt):

                 VCPU:           0
                 CPU:            0
                 State:          running
                 CPU time:       406,6s
                 CPU Affinity:   y---

                 VCPU:           1
                 CPU:            1
                 State:          running
                 CPU time:       369,5s
                 CPU Affinity:   -y--

                 VCPU:           2
                 CPU:            2
                 State:          running
                 CPU time:       342,2s
                 CPU Affinity:   --y-

                 VCPU:           3
                 CPU:            3
                 State:          running
                 CPU time:       350,3s
                 CPU Affinity:   ---y

                Overhead       Samples  CPU    Pid:Command
                 +   13,96%         32734  000  25244:qemu-system-x86
                 +   13,46%         34001  003  25247:qemu-system-x86
                 +   12,36%         32114  001  25245:qemu-system-x86
                 +   11,18%         31111  002  25246:qemu-system-x86
                 +    2,19%          5920  001  25241:qemu-system-x86
                 +    2,18%          5724  003  25241:qemu-system-x86
                 +    1,84%          5320  002  25241:qemu-system-x86
                 +    0,32%           807  000  25241:qemu-system-x86

(Pid 25241 in the last perf output is a special iothread in the VM process intended for I/O processing of emulated devices; it is not a VCPU thread, so it is not pinned to any CPU.)
However, the result of such pinning demonstrates the same performance (at least in the 'requests per second' value) compared with unpinned VCPUs.
It seems that the host Linux scheduler's balancer effectively distributes unpinned VCPUs (QEMU threads) among the host CPUs; pinning is necessary only when some workloads are already pinned to particular host CPUs, in which case we need to manually distribute (pin) all workloads to gain maximum performance.

@aleksostapenko
Contributor

aleksostapenko commented Nov 6, 2018

Also, a series of benchmarks was made with Tempesta inside a VM, on a server with the APICv VMX extension (see Performance for details of verifying and enabling APICv support). As expected, perf profiling inside the VM showed that the posted-interrupts technique (supported by the APICv extension) eliminates accounting inside the VM of performance counters that belong to the hypervisor's processing:

Overhead       Samples  Symbol                                          Shared Object
   3.13%         12520  [k] update_stack_state                          [kernel.kallsyms]
   2.20%          8844  [k] tfw_http_parse_resp                         [tempesta_fw]
   1.99%          7985  [k] check_memory_region                         [kernel.kallsyms]
   1.83%          7328  [k] unwind_next_frame.part.5                    [kernel.kallsyms]
   1.73%          6712  [k] swiotlb_tbl_unmap_single                    [kernel.kallsyms]
   1.62%          6535  [k] tcp_ack                                     [kernel.kallsyms]
   1.53%          6081  [k] memcmp                                      [kernel.kallsyms]
   1.31%          5295  [k] tcp_transmit_skb                            [kernel.kallsyms]
   1.13%          4559  [k] __skb_fragment                              [tempesta_fw]
   1.07%          4299  [k] __raw_callee_save___pv_queued_spin_unlock   [kernel.kallsyms]
   1.00%          4025  [k] pvclock_clocksource_read                    [kernel.kallsyms]
   0.99%          3986  [k] skb_release_data                            [kernel.kallsyms]
   0.96%          3843  [k] __module_address                            [kernel.kallsyms]
   0.93%          3725  [k] memset_erms                                 [kernel.kallsyms]
   0.88%          3552  [k] memcpy_erms                                 [kernel.kallsyms]
   0.83%          3352  [k] __inet_lookup_established                   [kernel.kallsyms]
   0.82%          3307  [k] tcp_v4_rcv                                  [kernel.kallsyms]
   0.80%          3254  [k] tcp_write_xmit                              [kernel.kallsyms]
   0.77%          3101  [k] e1000_xmit_frame                            [e1000]
   0.76%          3033  [k] tfw_http_parse_req                          [tempesta_fw]
   0.75%          3034  [k] __new_pgfrag                                [tempesta_fw]
   0.72%          2889  [k] syscall_return_via_sysret                   [kernel.kallsyms]
.   .   .
   0.31%          1278  [k] call_function_single_interrupt              [kernel.kallsyms]

Besides, the improved interrupt virtualization is confirmed by perf VM-exit reports (perf kvm stat record/report commands). With APICv disabled:

Analyze events for all VMs, all VCPUs:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

           MSR_WRITE    1104720    42.61%     2.12%      0.40us    133.74us      1.29us ( +-   0.05% )
  EXTERNAL_INTERRUPT     611584    23.59%     0.99%      0.35us     96.26us      1.09us ( +-   0.07% )
                 HLT     348133    13.43%    72.31%      0.45us 1504088.88us    139.75us ( +-   6.86% )
       EPT_MISCONFIG     345096    13.31%    24.36%      3.41us   1243.19us     47.50us ( +-   0.28% )
   PENDING_INTERRUPT     108899     4.20%     0.10%      0.40us     14.90us      0.65us ( +-   0.08% )
            MSR_READ      40982     1.58%     0.05%      0.47us     36.65us      0.77us ( +-   0.16% )
    PREEMPTION_TIMER      23366     0.90%     0.05%      0.61us      5.94us      1.37us ( +-   0.17% )
   PAUSE_INSTRUCTION      10058     0.39%     0.01%      0.35us      2.48us      0.85us ( +-   0.41% )
       EXCEPTION_NMI          8     0.00%     0.00%      0.48us      1.48us      1.04us ( +-  11.43% )

Total Samples:2592846, Total events handled time:67281206.37us.

and with APICv enabled:

Analyze events for all VMs, all VCPUs:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

           MSR_WRITE    1134652    57.72%     2.33%      0.44us    289.09us      1.38us ( +-   0.05% )
                 HLT     383203    19.49%    72.47%      0.55us 1504065.27us    127.42us ( +-   7.33% )
       EPT_MISCONFIG     354353    18.03%    25.05%      3.46us    820.70us     47.62us ( +-   0.28% )
         EOI_INDUCED      41768     2.12%     0.06%      0.62us     17.43us      0.92us ( +-   0.11% )
    PREEMPTION_TIMER      22146     1.13%     0.04%      0.72us      7.53us      1.31us ( +-   0.17% )
  EXTERNAL_INTERRUPT      18617     0.95%     0.04%      0.41us   1605.60us      1.40us ( +-   6.48% )
   PAUSE_INSTRUCTION      10963     0.56%     0.01%      0.39us     38.51us      0.92us ( +-   0.54% )
            MSR_READ        125     0.01%     0.00%      0.59us      2.66us      1.49us ( +-   2.55% )
       EXCEPTION_NMI          7     0.00%     0.00%      0.59us      1.37us      1.00us ( +-  11.16% )

Total Samples:1965834, Total events handled time:67374057.88us.

The reports show that VM-exits caused by EXTERNAL_INTERRUPT decreased about 30 times (and the total VM-exit count decreased by 25%) with APICv enabled.
However, the RPS value did not demonstrate significant growth with APICv enabled: it slightly increased on average by 1.5-2K (tested on 4 VCPUs and on 12 VCPUs, pinned and not pinned, with a VirtIO NIC and with a QEMU-emulated NIC).
