
Non-keepalive HTTP requests handling #1415

Closed

krizhanovsky opened this issue Jun 8, 2020 · 8 comments

@krizhanovsky
Contributor

Scope

There was a claim that Tempesta FW processes new HTTP connections (ab without -k was used) at about the same speed as Nginx or HAProxy. The expected results should be no worse than https://github.com/F-Stack/f-stack#nginx-testing-result .

Testing

We need to measure the performance and write an appropriate Wiki page on how to set up a testing environment (I'd expect that ab was unable to generate enough load, that the issue is in a virtualized NIC inside a VM, or that there is some other environmental issue).

I mark the issue as a bug since we never profiled exactly this workload, so there could be some synchronization issue.

@krizhanovsky
Contributor Author

krizhanovsky commented Jun 9, 2020

1 CPU VM

Test case

I tried 600-byte responses as in https://github.com/F-Stack/f-stack#nginx-testing-result (perl -le 'print "X" x 600' > /var/www/html/index.html). I also tried the benchmarks with 1- and 6000-byte responses, and the results were more or less the same.
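
As a quick sanity check (my own illustration, not part of the original report), the backend response size can be verified with curl; debian:9090 is the Nginx backend described in the setup below:

# Should print roughly 601 bytes: 600 'X' characters plus the newline added by perl -l.
curl -s -o /dev/null -w '%{size_download}\n' http://debian:9090/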

Tempesta FW and Nginx run inside a VM with a virtio-net network interface, 1 CPU, and 2 GB RAM. The host CPU is an i7-6500U. The benchmark tools were run from the host system.

The VM is accessible from the host system by the name debian.

All the test cases were performed on the linux-4.14.32-tfw Tempesta kernel (the native Debian kernel didn't show any performance differences).

Nginx config

Nginx 1.14.2 was used with the following configuration file:

user www-data;
worker_processes auto;
worker_cpu_affinity auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;

events {
    worker_connections   65536;
    use epoll;
    multi_accept on;
    accept_mutex off;
}
worker_rlimit_nofile    1000000;

http {
    keepalive_timeout 600;
    keepalive_requests 10000000;
    sendfile         on;
    tcp_nopush       on;
    tcp_nodelay      on;

    open_file_cache max=1000 inactive=3600s;
    open_file_cache_valid 3600s;
    open_file_cache_min_uses 2;
    open_file_cache_errors off;

    error_log /dev/null emerg;
    access_log off;

    server {
	listen 9090 backlog=131072 deferred reuseport fastopen=4096;

        location / {
            root /var/www/html;
        }
    }
}

Tempesta FW config

The Nginx instance is used as the backend web server.

listen 192.168.100.4:80;

srv_group default {
	server 127.0.0.1:9090; # nginx backend
}
vhost default {
	proxy_pass default;
}

cache 1;
cache_fulfill * *;

http_chain {
	-> default;
}

Benchmark tool

ab -n 100000 -c 10000 can't efficiently handle 10K connections, so Nginx and Tempesta FW show the same performance numbers. You can observe that in this case ab consumes 100% of a CPU.

The same non-keepalive test can be done with ./wrk -H 'Connection: close' -c 10000 -d 30 -t 2, which can generate more load by using multiple threads. The results for both Nginx and Tempesta FW were better with wrk than with ab.
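
A hedged sketch (not from the original runs) of pinning the load generator to dedicated host cores so it doesn't compete with the qemu vCPU thread; the core numbers are placeholders for this host:

# Run wrk on host cores 2 and 3 only, away from the cores serving the VM.
taskset -c 2,3 ./wrk -H 'Connection: close' -c 10000 -d 30 -t 2 http://debian:80/
# Verify which side saturates first: both wrk threads should sit near 100%.
top -H -p "$(pgrep -x wrk)"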

Results

During all the tests wrk still can't generate enough load; top on the host system looks like:

  12927 alex      20   0  333648  59328   3436 S 192.1   0.7   0:27.89 wrk                                         
   5805 root      20   0 2781888   2.1g  37472 S 162.4  26.9  52:15.49 qemu-system-x86                             

and there is no idle CPU left.

The VM consumes more than 100% CPU due to the virtio-net paravirtualization.

Nginx

$ ./wrk -H 'Connection: close' -c 10000 -d 30 -t 2 http://debian:9090/
Running 30s test @ http://debian:9090/
  2 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    98.50ms  206.00ms   1.86s    93.52%
    Req/Sec     9.82k   697.78    11.47k    72.16%
  580530 requests in 30.06s, 461.73MB read
  Socket errors: connect 0, read 0, write 0, timeout 1618
Requests/sec:  19314.03
Transfer/sec:     15.36MB

Tempesta FW

$ ./wrk -H 'Connection: close' -c 10000 -d 30 -t 2 http://debian:80/
Running 30s test @ http://debian:80/
  2 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   133.84ms  259.65ms   1.87s    91.14%
    Req/Sec    10.20k     0.94k   12.21k    75.72%
  605073 requests in 30.05s, 513.37MB read
  Socket errors: connect 0, read 0, write 0, timeout 2436
Requests/sec:  20136.68
Transfer/sec:     17.08MB

This is just about 5% more than Nginx.

Tempesta FW (HTTP/2 regression)

$ ./wrk -H 'Connection: close' -c 10000 -d 30 -t 2 http://debian:80/
Running 30s test @ http://debian:80/
  2 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    47.83ms  128.56ms   1.80s    89.39%
    Req/Sec     8.51k     1.20k   11.64k    66.78%
  502569 requests in 30.06s, 426.41MB read
  Socket errors: connect 0, read 0, write 0, timeout 1020
Requests/sec:  16720.13
Transfer/sec:     14.19MB

This is below the Nginx results. There are 3 key points in the perf profile for Tempesta FW (the profiling commands are sketched after the listing):

  1. __inet_lookup_established is at the top
  2. however, the whole profile is flat and there is no sharp bottleneck
  3. while the latest HTTP/2 commits sacrifice HTTP/1 performance to make HTTP/2 'the first-class citizen', it's still unexpected to see HPACK-related calls in a pure HTTP/1 benchmark.
     2.48%  swapper          [kernel.vmlinux]          [k] __inet_lookup_established
     1.68%  swapper          [kernel.vmlinux]          [k] kmem_cache_alloc
     1.64%  swapper          [tempesta_fw]             [k] tfw_hpack_cache_decode_expand
     1.55%  swapper          [tempesta_fw]             [k] tfw_http_msg_expand_data
     1.44%  swapper          [kernel.vmlinux]          [k] fib_table_lookup
     1.38%  swapper          [tempesta_fw]             [k] tfw_http_parse_req
     1.25%  swapper          [kernel.vmlinux]          [k] kmem_cache_free
     1.22%  swapper          [kernel.vmlinux]          [k] tcp_v4_rcv
     1.11%  swapper          [kernel.vmlinux]          [k] tcp_ack
     1.05%  swapper          [kernel.vmlinux]          [k] skb_release_data
     0.92%  swapper          [virtio_net]              [k] start_xmit
     0.86%  swapper          [kernel.vmlinux]          [k] __netif_receive_skb_core
     0.82%  swapper          [tempesta_fw]             [k] tfw_cache_h2_decode_write
     0.82%  swapper          [virtio_net]              [k] free_old_xmit_skbs.isra.28
     0.82%  swapper          [kernel.vmlinux]          [k] tcp_v4_early_demux
     0.80%  swapper          [kernel.vmlinux]          [k] irq_entries_start
     0.78%  swapper          [kernel.vmlinux]          [k] memcpy_erms
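
For reference, a system-wide profile like the one above can be collected roughly as follows (a sketch; the exact perf options used in this run aren't recorded in the issue):

# Sample all CPUs with call graphs for 30 seconds while wrk is running,
# then print the flat per-symbol profile.
perf record -a -g -- sleep 30
perf report --stdio | head -40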

Resume

  1. The tests can not be done properly with a 2-CPU load generator against a properly tuned HTTP server, even on a 1-CPU VM: the benchmark side becomes the bottleneck.
  2. There is actually a performance degradation in Tempesta FW due to the recent HTTP/2 commits.

@krizhanovsky
Contributor Author

User space TCP/IP stacks

As the original report referenced F-Stack as an alternative solution, it's worth discussing the differences between the approaches. Also read our Review of Google Snap paper.

DISCLAIMER 1: we are married to the Linux TCP/IP stack at the moment, but we'd love to split up if we see more opportunities with user-space networking. It'd be quite a bit of work to move Tempesta FW to DPDK or a similar platform, but it is doable.

DISCLAIMER 2: there is nothing specific about F-Stack here, and there are several other kernel bypass solutions, e.g. Seastar. This is also not a competitive comparison of Tempesta FW with F-Stack/Nginx:

  1. the F-Stack team focuses mostly on the network stack rather than on the final HTTPS performance on the Nginx side;
  2. Tempesta FW is an HTTPS proxy developed from scratch aiming to achieve the highest HTTPS speed and, at the same time, significantly extend protection against DDoS and web attacks;
  3. at the moment we focus solely on the performance topic, not on usability or other aspects required for a reasonable comparison.

Scaling vs performance

First of all, F-Stack delivers even worse performance than the Linux kernel TCP/IP stack on a small number of connections. It still suffers from memory copies. As discussed in the referenced thread, the kernel bypass project is mostly about scaling on CPU cores rather than pure performance.

It's still a TODO for this issue to explore how the Linux TCP/IP stack scales with Tempesta FW as the number of CPU cores increases.

The application layer is the main bottleneck

Some time ago we compared in-kernel Tempesta FW with Seastar, a DPDK-based HTTP server (details are in my talk).
Basically, we provide similar speed. There are 2 reasons for this:

  1. this is an application layer with heavyweight HTTP processing and complex logic, so the bottleneck isn't in networking once the key socket API overheads are removed;
  2. Tempesta FW doesn't use the high-level socket API with file descriptors, so many locks and queues were removed (the usual reasons why Nginx stops scaling on multi-core systems).

The same goes for Redis on top of F-Stack: a fast network layer doesn't contribute much to the application performance.

Mainstream performance extensions

With Tempesta FW we keep the Linux kernel patch as small as possible to be able to migrate to newer kernels easily. (Honestly, we're not so quick at this.)

The F-Stack team seems to need quite a bit of work to move to a newer FreeBSD TCP/IP stack.

Since F-Stack doesn't seem to rework the FreeBSD TCP/IP stack much, the question is whether the FreeBSD TCP/IP stack is actually faster than the Linux one. It seems not. There are other comparisons of Linux vs FreeBSD performance and scaling on multiple cores, e.g. https://www.phoronix.com/scan.php?page=article&item=3990x-freebsd-bsd .

Back in 2009-2010 we did some work on FreeBSD performance improvements for web hosting needs. In most cases we just re-implemented some mechanisms from the Linux kernel. We also considered FreeBSD as the platform for Tempesta FW (mostly because of the license), but ended up with Linux solely for performance reasons.

DPDK in general

It does make sense to consider DPDK for new network protocols like QUIC. However, the following concerns must be taken into account:

  1. always 100% CPU usage (i.e. more power consumption), see also Nginx Benchmarking with Linux TCP/IP stack and F-stack F-Stack/f-stack#249 . Google Snap fixes this, but the generic problem of process sleep/wakeup can't be solved. In kernel space we use inter-processor interrupts (IPIs) for this.

  2. no sendfile(2) and other zero-copy filesystem operations. Not a big issue though for web servers that use a database for the web cache, like Tempesta FW or Apache Traffic Server.

  3. No control over preemption. You can move the DPDK worker processes to a separate scheduling group, but strictly speaking you can not use lock-free algorithms that rely on preemption control.

On the other hand, DPDK basically doesn't provide anything better than the Linux softirq infrastructure. This means that an in-kernel QUIC, being developed from scratch, won't carry legacy code supporting too many features, so it could be no slower than a DPDK-based one.

Resume

  1. We still need to test how Tempesta FW and the modern Linux TCP/IP stack scale across multiple cores when establishing many TCP connections.

  2. DPDK by itself isn't the answer for a high-performance HTTPS server. You also need specialized application code, including TLS and HTTP parsing.

@krizhanovsky
Contributor Author

krizhanovsky commented Jun 9, 2020

It seems there are some problems with the performance tests of F-Stack against vanilla Nginx/Linux, see F-Stack/f-stack#519 . I hope the F-Stack team replies regarding the issue; otherwise it makes sense to test Tempesta FW only against Nginx/Linux, not Nginx/F-Stack.

@krizhanovsky
Contributor Author

krizhanovsky commented Jun 10, 2020

Preliminary results are described in the Wiki https://github.com/tempesta-tech/tempesta/wiki/HTTP-transactions-performance :

  1. Tempesta FW scales about 2 times better than Nginx on 2 CPUs, even with the 20% performance regression bug;
  2. Nginx and the Linux TCP/IP stack do scale from 2 CPUs to 4 CPUs, almost linearly.

We need to check the results further with the F-Stack and F5 guys because our picture is completely different:

  1. we see that Nginx/Linux does scale, while they see almost no difference in the results for Nginx/Linux: only about a 40% gain from 1 to 12 CPU cores;
  2. however, we see much lower RPS numbers than they do (60 KRPS on a 4 vCPU VM vs about 100-150 KRPS in their tests), maybe due to a weaker CPU and/or not enough load on the SUT.

TODO

  1. Benchmark Tempesta FW and Nginx on a 4 CPU VM with enough load. I couldn't manage to set up mTCP ab, though it looks quite promising. Or maybe use a second server to generate the load.
  2. If you can run mTCP ab, then it does make sense to test Tempesta FW and Nginx on 8 cores (inside a VM and on a bare metal server).
  3. Add the results table and graph for 1, 2, 4, and probably 8 CPUs to the Wiki.
  4. I hit a kernel issue on the load generator, Poor __inet_check_established() implementation #1419; it needs more investigation.
  5. Get the top output for the 4 CPU SUT with Nginx under maximum load.

@vankoven
Contributor

A few notes about DPDK as an addition to #1415 (comment)

  1. It's hard to compare in-kernel networking with DPDK directly, since the latter is a set of building blocks and doesn't contain a TCP/IP stack. HTTP/3 is coming, but HTTP/2 and HTTP/1.x are still here.
  2. In most cases the kernel stack has better tooling that has existed and evolved for many years, like tcpdump, SystemTap, eBPF, etc. (a quick eBPF example is sketched below).
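
As an illustration of that tooling (a hedged sketch, not from the original discussion): counting TCP connections entering ESTABLISHED per second on the SUT with bpftrace. It assumes a kernel providing the sock:inet_sock_set_state tracepoint (4.16+):

# @conns is printed and reset every second; newstate == 1 is TCP_ESTABLISHED.
bpftrace -e 'tracepoint:sock:inet_sock_set_state /args->newstate == 1/ { @conns = count(); }
    interval:s:1 { print(@conns); clear(@conns); }'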

No control over preemption. You can move the DPDK worker processes to a separate scheduling group, but strictly speaking you can not use lock-free algorithms that rely on preemption control.

Preemption control is still possible with DPDK. The isolcpus kernel parameter can be used to detach cores from the kernel scheduler and use them exclusively for the DPDK application.
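
For example (a sketch with assumed core numbers, not a tested configuration), isolate cores 2-5 at boot and then pin a DPDK application to them:

# Kernel command line (e.g. appended to GRUB_CMDLINE_LINUX): keep the scheduler
# and most kernel housekeeping off cores 2-5.
isolcpus=2-5 nohz_full=2-5 rcu_nocbs=2-5

# Run a DPDK application exclusively on the isolated cores
# (dpdk-testpmd is just an example binary; -l selects the lcore list).
dpdk-testpmd -l 2-5 -n 4 -- --forward-mode=io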

@sburn

sburn commented Jun 12, 2020

IMHO very poor links are used in a comparison of the networking performance in FreeBSD and Linux.

Using FreeBSD and commodity parts, Netflix achieves 90 Gb/s serving TLS-encrypted connections with ~55% CPU on a 16-core 2.6-GHz CPU (source).

Add FreeBSD kernel-side support for in-kernel TLS

NUMA Siloing in the FreeBSD Network Stack (Or how to serve 200Gb/s of TLS from FreeBSD)

@krizhanovsky
Contributor Author

krizhanovsky commented Jun 13, 2020

Hi @sburn ,

I didn't get whether 'links' means the references which I used to compare the FreeBSD and Linux networking stacks, or network links :)

In the second case, there are no physical network links in the benchmarks; all the networking was done between two VMs on the same host.

In the first case, unfortunately I didn't find any good and fresh comparative benchmarks for Linux and FreeBSD networking. FreeBSD uses very tiny socket buffers (mbuf), which is beneficial for network performance. Linux also works on shrinking the socket buffer (sk_buff) size, but it's actually constantly growing due to new features pushed into the stack. There is also the amazing Netgraph in the FreeBSD TCP/IP stack (which was an inspiration for the Tempesta generic finite state machine). I believe there are other technical advantages of the FreeBSD network stack over Linux networking as well.

I agree that FreeBSD can deliver great performance, and I respect the people working on its TCP/IP stack. But, having kernel development experience in both operating systems, I'd say that Linux typically employs much more advanced technologies than FreeBSD. I also didn't see any 'game changing' FreeBSD performance advantages that made people move from Linux to FreeBSD. The whole development process in the Linux kernel is much faster than in FreeBSD, because more people invest their time and budget into the development. That's why we ended up with the Linux TCP/IP stack.

@krizhanovsky
Contributor Author

The issue is actually a duplicate of #806 . See more details in the wiki pages.

Performance data for Tempesta FW on a 4 CPU VM with a macvtap interface and wrk running on a remote server with a 10 Gbps link:

# wrk --latency -H 'Connection: close' -c 16384 -d 30 -t 16 http://172.16.0.200/
Running 30s test @ http://172.16.0.200/
  16 threads and 16384 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    27.61ms   98.16ms   1.71s    91.69%
    Req/Sec     7.03k     1.74k   14.48k    72.56%
  Latency Distribution
     50%    3.49ms
     75%    4.94ms
     90%    8.97ms
     99%  417.74ms
  3361037 requests in 30.06s, 2.79GB read
  Socket errors: connect 0, read 0, write 0, timeout 534
Requests/sec: 111807.44
Transfer/sec:     94.90MB

For Nginx on the same setup:

# wrk --latency -H 'Connection: close' -c 16384 -d 30 -t 16 http://172.16.0.200:9090/
Running 30s test @ http://172.16.0.200:9090/
  16 threads and 16384 connections
^C  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    33.59ms  101.61ms   1.92s    91.52%
    Req/Sec     5.18k     1.71k   11.37k    62.63%
  Latency Distribution
     50%    6.82ms
     75%   14.34ms
     90%   23.79ms
     99%  490.17ms
  1374094 requests in 16.71s, 1.07GB read
  Socket errors: connect 0, read 0, write 0, timeout 317
Requests/sec:  82227.96
Transfer/sec:     65.32MB

The bottleneck for Tempesta FW is host interrupts (perf kvm stat):

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

  EXTERNAL_INTERRUPT    5073570    75.37%     2.02%      0.22us   3066.52us      0.96us ( +-   0.49% )
       EPT_MISCONFIG    1029496    15.29%     1.78%      0.34us   1795.92us      4.19us ( +-   0.35% )
           MSR_WRITE     279208     4.15%     0.16%      0.28us   5695.07us      1.36us ( +-   2.52% )
                 HLT     194422     2.89%    95.74%      0.30us 1504068.39us   1192.90us ( +-   4.03% )
   PENDING_INTERRUPT      89818     1.33%     0.03%      0.32us    189.53us      0.70us ( +-   0.83% )
   PAUSE_INSTRUCTION      40905     0.61%     0.26%      0.26us   1390.91us     15.39us ( +-   1.82% )
    PREEMPTION_TIMER      17384     0.26%     0.01%      0.44us    183.21us      1.49us ( +-   1.47% )
      IO_INSTRUCTION       5482     0.08%     0.01%      1.75us    186.08us      3.26us ( +-   1.19% )
               CPUID        972     0.01%     0.00%      0.30us      5.29us      0.66us ( +-   1.94% )
            MSR_READ        104     0.00%     0.00%      0.49us      2.54us      0.94us ( +-   3.29% )
       EXCEPTION_NMI          6     0.00%     0.00%      0.37us      0.78us      0.59us ( +-   9.68% )
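
For reference, such a VM-exit summary can be collected roughly like this (a sketch; the exact invocation isn't recorded here, and the qemu process name is an assumption):

# Record VM exits of the qemu process (stop with Ctrl-C), then summarize them.
perf kvm stat record -p "$(pgrep -f qemu-system | head -1)"
perf kvm stat report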

Unfortunately, for the moment we don't have good enough hardware with a NIC supporting SR-IOV and a CPU supporting vAPIC.
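
A hedged way to check a candidate host for those capabilities (the interface name eth0 is a placeholder, and the paths assume an Intel CPU with the kvm_intel module):

# Number of SR-IOV virtual functions the NIC can expose (the file is absent if SR-IOV is unsupported).
cat /sys/class/net/eth0/device/sriov_totalvfs
# Whether KVM has APIC virtualization (APICv) enabled on this host.
cat /sys/module/kvm_intel/parameters/enable_apicv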

I created a task for the HTTP/2 performance regression, #1422 . I'll also update the Wiki pages about the benchmarks and virtual environment performance, and add specific system requirements for virtual environments to https://github.com/tempesta-tech/tempesta/wiki/Requirements .
