
Non-keepalive HTTP requests handling #1415

Closed

krizhanovsky opened this issue Jun 8, 2020 · 8 comments

@krizhanovsky
Contributor

Scope

There was a claim that Tempesta FW processes new HTTP connections (ab without -k was used) at about the same speed as Nginx or HAProxy. The expected results should be no worse than https://github.com/F-Stack/f-stack#nginx-testing-result .

Testing

We need to measure the performance and write an appropriate Wiki page on how to set up a testing environment (I'd expect that ab was unable to generate enough load, that the issue is in a virtualized NIC inside a VM, or that there is some other environmental issue).

I mark the issue as a bug since we never profiled exactly this workload, so there could be some synchronization issue.

@krizhanovsky
Contributor Author

krizhanovsky commented Jun 9, 2020

1 CPU VM

Test case

I tried 600-byte responses as in https://github.com/F-Stack/f-stack#nginx-testing-result (perl -le 'print "X" x 600' > /var/www/html/index.html). I also tried the benchmarks with 1- and 6000-byte responses, and the results were more or less the same.
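
As a quick sanity check (my own illustration, not part of the original report), the backend response size can be verified with curl; debian:9090 is the Nginx backend described in the setup below:

# Should print roughly 601 bytes: 600 'X' characters plus the newline added by perl -l.
curl -s -o /dev/null -w '%{size_download}\n' http://debian:9090/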

Tempesta FW and Nginx run inside a VM with a virtio-net network interface, 1 CPU, and 2 GB RAM. The host CPU is an i7-6500U. The benchmark tools were run from the host system.

The VM is accessible from the host system by the name debian.

All the test cases were performed on the linux-4.14.32-tfw Tempesta kernel (the native Debian kernel didn't show any performance differences).

Nginx config

Nginx 1.14.2 was used with the following configuration file:

user www-data;
worker_processes auto;
worker_cpu_affinity auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;

events {
    worker_connections   65536;
    use epoll;
    multi_accept on;
    accept_mutex off;
}
worker_rlimit_nofile    1000000;

http {
    keepalive_timeout 600;
    keepalive_requests 10000000;
    sendfile         on;
    tcp_nopush       on;
    tcp_nodelay      on;

    open_file_cache max=1000 inactive=3600s;
    open_file_cache_valid 3600s;
    open_file_cache_min_uses 2;
    open_file_cache_errors off;

    error_log /dev/null emerg;
    access_log off;

    server {
	listen 9090 backlog=131072 deferred reuseport fastopen=4096;

        location / {
            root /var/www/html;
        }
    }
}

Tempesta FW config

The Nginx instance is used as the backend web server.

listen 192.168.100.4:80;

srv_group default {
	server 127.0.0.1:9090; # nginx backend
}
vhost default {
	proxy_pass default;
}

cache 1;
cache_fulfill * *;

http_chain {
	-> default;
}

Benchmark tool

ab -n 100000 -c 10000 can't efficiently handle 10K connections, so Nginx and Tempesta FW show the same performance numbers. You can observe that in this case ab consumes 100% of a CPU.

The same non-keepalive test can be done with ./wrk -H 'Connection: close' -c 10000 -d 30 -t 2, which can generate more load by using multiple threads. The results for both Nginx and Tempesta FW were better with wrk than with ab.
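
A hedged sketch (not from the original runs) of pinning the load generator to dedicated host cores so it doesn't compete with the qemu vCPU thread; the core numbers are placeholders for this host:

# Run wrk on host cores 2 and 3 only, away from the cores serving the VM.
taskset -c 2,3 ./wrk -H 'Connection: close' -c 10000 -d 30 -t 2 http://debian:80/
# Verify which side saturates first: both wrk threads should sit near 100%.
top -H -p "$(pgrep -x wrk)"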

Results

During all the tests wrk still can't generate enough load; top on the host system looks like:

  12927 alex      20   0  333648  59328   3436 S 192.1   0.7   0:27.89 wrk                                         
   5805 root      20   0 2781888   2.1g  37472 S 162.4  26.9  52:15.49 qemu-system-x86                             

and there is no idle CPU left.

The VM consumes more than 100% CPU due to the virtio-net paravirtualization.

Nginx

$ ./wrk -H 'Connection: close' -c 10000 -d 30 -t 2 http://debian:9090/
Running 30s test @ http://debian:9090/
  2 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    98.50ms  206.00ms   1.86s    93.52%
    Req/Sec     9.82k   697.78    11.47k    72.16%
  580530 requests in 30.06s, 461.73MB read
  Socket errors: connect 0, read 0, write 0, timeout 1618
Requests/sec:  19314.03
Transfer/sec:     15.36MB

Tempesta FW

$ ./wrk -H 'Connection: close' -c 10000 -d 30 -t 2 http://debian:80/
Running 30s test @ http://debian:80/
  2 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   133.84ms  259.65ms   1.87s    91.14%
    Req/Sec    10.20k     0.94k   12.21k    75.72%
  605073 requests in 30.05s, 513.37MB read
  Socket errors: connect 0, read 0, write 0, timeout 2436
Requests/sec:  20136.68
Transfer/sec:     17.08MB

This is just about 5% more than Nginx.

Tempesta FW (HTTP/2 regression)

$ ./wrk -H 'Connection: close' -c 10000 -d 30 -t 2 http://debian:80/
Running 30s test @ http://debian:80/
  2 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    47.83ms  128.56ms   1.80s    89.39%
    Req/Sec     8.51k     1.20k   11.64k    66.78%
  502569 requests in 30.06s, 426.41MB read
  Socket errors: connect 0, read 0, write 0, timeout 1020
Requests/sec:  16720.13
Transfer/sec:     14.19MB

This is below the Nginx results. There are 3 key points in the perf profile for Tempesta FW (the profiling commands are sketched after the listing):

  1. __inet_lookup_established is at the top
  2. however, the whole profile is flat and there is no sharp bottleneck
  3. while the latest HTTP/2 commits sacrifice HTTP/1 performance to make HTTP/2 'the first-class citizen', it's still unexpected to see HPACK-related calls in a pure HTTP/1 benchmark.
     2.48%  swapper          [kernel.vmlinux]          [k] __inet_lookup_established
     1.68%  swapper          [kernel.vmlinux]          [k] kmem_cache_alloc
     1.64%  swapper          [tempesta_fw]             [k] tfw_hpack_cache_decode_expand
     1.55%  swapper          [tempesta_fw]             [k] tfw_http_msg_expand_data
     1.44%  swapper          [kernel.vmlinux]          [k] fib_table_lookup
     1.38%  swapper          [tempesta_fw]             [k] tfw_http_parse_req
     1.25%  swapper          [kernel.vmlinux]          [k] kmem_cache_free
     1.22%  swapper          [kernel.vmlinux]          [k] tcp_v4_rcv
     1.11%  swapper          [kernel.vmlinux]          [k] tcp_ack
     1.05%  swapper          [kernel.vmlinux]          [k] skb_release_data
     0.92%  swapper          [virtio_net]              [k] start_xmit
     0.86%  swapper          [kernel.vmlinux]          [k] __netif_receive_skb_core
     0.82%  swapper          [tempesta_fw]             [k] tfw_cache_h2_decode_write
     0.82%  swapper          [virtio_net]              [k] free_old_xmit_skbs.isra.28
     0.82%  swapper          [kernel.vmlinux]          [k] tcp_v4_early_demux
     0.80%  swapper          [kernel.vmlinux]          [k] irq_entries_start
     0.78%  swapper          [kernel.vmlinux]          [k] memcpy_erms
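
For reference, a system-wide profile like the one above can be collected roughly as follows (a sketch; the exact perf options used in this run aren't recorded in the issue):

# Sample all CPUs with call graphs for 30 seconds while wrk is running,
# then print the flat per-symbol profile.
perf record -a -g -- sleep 30
perf report --stdio | head -40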

Resume

  1. The tests can not be done properly with a 2-CPU load generator against a properly tuned HTTP server, even on a 1-CPU VM: the benchmark side becomes the bottleneck.
  2. There is actually a performance degradation in Tempesta FW due to the recent HTTP/2 commits.

@krizhanovsky
Contributor Author

User space TCP/IP stacks

As the original report referenced F-Stack as an alternative solution, it's worth discussing the differences between the approaches. Also read our Review of Google Snap paper.

DISCLAIMER 1: we are married to the Linux TCP/IP stack at the moment, but we'd love to split up if we see more opportunities with user-space networking. It'd be quite a bit of work to move Tempesta FW to DPDK or a similar platform, but it is doable.

DISCLAIMER 2: there is nothing specific about F-Stack here, and there are several other kernel bypass solutions, e.g. Seastar. This is also not a competitive comparison of Tempesta FW with F-Stack/Nginx:

  1. the F-Stack team focuses mostly on the network stack rather than on the final HTTPS performance on the Nginx side;
  2. Tempesta FW is an HTTPS proxy developed from scratch aiming to achieve the highest HTTPS speed and, at the same time, significantly extend protection against DDoS and web attacks;
  3. at the moment we focus solely on the performance topic, not on usability or other aspects required for a reasonable comparison.

Scaling vs performance

First of all, F-Stack delivers even worse performance than the Linux kernel TCP/IP stack on a small number of connections. It still suffers from memory copies. As discussed in the referenced thread, the kernel bypass project is mostly about scaling on CPU cores rather than pure performance.

It's still a TODO for this issue to explore how the Linux TCP/IP stack scales with Tempesta FW as the number of CPU cores increases.

The application layer is the main bottleneck

Some time ago we compared in-kernel Tempesta FW with Seastar, a DPDK-based HTTP server (details are in my talk).
Basically, we provide similar speed. There are 2 reasons for this:

  1. this is an application layer with heavyweight HTTP processing and complex logic, so the bottleneck isn't in networking once the key socket API overheads are removed;
  2. Tempesta FW doesn't use the high-level socket API with file descriptors, so many locks and queues were removed (the usual reasons why Nginx stops scaling on multi-core systems).

The same goes for Redis on top of F-Stack: a fast network layer doesn't contribute much to the application performance.

Mainstream performance extensions

With Tempesta FW we keep the Linux kernel patch as small as possible to be able to migrate to newer kernels easily. (Honestly, we're not so quick at this.)

The F-Stack team seems to need quite a bit of work to move to a newer FreeBSD TCP/IP stack.

Since F-Stack doesn't seem to rework the FreeBSD TCP/IP stack much, the question is whether the FreeBSD TCP/IP stack is actually faster than the Linux one. It seems not. There are other comparisons of Linux vs FreeBSD performance and scaling on multiple cores, e.g. https://www.phoronix.com/scan.php?page=article&item=3990x-freebsd-bsd .

Back in 2009-2010 we did some work on FreeBSD performance improvements for web hosting needs. In most cases we just re-implemented some mechanisms from the Linux kernel. We also considered FreeBSD as the platform for Tempesta FW (mostly because of the license), but ended up with Linux solely for performance reasons.

DPDK in general

It does make sense to consider DPDK for new network protocols like QUIC. However, the following concerns must be taken into account:

  1. always 100% CPU usage (i.e. more power consumption), see also Nginx Benchmarking with Linux TCP/IP stack and F-stack F-Stack/f-stack#249 . Google Snap fixes this, but the generic problem of process sleep/wakeup can't be solved. In kernel space we use inter-processor interrupts (IPIs) for this.

  2. no sendfile(2) and other zero-copy filesystem operations. Not a big issue though for web servers that use a database for the web cache, like Tempesta FW or Apache Traffic Server.

  3. No control over preemption. You can move the DPDK worker processes to a separate scheduling group, but strictly speaking you can not use lock-free algorithms that rely on preemption control.

On the other hand, DPDK basically doesn't provide anything better than the Linux softirq infrastructure. This means that an in-kernel QUIC, being developed from scratch, won't carry legacy code supporting too many features, so it could be no slower than a DPDK-based one.

Resume

  1. We still need to test how Tempesta FW and the modern Linux TCP/IP stack scale across multiple cores when establishing many TCP connections.

  2. DPDK by itself isn't the answer for a high-performance HTTPS server. You also need specialized application code, including TLS and HTTP parsing.

@krizhanovsky
Contributor Author

krizhanovsky commented Jun 9, 2020

It seems there are some problems with the performance tests of F-Stack against vanilla Nginx/Linux, see F-Stack/f-stack#519 . I hope the F-Stack team replies regarding the issue; otherwise it makes sense to test Tempesta FW only against Nginx/Linux, not Nginx/F-Stack.

@krizhanovsky
Contributor Author

krizhanovsky commented Jun 10, 2020

Preliminary results are described in the Wiki https://github.com/tempesta-tech/tempesta/wiki/HTTP-transactions-performance :

  1. Tempesta FW scales about 2 times better than Nginx on 2 CPUs, even with the 20% performance regression bug;
  2. Nginx and the Linux TCP/IP stack do scale from 2 CPUs to 4 CPUs, almost linearly.

We need to check the results further with the F-Stack and F5 guys because our picture is completely different:

  1. we see that Nginx/Linux does scale, while they see almost no difference in the results for Nginx/Linux: only about a 40% gain from 1 to 12 CPU cores;
  2. however, we see much lower RPS numbers than they do (60 KRPS on a 4 vCPU VM vs about 100-150 KRPS in their tests), maybe due to a weaker CPU and/or not enough load on the SUT.

TODO

  1. Benchmark Tempesta FW and Nginx on a 4 CPU VM with enough load. I couldn't manage to set up mTCP ab, though it looks quite promising. Or maybe use a second server to generate the load.
  2. If you can run mTCP ab, then it does make sense to test Tempesta FW and Nginx on 8 cores (inside a VM and on a bare metal server).
  3. Add the results table and graph for 1, 2, 4, and probably 8 CPUs to the Wiki.
  4. I hit a kernel issue on the load generator, Poor __inet_check_established() implementation #1419; it needs more investigation.
  5. Get the top output for the 4 CPU SUT with Nginx under maximum load.

@vankoven
Contributor

A few notes about DPDK as an addition to #1415 (comment)

  1. It's hard to compare in-kernel networking with DPDK directly, since the latter is a set of building blocks and doesn't contain a TCP/IP stack. HTTP/3 is coming, but HTTP/2 and HTTP/1.x are still here.
  2. In most cases the kernel stack has better tooling that has existed and evolved for many years, like tcpdump, SystemTap, eBPF, etc. (a quick eBPF example is sketched below).
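
As an illustration of that tooling (a hedged sketch, not from the original discussion): counting TCP connections entering ESTABLISHED per second on the SUT with bpftrace. It assumes a kernel providing the sock:inet_sock_set_state tracepoint (4.16+):

# @conns is printed and reset every second; newstate == 1 is TCP_ESTABLISHED.
bpftrace -e 'tracepoint:sock:inet_sock_set_state /args->newstate == 1/ { @conns = count(); }
    interval:s:1 { print(@conns); clear(@conns); }'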

No control over preemption. You can move the DPDK worker processes to a separate scheduling group, but strictly speaking you can not use lock-free algorithms that rely on preemption control.

Preemption control is still possible with DPDK. The isolcpus kernel parameter can be used to detach cores from the kernel scheduler and use them exclusively for the DPDK application.
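
For example (a sketch with assumed core numbers, not a tested configuration), isolate cores 2-5 at boot and then pin a DPDK application to them:

# Kernel command line (e.g. appended to GRUB_CMDLINE_LINUX): keep the scheduler
# and most kernel housekeeping off cores 2-5.
isolcpus=2-5 nohz_full=2-5 rcu_nocbs=2-5

# Run a DPDK application exclusively on the isolated cores
# (dpdk-testpmd is just an example binary; -l selects the lcore list).
dpdk-testpmd -l 2-5 -n 4 -- --forward-mode=io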

@sburn

sburn commented Jun 12, 2020

IMHO very poor links are used in a comparison of the networking performance in FreeBSD and Linux.

Using FreeBSD and commodity parts, Netflix achieves 90 Gb/s serving TLS-encrypted connections with ~55% CPU on a 16-core 2.6-GHz CPU (source).

Add FreeBSD kernel-side support for in-kernel TLS

NUMA Siloing in the FreeBSD Network Stack (Or how to serve 200Gb/s of TLS from FreeBSD)

@krizhanovsky
Contributor Author

krizhanovsky commented Jun 13, 2020

Hi @sburn ,

I didn't get whether 'links' means the references which I used to compare the FreeBSD and Linux networking stacks, or network links :)

In the second case, there are no physical network links in the benchmarks; all the networking was done between two VMs on the same host.

In the first case, unfortunately I didn't find any good and fresh comparative benchmarks for Linux and FreeBSD networking. FreeBSD uses very tiny socket buffers (mbuf), which is beneficial for network performance. Linux also works on shrinking the socket buffer (sk_buff) size, but it's actually constantly growing due to new features pushed into the stack. There is also the amazing Netgraph in the FreeBSD TCP/IP stack (which was an inspiration for the Tempesta generic finite state machine). I believe there are other technical advantages of the FreeBSD network stack over Linux networking as well.

I agree that FreeBSD can deliver great performance, and I respect the people working on its TCP/IP stack. But, having kernel development experience in both operating systems, I'd say that Linux typically employs much more advanced technologies than FreeBSD. I also didn't see any 'game changing' FreeBSD performance advantages that made people move from Linux to FreeBSD. The whole development process in the Linux kernel is much faster than in FreeBSD, because more people invest their time and budget into the development. That's why we ended up with the Linux TCP/IP stack.

@krizhanovsky
Contributor Author

The issue is actually a duplicate of #806 . See more details in the wiki pages.

Performance data for Tempesta FW on a 4 CPU VM with a macvtap interface and wrk running on a remote server with a 10 Gbps link:

# wrk --latency -H 'Connection: close' -c 16384 -d 30 -t 16 http://172.16.0.200/
Running 30s test @ http://172.16.0.200/
  16 threads and 16384 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    27.61ms   98.16ms   1.71s    91.69%
    Req/Sec     7.03k     1.74k   14.48k    72.56%
  Latency Distribution
     50%    3.49ms
     75%    4.94ms
     90%    8.97ms
     99%  417.74ms
  3361037 requests in 30.06s, 2.79GB read
  Socket errors: connect 0, read 0, write 0, timeout 534
Requests/sec: 111807.44
Transfer/sec:     94.90MB

For Nginx on the same setup:

# wrk --latency -H 'Connection: close' -c 16384 -d 30 -t 16 http://172.16.0.200:9090/
Running 30s test @ http://172.16.0.200:9090/
  16 threads and 16384 connections
^C  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    33.59ms  101.61ms   1.92s    91.52%
    Req/Sec     5.18k     1.71k   11.37k    62.63%
  Latency Distribution
     50%    6.82ms
     75%   14.34ms
     90%   23.79ms
     99%  490.17ms
  1374094 requests in 16.71s, 1.07GB read
  Socket errors: connect 0, read 0, write 0, timeout 317
Requests/sec:  82227.96
Transfer/sec:     65.32MB

The bottleneck for Tempesta FW is host interrupts (perf kvm stat):

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

  EXTERNAL_INTERRUPT    5073570    75.37%     2.02%      0.22us   3066.52us      0.96us ( +-   0.49% )
       EPT_MISCONFIG    1029496    15.29%     1.78%      0.34us   1795.92us      4.19us ( +-   0.35% )
           MSR_WRITE     279208     4.15%     0.16%      0.28us   5695.07us      1.36us ( +-   2.52% )
                 HLT     194422     2.89%    95.74%      0.30us 1504068.39us   1192.90us ( +-   4.03% )
   PENDING_INTERRUPT      89818     1.33%     0.03%      0.32us    189.53us      0.70us ( +-   0.83% )
   PAUSE_INSTRUCTION      40905     0.61%     0.26%      0.26us   1390.91us     15.39us ( +-   1.82% )
    PREEMPTION_TIMER      17384     0.26%     0.01%      0.44us    183.21us      1.49us ( +-   1.47% )
      IO_INSTRUCTION       5482     0.08%     0.01%      1.75us    186.08us      3.26us ( +-   1.19% )
               CPUID        972     0.01%     0.00%      0.30us      5.29us      0.66us ( +-   1.94% )
            MSR_READ        104     0.00%     0.00%      0.49us      2.54us      0.94us ( +-   3.29% )
       EXCEPTION_NMI          6     0.00%     0.00%      0.37us      0.78us      0.59us ( +-   9.68% )
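
For reference, such a VM-exit summary can be collected roughly like this (a sketch; the exact invocation isn't recorded here, and the qemu process name is an assumption):

# Record VM exits of the qemu process (stop with Ctrl-C), then summarize them.
perf kvm stat record -p "$(pgrep -f qemu-system | head -1)"
perf kvm stat report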

Unfortunately, for the moment we don't have good enough hardware with a NIC supporting SR-IOV and a CPU supporting vAPIC.
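
A hedged way to check a candidate host for those capabilities (the interface name eth0 is a placeholder, and the paths assume an Intel CPU with the kvm_intel module):

# Number of SR-IOV virtual functions the NIC can expose (the file is absent if SR-IOV is unsupported).
cat /sys/class/net/eth0/device/sriov_totalvfs
# Whether KVM has APIC virtualization (APICv) enabled on this host.
cat /sys/module/kvm_intel/parameters/enable_apicv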

I created a task for the HTTP/2 performance regression, #1422 . I'll also update the Wiki pages about the benchmarks and virtual environment performance, and add specific system requirements for virtual environments to https://github.com/tempesta-tech/tempesta/wiki/Requirements .
