Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High cpu usage for outbound ssl requests after upgrading from v16.15.0 to v18.1.0 #43128

Closed
str1ke opened this issue May 17, 2022 · 18 comments
Closed
Labels
confirmed-bug Issues with confirmed bugs. tls Issues and PRs related to the tls subsystem.

Comments

@str1ke
Copy link

str1ke commented May 17, 2022

Version

v18.1.0

Platform

Linux 5.17.4-arch1-1 #1 SMP PREEMPT Wed, 20 Apr 2022 18:29:28 +0000 x86_64 GNU/Linux

Subsystem

crypto

What steps will reproduce the bug?

I have observed this problem in few http clients: axios, got. But looks like a problem is in node.js itself.

  1. Create a basic http server which produces outbound ssl requests and returns in response result of outbound request. Here is a sample
const http = require("http");
const https = require("https");

const port = 3000;

const handler = (req, res) => {
  https.get("https://server.local", (response) => {
    let body = "";

    response.on('data', chunk => {
      body += chunk;
    });

    response.on("end", () => {
      res.statusCode = response.statusCode;
      res.setHeader('Content-Type', 'text/plain');
      res.end(body);
    });
  });
};

const server = http.createServer(handler);

server.listen(port);
  1. Start to produce load to server. I'm using hey
hey -z 1m -c 10 -disable-keepalive "http://client.local:3000"
  1. Observe in CPU profiler for node.js server constantly increasing cpu usage for SecureContext.init(). It starts from ~50% and going up to 80%. Amount of requests processed by server, which sends outbound https requests is constantly decreasing.

How often does it reproduce? Is there a required condition?

Always reproduces if present outbound ssl requests and node version 18.1.0

What is the expected behavior?

Constant CPU usage, amount of processed requests is at the same level during time. In node.js v16.15.0 same code in CPU profiler have SecureContext.init() with usage of 5-8%. And during load there is no performance degradation.

What do you see instead?

CPU usage is keeping increasing during time, SecureContext.init() starts from ~50% cpu time in profiler and keep growing during time to 80-85%. Amount of requests processed by server is constantly decreasing.

Additional information

v18.1.0
Benchmark result - 186 req/s
node --perf output - https://pastebin.com/wyHPXfMH
cpu profile screenshot
cpu_profile_v18 1 0

v16.15.0
Benchmark result - 1184 req/s
node --perf output - https://pastebin.com/0L0Z01iq
cpu profile screenshot
cpu_profile_v16 15 0

@bnoordhuis
Copy link
Member

That's almost certainly caused by the upgrade to openssl 3. I don't know what your https://server.local endpoint is doing but try checking what the TLS version and cipher are with v16 and v18.

@str1ke
Copy link
Author

str1ke commented May 18, 2022

@bnoordhuis Thanks for reply. Sorry, I forgot to mention about server with ssl

ssl server

SSL server it is just plain nginx with self signed certificate and one location

location / {
    return 200 'hello world';
}

In first attempts to debug issue I was using node.js with ssl as a server, but then replaced it with nginx. The problem with node.js which performs outbound ssl requests.

tls version and cipher

I tried to compare ssl information in v16 and v18, but haven't found difference.
v16 node --trace-tls full output

Received Record
Header:
  Version = TLS 1.2 (0x303)
  Content Type = Handshake (22)
  Length = 69
    ServerHello, Length=65
      server_version=0x303 (TLS 1.2)
      Random:
        gmt_unix_time=0x3C4CEC8D
        random_bytes (len=28): 7E124B88A8D3A98A4435610DB3FA64121B975A23E157B3D023ED8310
      session_id (len=0):
      cipher_suite {0xC0, 0x2F} TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
...
    ServerKeyExchange, Length=296
      KeyExchangeAlgorithm=ECDHE
...

v18 node --trace-tls full output

Received Record
Header:
  Version = TLS 1.2 (0x303)
  Content Type = Handshake (22)
  Length = 69
    ServerHello, Length=65
      server_version=0x303 (TLS 1.2)
      Random:
        gmt_unix_time=0x3625953A
        random_bytes (len=28): DDB717917379017CD1B82F3D10ECCBD3CE5FCD94F0BFF08D59C1FBBB
      session_id (len=0):
      cipher_suite {0xC0, 0x2F} TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
...
    ServerKeyExchange, Length=296
      KeyExchangeAlgorithm=ECDHE
...

perf logs comparision

One interesting thing, which I don't know how to treat correctly is in perf logs.
For v16 it looks like this

 [Summary]:
   ticks  total  nonlib   name
  12338    7.7%   34.2%  JavaScript
  23649   14.7%   65.5%  C++
   3711    2.3%   10.3%  GC
  124857   77.6%          Shared libraries
    131    0.1%          Unaccounted

 [C++ entry points]:
   ticks    cpp   total   name
   2758   21.6%    1.7%  writev@@GLIBC_2.2.5
   2467   19.3%    1.5%  __write@@GLIBC_2.2.5
   1857   14.5%    1.2%  __libc_malloc@@GLIBC_2.2.5
   1203    9.4%    0.7%  cfree@GLIBC_2.2.5

For v18

 [Summary]:
   ticks  total  nonlib   name
   2049    1.2%   11.2%  JavaScript
  16170    9.7%   88.7%  C++
    601    0.4%    3.3%  GC
  148780   89.1%          Shared libraries
     20    0.0%          Unaccounted

 [C++ entry points]:
   ticks    cpp   total   name
   5513   37.9%    3.3%  __pthread_rwlock_unlock@GLIBC_2.2.5
   4488   30.9%    2.7%  __pthread_rwlock_rdlock@GLIBC_2.2.5
   1638   11.3%    1.0%  __libc_malloc@@GLIBC_2.2.5
    746    5.1%    0.4%  __pthread_rwlock_wrlock@GLIBC_2.2.5
    590    4.1%    0.4%  cfree@GLIBC_2.2.5
    349    2.4%    0.2%  writev@@GLIBC_2.2.5

If I read this correctly in v18 (openssl 3) application spends most of the time in locks (pthread_rwlock_unlock and pthread_rwlock_rdlock)

@str1ke
Copy link
Author

str1ke commented May 18, 2022

Also one interesting observation for node.js v18.1.0 + fetch as http client

const http = require("http");

const handler = async (req, res) => {
  const response = await fetch("https://server.local");
  const body = await response.text();

  res.statusCode = response.status;
  res.setHeader('Content-Type', 'text/plain');
  res.end(body);
};

const server = http.createServer(handler);

server.listen(3000);

gives me ~1500 req/s and in node --prof I don't see significant locking:

 [Summary]:
   ticks  total  nonlib   name
  54935   13.2%   52.7%  JavaScript
  48467   11.7%   46.5%  C++
  14309    3.4%   13.7%  GC
  310969   74.9%          Shared libraries
    761    0.2%          Unaccounted

 [C++ entry points]:
   ticks    cpp   total   name
  11996   41.7%    2.9%  writev@@GLIBC_2.2.5
   8636   30.0%    2.1%  __write@@GLIBC_2.2.5
   3711   12.9%    0.9%  __libc_malloc@@GLIBC_2.2.5
   1252    4.4%    0.3%  cfree@GLIBC_2.2.5
    534    1.9%    0.1%  setsockopt@@GLIBC_2.2.5
    503    1.7%    0.1%  operator new(unsigned long)
    362    1.3%    0.1%  operator delete(void*)

TLS info

...
Received Record
Header:
  Version = TLS 1.2 (0x303)
  Content Type = Handshake (22)
  Length = 69
    ServerHello, Length=65
      server_version=0x303 (TLS 1.2)
      Random:
        gmt_unix_time=0x5C93AD24
        random_bytes (len=28): 61F44DBFE0F5B0DD517392FB05CDF79E22671C8F512B0DF0A5FC2B2D
      session_id (len=0):
      cipher_suite {0xC0, 0x2F} TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
...
    ServerKeyExchange, Length=296
      KeyExchangeAlgorithm=ECDHE

In cpu profiler there is no SecureContext.init()

I'm assume that https.get and fetch are using same openssl version, but performance 180 req/s vs 1500 req/s

@Cow258
Copy link

Cow258 commented May 19, 2022

I am facing this issue too

My EC2 always use all the CPU resource every few days,
Than I need restart the node every few days.

Then I find out CPU usage 100% when the nodes that used web socket,
Other nodes is normal without used socket.
(I am running 12 nodes on this EC2)

Few days ago, I tried to resolve it by switched the Node.js version from v17.9.0 to v16.15.0.

image
image

@Cow258
Copy link

Cow258 commented May 19, 2022

Feb 11
I upgraded to Node.js v17.5.0 and had no issues.

Apr 15
I upgraded to Node.js v17.9.0

Apr 17,21,22,29 and May 5,7,10,14
I got high CPU usage issue

After investigating I believe the problem occurs between v17.5 and v17.9
Also I find this v17.7.2 change log about OpenSSL 3.0.2 that related to is issue.
https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V17.md#2022-03-17-version-1772-current-richardlau

But all the abnormal nodes are not using any TLS and SSL
And all exception nodes use WebSocket identically
Browser <HTTPS> CDN <HTTP> EC2 (nodejs reverse proxy) <HTTP> Node.js Server

To investigate, I plan to go back to v17.9.0 and do a remote CPU profiling when the problem occurs

@bnoordhuis
Copy link
Member

In response to #43128 (comment): the handshake indeed doesn't look materially different and the client and server settle on the same cipher/key exchange algorithm/etc. so that can't really explain it.

The fact that fetch() performs well is doubly puzzling. Can you check what the handshake looks like with fetch()?

I wonder if HTTP connection pooling / connection reuse (or lack thereof) is the culprit. How many TLS handshakes do you see with v16.x vs. v18.x (and fetch())?

@str1ke
Copy link
Author

str1ke commented May 19, 2022

@bnoordhuis Looks like I should not use fetch with this situation, because it looks like a very different beast in comparison with traditional way to make requests. It indeed use keep-alive and I don't see a way to disable it. According to fetch spec "Connection" is a forbidden header and cannot be set. Now it is clear why fetch shows such big numbers in req/s

Here is how I checked new tls sessions and connections, there might be a more correct way to count it. I'm listening tls socket events and looking to https.globalAgent._sessionCache:

...
let secureConnectCount = 0;
let tlsSessionsCount = 0;
let connectCount = 0;

  const outboundRequest = https.get("https://server.local", (response) => {
    ...
  }

  outboundRequest.on("socket", (tlsSocket) => {
    tlsSocket.on("connect", () => { connectCount++ });
    tlsSocket.on("secureConnect", () => secureConnectCount++);
    tlsSocket.on("session", (...args) => { console.log("session", args); tlsSessionsCount++ });
  });
...

Both node v16 and v18.1 creates new connection per request and performs tls handshake (secureConnect event on socket). TLS sessions are cached and only one session exists for remote server. So, if I sent 10_000 requests I got 10_000 "connect" events, 10_000 "secureConnect" events.

@str1ke
Copy link
Author

str1ke commented May 19, 2022

Another interesting thing with v.18.2 release. In same benchmark I got ~480 req/s and SecureContext.init() is taking constantly ~35% of total time in profiler without going up during test time. So, right now
v16.15.0 - ~1200 req/s
v18.1.0 - ~180 req/s
v18.2.0 - ~480 req/s

So, v18.2.0 much better and, which is more important, stable result, but I haven't tried it in production yet.

@str1ke
Copy link
Author

str1ke commented May 19, 2022

Tried to disable ssl sessions at all from nginx side with ssl_session_cache off and ssl_session_tickets off:

V16.15.0

  • 742 req/s
  • TLSSocket._start() - 12% total time
  • SecureContext.init() - 6-7% total time

v18.1.0

  • 102 req/s
  • setSession() - 15% total time
  • SecureContext.init() - 60-80% total time

v18.2.0

  • 390 req/s
  • setSession() - 32% total time
  • SecureContext.init() - 35% total time

@bnoordhuis
Copy link
Member

I spent some time investigating this and I can indeed confirm that a) it happens, and b) changes in openssl are responsible.

First off, this is v16.15.0 (recorded with perf(1) and --perf_basic_prof) at the 1% cut-off point:

+    3.79%  node     node                  [.] ssl_cipher_apply_rule
+    3.49%  node     node                  [.] sha512_block_data_order_avx2
+    2.83%  node     node                  [.] fe_mul
+    2.50%  node     libc.so.6             [.] _int_malloc
+    2.49%  node     node                  [.] x25519_fe64_sqr
+    2.03%  node     libc.so.6             [.] malloc
+    1.84%  node     libc.so.6             [.] __strncmp_avx2
+    1.70%  node     libc.so.6             [.] _int_free
+    1.30%  node     node                  [.] v8::internal::StringTable::Data::TryStringToIndexOrLookupExisting<unsigned char>
+    1.07%  node     node                  [.] v8::internal::Scavenger::ScavengeObject<v8::internal::FullHeapObjectSlot>
+    1.00%  node     libc.so.6             [.] cfree@GLIBC_2.2.5

And this is v18.2.0:

+    5.78%  node     node                  [.] do_name
+    5.35%  node     node                  [.] OPENSSL_LH_retrieve
+    4.88%  node     libc.so.6             [.] pthread_rwlock_unlock@@GLIBC_2.34
+    4.60%  node     libc.so.6             [.] pthread_rwlock_rdlock@GLIBC_2.2.5
+    3.35%  node     node                  [.] OPENSSL_LH_doall_arg
+    3.07%  node     libc.so.6             [.] malloc
+    2.96%  node     node                  [.] ossl_lh_strcasehash
+    2.26%  node     libc.so.6             [.] __strcmp_avx2
+    2.11%  node     libc.so.6             [.] _int_free
+    1.57%  node     libc.so.6             [.] __strcasecmp_l_avx
+    1.41%  node     libc.so.6             [.] _int_malloc
+    1.37%  node     node                  [.] ssl_cipher_apply_rule
+    1.31%  node     node                  [.] ossl_sa_doall_arg
+    1.27%  node     node                  [.] CRYPTO_strndup
+    1.17%  node     node                  [.] ossl_lib_ctx_get_data
+    1.12%  node     node                  [.] sha512_block_data_order_avx2
+    1.10%  node     node                  [.] OPENSSL_LH_strhash
+    1.04%  node     libc.so.6             [.] cfree@GLIBC_2.2.5

It's almost all openssl populating hash tables while holding a lock.

This is all coming from SecureContext.init() and setSession() (and internal function addCACerts()) as noted by @str1ke.

The most common call graph for do_name() looks like this:

     do_name                                                                                                                                                                   ▒
   - ossl_namemap_doall_names                                                                                                                                                  ▒
      - 3.90% evp_cipher_from_algorithm                                                                                                                                        ▒
           construct_evp_method                                                                                                                                                ▒
           ossl_method_construct_this                                                                                                                                          ▒
           algorithm_do_this                                                                                                                                                   ▒
           ossl_provider_doall_activated                                                                                                                                       ▒
           ossl_algorithm_do_all                                                                                                                                               ▒
           ossl_method_construct                                                                                                                                               ▒
           inner_evp_generic_fetch.constprop.3                                                                                                                                 ▒
           evp_generic_fetch                                                                                                                                                   ▒
           EVP_CIPHER_fetch                                                                                                                                                    ▒
           ssl_evp_cipher_fetch                                                                                                                                                ▒
           ssl_load_ciphers                                                                                                                                                    ▒
           SSL_CTX_new_ex                                                                                                                                                      ▒
         - node::crypto::SecureContext::Init 

And for OPENSSL_LH_retrieve():

-    5.35%  node     node                  [.] OPENSSL_LH_retrieve                                                                                                             ▒
   - OPENSSL_LH_retrieve                                                                                                                                                       ▒
      - 2.25% ossl_namemap_name2num                                                                                                                                            ▒
         - 1.97% OSSL_DECODER_is_a                                                                                                                                             ▒
            - 1.74% OSSL_DECODER_CTX_add_extra                                                                                                                                 ▒
                 OSSL_DECODER_CTX_new_for_pkey                                                                                                                                 ▒
                 x509_pubkey_ex_d2i_ex                                                                                                                                         ▒
                 asn1_item_embed_d2i                                                                                                                                           ▒
                 asn1_template_noexp_d2i                                                                                                                                       ▒
                 asn1_template_ex_d2i                                                                                                                                          ▒
                 asn1_item_embed_d2i                                                                                                                                           ▒
                 asn1_template_noexp_d2i                                                                                                                                       ▒
                 asn1_template_ex_d2i                                                                                                                                          ▒
               - asn1_item_embed_d2i                                                                                                                                           ▒
                  - 1.13% asn1_template_noexp_d2i                                                                                                                              ▒
                       asn1_template_ex_d2i                                                                                                                                    ▒
                       asn1_item_embed_d2i                                                                                                                                     ▒
                       ASN1_item_d2i                                                                                                                                           ▒
                       d2i_SSL_SESSION                                                                                                                                         ▒
                       node::crypto::GetTLSSession                                                                                                                             ▒
                     - node::crypto::TLSWrap::SetSession                                                                                                                       ▒
                          0.81% Builtins_CallApiCallback                                                                                                                       ▒
                  - 0.61% ASN1_item_d2i                                                                                                                                        ▒
                       d2i_X509_AUX                                                                                                                                            ▒
                       PEM_ASN1_read_bio                                                                                                                                       ▒
                     - node::crypto::SecureContext::AddCACert                                                                                                                  ▒
                        - 0.57% v8::internal::(anonymous namespace)::HandleApiCallHelper<false>                                                                                ▒
                             v8::internal::Builtin_HandleApiCall                                                                                                               ▒
                             Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_BuiltinExit                                                                                    ▒
                             Function:^ node:internal/tls/secure-context:56                                                                                                    ▒
                             Builtins_ArrayForEach        

I've been going over open performance issues in openssl's bug tracker but I don't see an exact match. openssl/openssl#17627 is the bucket issue. Another pair of eyes would be most welcome.

@nodejs/quic How do we confirm this isn't caused by the QUIC patch we carry?

@bnoordhuis bnoordhuis added confirmed-bug Issues with confirmed bugs. tls Issues and PRs related to the tls subsystem. labels May 20, 2022
@mcollina
Copy link
Member

I'm assume that https.get and fetch are using same openssl version, but performance 180 req/s vs 1500 req/s

fetch() will reuse the TLS session by default, while https.get() will establish a new one. You can likely simulate a similar setup using https.Agent().

@str1ke
Copy link
Author

str1ke commented May 20, 2022

@mcollina from my observations fetch() and https.get() would reuse TLS sessions by default. The difference is that by default fetch() would use keep-alive, but https.get() not. And it is not possible manage "connection" header with fetch().

To put fetch() and https.get() in similar conditions I had to disable keep-alive on nginx side with keepalive_timeout 0. In that conditions fetch() and https.get() perform at similar level (fetch slower by 10%).

To ensure that there is no keep-alive connections I added to nginx access logs information about connections, number of requests within connection, ssl sessions, ciphers and connection header. It looks like this

[20/May/2022:17:11:05 +0000] "GET / HTTP/1.1" 200 connection=504978 connection_requests=1 protocol=HTTP/1.1 connection=keep-alive ssl_session_id=f6eaf675369fa0a972188eb614b84cb3d53282f749d08cb09709ad70d0f8b7e5 ssl_cipher=ECDHE-RSA-AES128-GCM-SHA256 ssl_session_reused=r ssl_protocol=TLSv1.2
[20/May/2022:17:11:05 +0000] "GET / HTTP/1.1" 200 connection=504979 connection_requests=1 protocol=HTTP/1.1 connection=keep-alive ssl_session_id=f6eaf675369fa0a972188eb614b84cb3d53282f749d08cb09709ad70d0f8b7e5 ssl_cipher=ECDHE-RSA-AES128-GCM-SHA256 ssl_session_reused=r ssl_protocol=TLSv1.2

Both reuse tls session, and despite fetch sends keep-alive, nginx does not allow it and number of connections increased and always one request per connection.

@mhdawson
Copy link
Member

https://github.com/orgs/nodejs/teams/quic How do we confirm this isn't caused by the QUIC patch we carry?

@jasnell is there any dependency left on the QUIC patches. If I remember correctly you removed it and #38233 is the PR to add it back. If there is no dependency then we should be able to build from the OpenSSL source instead of the QUIC branch. That might be a way to confirm.

@OzySky
Copy link

OzySky commented Dec 17, 2022

Any update on this issue?

@bnoordhuis
Copy link
Member

I can't be 100% sure but preponderance of evidence suggests the QUIC patches aren't responsible because they don't touch that part of openssl.

In other words, it's likely an upstream issue. We don't have control over that so I'm going to close this. If someone wants to pursue this upstream, please do.

@bnoordhuis bnoordhuis closed this as not planned Won't fix, can't repro, duplicate, stale Dec 29, 2022
@paulmenzel
Copy link

Please excuse my ignorance, but is the theory, that upgrading OpenSSL from 1.1.1 to 3.x caused the issue? If so, could OpenSSL 1.1.1 still be bundled/used?

@bnoordhuis
Copy link
Member

@paulmenzel Yes, the upgrade to 3.x is the likely culprit. Building from source and linking dynamically against v1.1.1 should work; it's tested by node's CI matrix.

@krk
Copy link
Contributor

krk commented Dec 29, 2022

This could be related to openssl/openssl#15199, which is fixed in openssl/openssl#17881, which would be part of OpenSSL 3.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
confirmed-bug Issues with confirmed bugs. tls Issues and PRs related to the tls subsystem.
Projects
None yet
Development

No branches or pull requests

8 participants