
bug: after apisix has been running for a while, its communication with etcd starts to time out #7078

Closed
hansedong opened this issue May 19, 2022 · 13 comments

Comments

@hansedong
Contributor

hansedong commented May 19, 2022

Current Behavior

I have encountered a problem where apisix cannot communicate properly with etcd through the Admin API.

There are 3 etcd nodes in my environment. When I start apisix, everything is normal, and I can operate resources such as Routes and Upstreams through the apisix Admin API.

However, after a period of time (the exact time is uncertain, generally a few hours), operations via the apisix Admin API start to time out.

At this point, the following error will appear in the apisix log:

2022/05/19 10:03:03 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:08 [warn] 13650#13650: *6635827 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:09 [warn] 13650#13650: *6643178 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13655#13655: *6644402 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13650#13650: *6637242 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:32 [warn] 13650#13650: *6645920 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:33 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"

10.152.6.32:2379 is an etcd node.

At this point, requests I send to the apisix Admin API also get stuck:

[root@knode10-132-14-202 operation]# curl -v "http://adminapi-apisix.xxx.com:9180/apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80" -H 'X-API-KEY: 3f38ce3d332sffc6418d9351085fb61d'
* About to connect() to 10.132.15.138 port 9180 (#0)
*   Trying 10.132.15.138...
* Connected to 10.132.15.138 (10.132.15.138) port 9180 (#0)
> GET /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.132.15.138:9180
> Accept: */*
> X-API-KEY: 3f38ce3adf742fc6418d9351085fb61d
>

# Here it will get stuck, without any output

However, there is no problem operating etcd via etcdctl.

[root@knode10-152-6-32 operation]# ETCDCTL_API=3 /usr/local/bin/etcdctl --cacert=/etc/etcd/etcdSSL/ca.pem   --cert=/etc/etcd/etcdSSL/clientssl/apisix/apisix.pem --key=/etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --endpoints=https://10.152.6.32:2379 get --prefix /apisix/upstreams/ms-aos-xxxappg-gp-418241-80

ms-aos-xxxappg-gp-418241-80
{"update_time":1652925676,"labels":{"env-type":"staging","appcode":"xxxappg","platform":"bh","version":"10006289"},"create_time":1652754430,"type":"roundrobin","pass_host":"pass","nodes":{"10.131.110.124:80":10},"scheme":"http","hash_on":"vars","id":"ms-aos-xxxappg-gp-418241-80","desc":"upstream for appcode: xxxappg and version: 10006289","name":"ms-aos-xxxappg-gp-418241-80"}

This etcd node can be operated through etcdctl, so from the perspective of the etcd cluster, etcd is serving normally. I also ran the etcd watch command against the same etcd node, and it executed normally. So I guess there is something wrong with apisix's watch requests to etcd.
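For reference, the watch check I ran was roughly like the following (a sketch reusing the certificates and endpoint from the etcdctl command above; the prefix is just an example):

# watch a prefix; in another terminal, any change under it should be printed immediately
ETCDCTL_API=3 /usr/local/bin/etcdctl --cacert=/etc/etcd/etcdSSL/ca.pem \
  --cert=/etc/etcd/etcdSSL/clientssl/apisix/apisix.pem \
  --key=/etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem \
  --endpoints=https://10.152.6.32:2379 \
  watch --prefix /apisix/upstreams/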

Then, after I changed the log level of the etcd cluster from info to debug, I saw the following logs in etcd:

May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.156+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:456","msg":"failed to send watch control response to gRPC stream","error":"rpc error: code = Unavailable desc = transport is closing"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.158+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}

When apisix is not running, etcd does not produce the above error log. After apisix has been started for a period of time, the watch-related errors begin to appear, and their time range matches the errors in the apisix error.log.

My configuration file content is as follows:

...
  enable_resolv_search_opt: true  # enable search option in resolv.conf
  ssl:
    enable: true
    listen:                       # APISIX listening port in https.
      - 443
    #   - port: 9444
    #     enable_http2: true      # If not set, the default value is `false`.
    #   - ip: 127.0.0.3           # Specific IP, If not set, the default value is `0.0.0.0`.
    #     port: 9445
    #     enable_http2: true
    enable_http2: true            # Not recommend: This parameter should be set via the `listen`.
    # listen_port: 9443           # Not recommend: This parameter should be set via the `listen`.
    ssl_trusted_certificate: /etc/etcd/etcdSSL/ca.pem  # Specifies a file path with trusted CA certificates in the PEM format
...
etcd:
  host:                           # it's possible to define multiple etcd hosts addresses of the same etcd cluster.
    - "https://10.152.6.32:2379"   # multiple etcd address, if your etcd cluster enables TLS, please use https scheme,
    - "https://10.152.6.33:2379"   # e.g. https://127.0.0.1:2379.
    - "https://10.152.6.34:2379"

  prefix: /apisix                 # apisix configurations prefix
  timeout: 30                     # 30 seconds
  #resync_delay: 5                # when sync failed and a rest is needed, resync after the configured seconds plus 50% random jitter
  #health_check_timeout: 10       # etcd retry the unhealthy nodes after the configured seconds
  #user: root                     # root username for etcd
  #password: 5tHkHhYkjr6cQY       # root password for etcd
  tls:
    # To enable etcd client certificate you need to build APISIX-OpenResty, see
    # https://apisix.apache.org/docs/apisix/how-to-build/#step-6-build-openresty-for-apache-apisix
    cert: /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem    # path of certificate used by the etcd client
    key: /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem     # path of key used by the etcd client

    verify: false                  # whether to verify the etcd endpoint certificate when setup a TLS connection to etcd,
                                  # the default value is true, e.g. the certificate will be verified strictly.
    sni: etcdserver-apisix-staging.mfwdev.com                         # the SNI for etcd TLS requests. If missed, the host part of the URL will be used.

My etcd certificate is private, but I'm sure the certificate itself is fine; otherwise apisix wouldn't have been able to communicate with etcd properly at the beginning.

Also, when apisix has this problem communicating with etcd, running apisix init_etcd also gets stuck without any output.

What I want to ask is: how should I troubleshoot this kind of problem, and where might the cause lie?

Expected Behavior

No response

Error Logs

apisix error log:

2022/05/19 10:03:03 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:08 [warn] 13650#13650: *6635827 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:09 [warn] 13650#13650: *6643178 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13655#13655: *6644402 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13650#13650: *6637242 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:32 [warn] 13650#13650: *6645920 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:33 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"

etcd error log:

May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.156+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:456","msg":"failed to send watch control response to gRPC stream","error":"rpc error: code = Unavailable desc = transport is closing"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.158+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}

Steps to Reproduce

I'm not sure how to reproduce this issue.

Environment

  • APISIX version (run apisix version): 2.13.1
  • Operating system (run uname -a): Linux knode10-132-15-138 4.14.105-19-0023 #1 SMP Mon Jan 10 17:53:54 CST 2022 x86_64 x86_64 x86_64 GNU/Linux
  • OpenResty / Nginx version (run openresty -V or nginx -V):
- nginx version: openresty/1.19.9.1
built by gcc 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)
built with OpenSSL 1.1.1n  15 Mar 2022
TLS SNI support enabled
configure arguments: --prefix=/usr/local/openresty/nginx --with-cc-opt='-O2 -DAPISIX_BASE_VER=1.19.9.1.5 -DNGX_LUA_ABORT_AT_PANIC -I/usr/local/openresty/zlib/include -I/usr/local/openresty/pcre/include -I/usr/local/openresty/openssl111/include' --add-module=../ngx_devel_kit-0.3.1 --add-module=../echo-nginx-module-0.62 --add-module=../xss-nginx-module-0.06 --add-module=../ngx_coolkit-0.2 --add-module=../set-misc-nginx-module-0.32 --add-module=../form-input-nginx-module-0.12 --add-module=../encrypted-session-nginx-module-0.08 --add-module=../srcache-nginx-module-0.32 --add-module=../ngx_lua-0.10.20 --add-module=../ngx_lua_upstream-0.07 --add-module=../headers-more-nginx-module-0.33 --add-module=../array-var-nginx-module-0.05 --add-module=../memc-nginx-module-0.19 --add-module=../redis2-nginx-module-0.15 --add-module=../redis-nginx-module-0.3.7 --add-module=../ngx_stream_lua-0.0.10 --with-ld-opt='-Wl,-rpath,/usr/local/openresty/luajit/lib -Wl,-rpath,/usr/local/openresty/wasmtime-c-api/lib -L/usr/local/openresty/zlib/lib -L/usr/local/openresty/pcre/lib -L/usr/local/openresty/openssl111/lib -Wl,-rpath,/usr/local/openresty/zlib/lib:/usr/local/openresty/pcre/lib:/usr/local/openresty/openssl111/lib' --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../mod_dubbo --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../ngx_multi_upstream_module --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../apisix-nginx-module --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../apisix-nginx-module/src/stream --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../wasm-nginx-module --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../lua-var-nginx-module --with-poll_module --with-pcre-jit --with-stream --with-stream_ssl_module --with-stream_ssl_preread_module --with-http_v2_module --without-mail_pop3_module --without-mail_imap_module --without-mail_smtp_module --with-http_stub_status_module --with-http_realip_module --with-http_addition_module --with-http_auth_request_module --with-http_secure_link_module --with-http_random_index_module --with-http_gzip_static_module --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-threads --with-compat --with-stream --with-http_ssl_module
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info): {"etcdserver":"3.5.4","etcdcluster":"3.5.0"}
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):
@tokers
Contributor

tokers commented May 19, 2022

failed to receive watch request from gRPC stream

This error is reported by etcd when it cannot successfully receive watch requests from the client (maybe the client timed out and the connection was closed).

Do you have any monitoring data about the networking between APISIX and etcd? Metrics like network saturation, errors, and bandwidth utilization would be helpful.

@hansedong
Contributor Author

This error is reported by etcd when it cannot successfully receive watch requests from the client (maybe the client timed out and the connection was closed).

I considered this possibility; however, etcdctl works smoothly during the period when apisix cannot communicate with etcd. Also, when I restarted apisix, the problem disappeared immediately and then reappeared after a while.

Do you have any monitoring data about the networking between APISIX and etcd? Metrics like network saturation, errors, and bandwidth utilization would be helpful.

Good idea, I'll see if I can spot something through the monitoring system.

@tokers
Contributor

tokers commented May 19, 2022

I considered this possibility; however, etcdctl works smoothly during the period when apisix cannot communicate with etcd. Also, when I restarted apisix, the problem disappeared immediately and then reappeared after a while.

Note that etcdctl and APISIX do not use etcd in the same way: etcdctl talks to the gRPC service directly, while APISIX relies on the etcd gRPC gateway. APISIX sends REST requests, and all REST requests are converted to gRPC streams by the gRPC gateway (which is embedded in the etcd server).
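For example, you can exercise the same REST-to-gRPC path yourself by calling the gRPC gateway directly with curl. A rough sketch (keys must be base64-encoded, so they are generated inline here; the TLS options mirror your etcdctl command):

# range request over the gRPC gateway, equivalent to an etcdctl prefix get on /apisix/
curl --cacert /etc/etcd/etcdSSL/ca.pem \
  --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem \
  --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem \
  -X POST https://10.152.6.32:2379/v3/kv/range \
  -d "{\"key\": \"$(echo -n /apisix/ | base64)\", \"range_end\": \"$(echo -n /apisix0 | base64)\"}"

If the gateway path is the culprit, this request should hang in the same way the Admin API does while etcdctl keeps working.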

@hansedong
Contributor Author

Note that etcdctl and APISIX do not use etcd in the same way: etcdctl talks to the gRPC service directly, while APISIX relies on the etcd gRPC gateway. APISIX sends REST requests, and all REST requests are converted to gRPC streams by the gRPC gateway (which is embedded in the etcd server).

Thanks for the reminder, I understand that. I just want to point out that the root cause of this problem may not necessarily be in etcd.

@tokers
Contributor

tokers commented May 20, 2022

Note that etcdctl and APISIX do not use etcd in the same way: etcdctl talks to the gRPC service directly, while APISIX relies on the etcd gRPC gateway. APISIX sends REST requests, and all REST requests are converted to gRPC streams by the gRPC gateway (which is embedded in the etcd server).

Thanks for the reminder, I understand that. I just want to point out that the root cause of this problem may not necessarily be in etcd.

Hard to judge. If you have captured some network packets, would you like to share them? They may be helpful for troubleshooting.

@hansedong
Contributor Author

hansedong commented May 20, 2022

@tokers It is striking that the number of connections from apisix to etcd in the ESTABLISHED state is very high and keeps rising.

[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:05:03 CST 2022
631
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:05:07 CST 2022
645
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:05:11 CST 2022
660
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:06:48 CST 2022
754
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:07:16 CST 2022
782

5078 is the pid of etcd, and 10.132.15.138 is the IP of apisix.
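For completeness, the count can also be sampled continuously instead of rerunning netstat by hand (a simple sketch based on the commands above):

# print the ESTABLISHED apisix -> etcd connection count every 5 seconds
while true; do
    date
    netstat -apn | grep 5078 | grep 10.132.15.138 | grep -c ESTABLISHED
    sleep 5
done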

@tokers
Contributor

tokers commented May 20, 2022

@tokers It is striking that the number of connections from apisix to etcd in the ESTABLISHED state is very high and keeps rising.

[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:05:03 CST 2022
631
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:05:07 CST 2022
645
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:05:11 CST 2022
660
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:06:48 CST 2022
754
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:07:16 CST 2022
782

5078 is the pid of etcd, and 10.132.15.138 is the IP of apisix.

How many APISIX worker processes did you create?

@hansedong
Contributor Author

How many APISIX worker processes did you create?

[root@knode10-132-15-138 operation]# ps -ef |grep nginx
root     13648     1  0 May18 ?        00:00:00 nginx: master process openresty -p /usr/local/apisix -c /usr/local/apisix/conf/nginx.conf
nobody   13649 13648  0 May18 ?        00:05:46 nginx: worker process
nobody   13650 13648  0 May18 ?        00:06:22 nginx: worker process
nobody   13651 13648  0 May18 ?        00:05:50 nginx: worker process
nobody   13652 13648  0 May18 ?        00:05:46 nginx: worker process
nobody   13653 13648  0 May18 ?        00:05:46 nginx: worker process
nobody   13654 13648  0 May18 ?        00:05:52 nginx: worker process
nobody   13655 13648  0 May18 ?        00:06:06 nginx: worker process
nobody   13656 13648  0 May18 ?        00:06:03 nginx: worker process
nobody   13657 13648  0 May18 ?        00:00:00 nginx: cache manager process
root     13659 13648  0 May18 ?        00:05:43 nginx: privileged agent process

My host has 8 CPU cores, so the number of nginx workers is 8. My apisix configuration is as follows, and the number of workers is as expected:

nginx_config:                     # config for render the template to generate nginx.conf
  #user: root                     # specifies the execution user of the worker process.
                                  # the "user" directive makes sense only if the master process runs with super-user privileges.
                                  # if you're not root user,the default is current user.
  error_log: /DATA1/apisix/logs/error.log
  error_log_level:  warn          # warn,error
  worker_processes: auto          # if you want use multiple cores in container, you can inject the number of cpu as environment variable "APISIX_WORKER_PROCESSES"

@tzssangglass
Member

From experience, this is a problem with etcd and I suggest you go to the etcd repository for help.

@hansedong
Contributor Author

@tzssangglass @tokers

Thanks for the reply. I have determined that the cause of the problem is not in apisix.

When I use curl on the command line to call the grpc-gateway API directly, the request also blocks, so the cause of the problem is indeed in etcd. I plan to close this issue later.

Thanks again everyone for the replies, apisix is a great product.

@Wang-Kai

@tzssangglass @tokers

Thanks for the reply. I have determined that the cause of the problem is not in apisix.

When I use curl on the command line to call the grpc-gateway API directly, the request also blocks, so the cause of the problem is indeed in etcd. I plan to close this issue later.

Thanks again everyone for the replies, apisix is a great product.

So, how did you finally solve the problem? My issue is the same as yours.

@hansedong

@hansedong
Contributor Author

@Wang-Kai

The root cause of this problem is a bug in etcd: the number of HTTP/2-based HTTPS connections (concurrent streams) that etcd accepts is limited.

The official 3.5.5 version has not yet been released, but the fix has landed in the 3.4 branch, and a new version, 3.4.20, has been released.

For the 3.5 branch, you can clone the etcd source code and compile the release-3.5 branch directly (this branch already contains the fix for the HTTP/2 connection limit).

The way to recompile ETCD is as follows:

git checkout release-3.5
make GOOS=linux GOARCH=amd64

For more information, check out this issue: etcd-io/etcd#14169
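As far as I know, the fixed releases also expose a server flag to raise this limit explicitly, so recompiling is not the only option once you are on 3.4.20+ / 3.5.5+ (check etcd --help on your build to confirm the flag is available):

# raise the number of concurrent streams each client connection may open
etcd --max-concurrent-streams=4294967295 ...your existing flags...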

@Wang-Kai

Wang-Kai commented Sep 15, 2022

@Wang-Kai

The root cause of this problem is a bug in etcd: the number of HTTP/2-based HTTPS connections (concurrent streams) that etcd accepts is limited.

The official 3.5.5 version has not yet been released, but the fix has landed in the 3.4 branch, and a new version, 3.4.20, has been released.

For the 3.5 branch, you can clone the etcd source code and compile the release-3.5 branch directly (this branch already contains the fix for the HTTP/2 connection limit).

The way to recompile ETCD is as follows:

git checkout release-3.5
make GOOS=linux GOARCH=amd64

For more information, check out this issue: etcd-io/etcd#14169

@hansedong Your resolution works perfectly for me. Thanks so much!
