
bug: after apisix has been running for a while, its communication with etcd starts to time out #7078

Closed
hansedong opened this issue May 19, 2022 · 13 comments

Comments

@hansedong
Contributor

hansedong commented May 19, 2022

Current Behavior

I have encountered a problem where apisix cannot communicate properly with etcd through the Admin API.

There are 3 etcd nodes in my environment. When I start apisix, everything is normal, and I can operate resources such as Routes and Upstreams through the apisix Admin API.

However, after a period of time (the exact time is uncertain, generally a few hours), operations via the apisix Admin API start to time out.

At this point, the following error will appear in the apisix log:

2022/05/19 10:03:03 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:08 [warn] 13650#13650: *6635827 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:09 [warn] 13650#13650: *6643178 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13655#13655: *6644402 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13650#13650: *6637242 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:32 [warn] 13650#13650: *6645920 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:33 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"

10.152.6.32:2379 is an etcd node.

At this point, requests I send to the apisix Admin API also get stuck:

[root@knode10-132-14-202 operation]# curl -v "http://adminapi-apisix.xxx.com:9180/apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80" -H 'X-API-KEY: 3f38ce3d332sffc6418d9351085fb61d'
* About to connect() to 10.132.15.138 port 9180 (#0)
*   Trying 10.132.15.138...
* Connected to 10.132.15.138 (10.132.15.138) port 9180 (#0)
> GET /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.132.15.138:9180
> Accept: */*
> X-API-KEY: 3f38ce3adf742fc6418d9351085fb61d
>

# Here it will get stuck, without any output

However, there is no problem operating etcd via etcdctl.

[root@knode10-152-6-32 operation]# ETCDCTL_API=3 /usr/local/bin/etcdctl --cacert=/etc/etcd/etcdSSL/ca.pem   --cert=/etc/etcd/etcdSSL/clientssl/apisix/apisix.pem --key=/etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --endpoints=https://10.152.6.32:2379 get --prefix /apisix/upstreams/ms-aos-xxxappg-gp-418241-80

ms-aos-xxxappg-gp-418241-80
{"update_time":1652925676,"labels":{"env-type":"staging","appcode":"xxxappg","platform":"bh","version":"10006289"},"create_time":1652754430,"type":"roundrobin","pass_host":"pass","nodes":{"10.131.110.124:80":10},"scheme":"http","hash_on":"vars","id":"ms-aos-xxxappg-gp-418241-80","desc":"upstream for appcode: xxxappg and version: 10006289","name":"ms-aos-xxxappg-gp-418241-80"}

This etcd node can be operated through etcdctl, so from the perspective of the etcd cluster, etcd is serving normally. I also ran the etcd watch command against the same etcd node, and it executed normally. So I guess there is something wrong with apisix's watch requests to etcd.
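For reference, the watch check I ran was roughly like the following (a sketch reusing the certificates and endpoint from the etcdctl command above; the prefix is just an example):

# watch a prefix; in another terminal, any change under it should be printed immediately
ETCDCTL_API=3 /usr/local/bin/etcdctl --cacert=/etc/etcd/etcdSSL/ca.pem \
  --cert=/etc/etcd/etcdSSL/clientssl/apisix/apisix.pem \
  --key=/etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem \
  --endpoints=https://10.152.6.32:2379 \
  watch --prefix /apisix/upstreams/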

Then, after I changed the log level of the etcd cluster from info to debug, I saw the following logs in etcd:

May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.156+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:456","msg":"failed to send watch control response to gRPC stream","error":"rpc error: code = Unavailable desc = transport is closing"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.158+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}

When apisix is not running, etcd does not produce the above error log. After apisix has been started for a period of time, the watch-related errors begin to appear, and their time range matches the errors in the apisix error.log.

My configuration file content is as follows:

...
  enable_resolv_search_opt: true  # enable search option in resolv.conf
  ssl:
    enable: true
    listen:                       # APISIX listening port in https.
      - 443
    #   - port: 9444
    #     enable_http2: true      # If not set, the default value is `false`.
    #   - ip: 127.0.0.3           # Specific IP, If not set, the default value is `0.0.0.0`.
    #     port: 9445
    #     enable_http2: true
    enable_http2: true            # Not recommend: This parameter should be set via the `listen`.
    # listen_port: 9443           # Not recommend: This parameter should be set via the `listen`.
    ssl_trusted_certificate: /etc/etcd/etcdSSL/ca.pem  # Specifies a file path with trusted CA certificates in the PEM format
...
etcd:
  host:                           # it's possible to define multiple etcd hosts addresses of the same etcd cluster.
    - "https://10.152.6.32:2379"   # multiple etcd address, if your etcd cluster enables TLS, please use https scheme,
    - "https://10.152.6.33:2379"   # e.g. https://127.0.0.1:2379.
    - "https://10.152.6.34:2379"

  prefix: /apisix                 # apisix configurations prefix
  timeout: 30                     # 30 seconds
  #resync_delay: 5                # when sync failed and a rest is needed, resync after the configured seconds plus 50% random jitter
  #health_check_timeout: 10       # etcd retry the unhealthy nodes after the configured seconds
  #user: root                     # root username for etcd
  #password: 5tHkHhYkjr6cQY       # root password for etcd
  tls:
    # To enable etcd client certificate you need to build APISIX-OpenResty, see
    # https://apisix.apache.org/docs/apisix/how-to-build/#step-6-build-openresty-for-apache-apisix
    cert: /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem    # path of certificate used by the etcd client
    key: /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem     # path of key used by the etcd client

    verify: false                  # whether to verify the etcd endpoint certificate when setup a TLS connection to etcd,
                                  # the default value is true, e.g. the certificate will be verified strictly.
    sni: etcdserver-apisix-staging.mfwdev.com                         # the SNI for etcd TLS requests. If missed, the host part of the URL will be used.

My etcd certificate is private, but I'm sure the certificate itself is fine; otherwise apisix wouldn't have been able to communicate with etcd properly at the beginning.

Also, when apisix has this problem communicating with etcd, running apisix init_etcd also gets stuck without any output.

What I want to ask is: how should I troubleshoot this kind of problem, and where might the cause lie?

Expected Behavior

No response

Error Logs

apisix error log:

2022/05/19 10:03:03 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:08 [warn] 13650#13650: *6635827 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:09 [warn] 13650#13650: *6643178 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:15 [warn] 13650#13650: *6640270 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13655#13655: *6644402 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:20 [warn] 13650#13650: *6637242 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] health_check.lua:90: report_failure(): update endpoint: https://10.152.6.32:2379 to unhealthy, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:27 [warn] 13650#13650: *6641630 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:32 [warn] 13650#13650: *6645920 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "PUT /apisix/admin/upstreams/ms-aos-xxxappg-gp-418241-80 HTTP/1.1", host: "10.132.63.69:9180"
2022/05/19 10:03:33 [warn] 13654#13654: *6638791 [lua] v3.lua:151: _request_uri(): https://10.152.6.32:2379: timeout. Retrying, client: 10.132.14.202, server: , request: "GET /apisix/admin/upstreams/ms-aos-xxxapp-staging-gp-418436-80 HTTP/1.1", host: "10.132.63.69:9180"

etcd error log:

May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.156+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.157+0800","caller":"v3rpc/watch.go:456","msg":"failed to send watch control response to gRPC stream","error":"rpc error: code = Unavailable desc = transport is closing"}
May 19 10:57:09 knode10-152-6-32 etcd[5078]: {"level":"debug","ts":"2022-05-19T10:57:09.158+0800","caller":"v3rpc/watch.go:191","msg":"failed to receive watch request from gRPC stream","error":"rpc error: code = Canceled desc = context canceled"}

Steps to Reproduce

I'm not sure how to reproduce this issue.

Environment

  • APISIX version (run apisix version): 2.13.1
  • Operating system (run uname -a): Linux knode10-132-15-138 4.14.105-19-0023 #1 SMP Mon Jan 10 17:53:54 CST 2022 x86_64 x86_64 x86_64 GNU/Linux
  • OpenResty / Nginx version (run openresty -V or nginx -V):
- nginx version: openresty/1.19.9.1
built by gcc 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)
built with OpenSSL 1.1.1n  15 Mar 2022
TLS SNI support enabled
configure arguments: --prefix=/usr/local/openresty/nginx --with-cc-opt='-O2 -DAPISIX_BASE_VER=1.19.9.1.5 -DNGX_LUA_ABORT_AT_PANIC -I/usr/local/openresty/zlib/include -I/usr/local/openresty/pcre/include -I/usr/local/openresty/openssl111/include' --add-module=../ngx_devel_kit-0.3.1 --add-module=../echo-nginx-module-0.62 --add-module=../xss-nginx-module-0.06 --add-module=../ngx_coolkit-0.2 --add-module=../set-misc-nginx-module-0.32 --add-module=../form-input-nginx-module-0.12 --add-module=../encrypted-session-nginx-module-0.08 --add-module=../srcache-nginx-module-0.32 --add-module=../ngx_lua-0.10.20 --add-module=../ngx_lua_upstream-0.07 --add-module=../headers-more-nginx-module-0.33 --add-module=../array-var-nginx-module-0.05 --add-module=../memc-nginx-module-0.19 --add-module=../redis2-nginx-module-0.15 --add-module=../redis-nginx-module-0.3.7 --add-module=../ngx_stream_lua-0.0.10 --with-ld-opt='-Wl,-rpath,/usr/local/openresty/luajit/lib -Wl,-rpath,/usr/local/openresty/wasmtime-c-api/lib -L/usr/local/openresty/zlib/lib -L/usr/local/openresty/pcre/lib -L/usr/local/openresty/openssl111/lib -Wl,-rpath,/usr/local/openresty/zlib/lib:/usr/local/openresty/pcre/lib:/usr/local/openresty/openssl111/lib' --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../mod_dubbo --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../ngx_multi_upstream_module --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../apisix-nginx-module --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../apisix-nginx-module/src/stream --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../wasm-nginx-module --add-module=/tmp/tmp.HOhTx9UT8p/openresty-1.19.9.1/../lua-var-nginx-module --with-poll_module --with-pcre-jit --with-stream --with-stream_ssl_module --with-stream_ssl_preread_module --with-http_v2_module --without-mail_pop3_module --without-mail_imap_module --without-mail_smtp_module --with-http_stub_status_module --with-http_realip_module --with-http_addition_module --with-http_auth_request_module --with-http_secure_link_module --with-http_random_index_module --with-http_gzip_static_module --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-threads --with-compat --with-stream --with-http_ssl_module
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info): {"etcdserver":"3.5.4","etcdcluster":"3.5.0"}
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):
@tokers
Contributor

tokers commented May 19, 2022

failed to receive watch request from gRPC stream

This error is reported by etcd when it cannot successfully receive watch requests from the client (maybe the client timed out and the connection was closed).

Do you have any monitoring data about the networking between APISIX and etcd? Metrics like network saturation, errors, and bandwidth utilization would be helpful.

@hansedong
Contributor Author

This error is reported by etcd when it cannot successfully receive watch requests from the client (maybe the client timed out and the connection was closed).

I considered this possibility; however, etcdctl works smoothly during the period when apisix cannot communicate with etcd. Also, when I restarted apisix, the problem disappeared immediately and then reappeared after a while.

Do you have any monitoring data about the networking between APISIX and etcd? Metrics like network saturation, errors, and bandwidth utilization would be helpful.

Good idea, I'll see if I can spot something through the monitoring system.

@tokers
Contributor

tokers commented May 19, 2022

I considered this possibility; however, etcdctl works smoothly during the period when apisix cannot communicate with etcd. Also, when I restarted apisix, the problem disappeared immediately and then reappeared after a while.

Note that etcdctl and APISIX do not use etcd in the same way: etcdctl talks to the gRPC service directly, while APISIX relies on the etcd gRPC gateway. APISIX sends REST requests, and all REST requests are converted to gRPC streams by the gRPC gateway (which is embedded in the etcd server).
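For example, you can exercise the same REST-to-gRPC path yourself by calling the gRPC gateway directly with curl. A rough sketch (keys must be base64-encoded, so they are generated inline here; the TLS options mirror your etcdctl command):

# range request over the gRPC gateway, equivalent to an etcdctl prefix get on /apisix/
curl --cacert /etc/etcd/etcdSSL/ca.pem \
  --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem \
  --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem \
  -X POST https://10.152.6.32:2379/v3/kv/range \
  -d "{\"key\": \"$(echo -n /apisix/ | base64)\", \"range_end\": \"$(echo -n /apisix0 | base64)\"}"

If the gateway path is the culprit, this request should hang in the same way the Admin API does while etcdctl keeps working.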

@hansedong
Contributor Author

Note that etcdctl and APISIX do not use etcd in the same way: etcdctl talks to the gRPC service directly, while APISIX relies on the etcd gRPC gateway. APISIX sends REST requests, and all REST requests are converted to gRPC streams by the gRPC gateway (which is embedded in the etcd server).

Thanks for the reminder, I understand that. I just want to point out that the root cause of this problem may not necessarily be in etcd.

@tokers
Contributor

tokers commented May 20, 2022

Note that etcdctl and APISIX do not use etcd in the same way: etcdctl talks to the gRPC service directly, while APISIX relies on the etcd gRPC gateway. APISIX sends REST requests, and all REST requests are converted to gRPC streams by the gRPC gateway (which is embedded in the etcd server).

Thanks for the reminder, I understand that. I just want to point out that the root cause of this problem may not necessarily be in etcd.

Hard to judge. If you have captured some network packets, would you like to share them? They may be helpful for troubleshooting.

@hansedong
Contributor Author

hansedong commented May 20, 2022

@tokers It is striking that the number of connections from apisix to etcd in the ESTABLISHED state is very high and keeps rising.

[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:05:03 CST 2022
631
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:05:07 CST 2022
645
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:05:11 CST 2022
660
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:06:48 CST 2022
754
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:07:16 CST 2022
782

5078 is the pid of etcd, and 10.132.15.138 is the IP of apisix.
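For completeness, the count can also be sampled continuously instead of rerunning netstat by hand (a simple sketch based on the commands above):

# print the ESTABLISHED apisix -> etcd connection count every 5 seconds
while true; do
    date
    netstat -apn | grep 5078 | grep 10.132.15.138 | grep -c ESTABLISHED
    sleep 5
done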

@tokers
Contributor

tokers commented May 20, 2022

@tokers It is striking that the number of connections from apisix to etcd in the ESTABLISHED state is very high and keeps rising.

[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:05:03 CST 2022
631
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:05:07 CST 2022
645
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:05:11 CST 2022
660
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:06:48 CST 2022
754
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:07:16 CST 2022
782

5078 is the pid of etcd, and 10.132.15.138 is the IP of apisix.

How many APISIX worker processes did you create?

@hansedong
Contributor Author

How many APISIX worker processes did you create?

[root@knode10-132-15-138 operation]# ps -ef |grep nginx
root     13648     1  0 May18 ?        00:00:00 nginx: master process openresty -p /usr/local/apisix -c /usr/local/apisix/conf/nginx.conf
nobody   13649 13648  0 May18 ?        00:05:46 nginx: worker process
nobody   13650 13648  0 May18 ?        00:06:22 nginx: worker process
nobody   13651 13648  0 May18 ?        00:05:50 nginx: worker process
nobody   13652 13648  0 May18 ?        00:05:46 nginx: worker process
nobody   13653 13648  0 May18 ?        00:05:46 nginx: worker process
nobody   13654 13648  0 May18 ?        00:05:52 nginx: worker process
nobody   13655 13648  0 May18 ?        00:06:06 nginx: worker process
nobody   13656 13648  0 May18 ?        00:06:03 nginx: worker process
nobody   13657 13648  0 May18 ?        00:00:00 nginx: cache manager process
root     13659 13648  0 May18 ?        00:05:43 nginx: privileged agent process

My host has 8 CPU cores, so the number of nginx workers is 8. My apisix configuration is as follows, and the number of workers is as expected:

nginx_config:                     # config for render the template to generate nginx.conf
  #user: root                     # specifies the execution user of the worker process.
                                  # the "user" directive makes sense only if the master process runs with super-user privileges.
                                  # if you're not root user,the default is current user.
  error_log: /DATA1/apisix/logs/error.log
  error_log_level:  warn          # warn,error
  worker_processes: auto          # if you want use multiple cores in container, you can inject the number of cpu as environment variable "APISIX_WORKER_PROCESSES"

@tzssangglass
Member

From experience, this is a problem with etcd and I suggest you go to the etcd repository for help.

@hansedong
Contributor Author

@tzssangglass @tokers

Thanks for the reply. I have determined that the cause of the problem is not in apisix.

When I use curl on the command line to call the grpc-gateway API directly, the request also blocks, so the cause of the problem is indeed in etcd. I plan to close this issue later.

Thanks again everyone for the replies, apisix is a great product.

@Wang-Kai

@tzssangglass @tokers

Thanks for the reply. I have determined that the cause of the problem is not in apisix.

When I use curl on the command line to call the grpc-gateway API directly, the request also blocks, so the cause of the problem is indeed in etcd. I plan to close this issue later.

Thanks again everyone for the replies, apisix is a great product.

So, how did you finally solve the problem? My issue is the same as yours.

@hansedong

@hansedong
Contributor Author

@Wang-Kai

The root cause of this problem is a bug in etcd: the number of HTTP/2-based HTTPS connections (concurrent streams) that etcd accepts is limited.

The official 3.5.5 version has not yet been released, but the fix has landed in the 3.4 branch, and a new version, 3.4.20, has been released.

For the 3.5 branch, you can clone the etcd source code and compile the release-3.5 branch directly (this branch already contains the fix for the HTTP/2 connection limit).

The way to recompile ETCD is as follows:

git checkout release-3.5
make GOOS=linux GOARCH=amd64

For more information, check out this issue: etcd-io/etcd#14169
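As far as I know, the fixed releases also expose a server flag to raise this limit explicitly, so recompiling is not the only option once you are on 3.4.20+ / 3.5.5+ (check etcd --help on your build to confirm the flag is available):

# raise the number of concurrent streams each client connection may open
etcd --max-concurrent-streams=4294967295 ...your existing flags...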

@Wang-Kai

Wang-Kai commented Sep 15, 2022

@Wang-Kai

The root cause of this problem is a bug in etcd: the number of HTTP/2-based HTTPS connections (concurrent streams) that etcd accepts is limited.

The official 3.5.5 version has not yet been released, but the fix has landed in the 3.4 branch, and a new version, 3.4.20, has been released.

For the 3.5 branch, you can clone the etcd source code and compile the release-3.5 branch directly (this branch already contains the fix for the HTTP/2 connection limit).

The way to recompile ETCD is as follows:

git checkout release-3.5
make GOOS=linux GOARCH=amd64

For more information, check out this issue: etcd-io/etcd#14169

@hansedong Your resolution works perfectly for me. Thanks so much!
