feat: enable etcd health-check #4191

Yiyiyimu · 2021-05-06T23:41:59Z

What this PR does / why we need it:

fix #3673
fix #3937.

Enable the health check provided by lua-resty-etcd, so that:

- When one etcd node is disconnected, apisix tries to connect to another node and keeps working normally, while the health check keeps printing warning logs about the disconnected etcd node.
- When all etcd nodes are disconnected, apisix tries to reconnect to etcd with a cyclic backoff interval of 2s, 2s, 4s, ..., 1024s, 2s, 4s, ..., so the error log is not flooded while all etcd nodes are down.
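The cyclic backoff described above can be sketched in plain Lua. This is only an illustration, not the code in `apisix/core/config_etcd.lua`: the helper name `new_cyclic_backoff` and its parameters are hypothetical, and the sketch does not model the repeated initial 2s listed in the description.

```lua
-- Hypothetical helper for illustration only.
-- Returns a function that yields the next wait time in seconds:
-- base, base * factor, ..., up to max_interval, then starts over from base.
local function new_cyclic_backoff(base, factor, max_interval)
    local current = base
    return function()
        local wait = current
        current = current * factor
        if current > max_interval then
            -- wrap around instead of growing forever
            current = base
        end
        return wait
    end
end

-- Collect the first 12 intervals of a 2s..1024s cycle.
local next_wait = new_cyclic_backoff(2, 2, 1024)
local intervals = {}
for i = 1, 12 do
    intervals[i] = next_wait()
end

print(table.concat(intervals, ", "))
-- prints: 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2, 4
```

In a real worker the returned value would be passed to a sleep call between reconnect attempts; the sketch only computes the intervals.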

TODO:

- For etcd nodes deployed in Kubernetes, the nodes sit behind the same domain, so we cannot tell whether one of them is tainted, as discussed in "[discuss]: enable etcd health check" #3692.
- Add tests. Since the current health check does not work for Kubernetes, this needs a new CI pipeline for multiple etcd nodes.

Pre-submission checklist:

Did you explain what problem does this PR solve? Or what new features have been added?
Have you added corresponding test cases?
Have you modified the corresponding document?
Is this PR backward compatible? If it is not backward compatible, please discuss on the mailing list first

juzhiyuan · 2021-05-10T02:16:32Z

We need another fix hh

Yiyiyimu · 2021-05-10T21:12:07Z

Things get a bit weird right now 😕

Before I worked on this fix, I tested that killing one node of an etcd cluster would make apisix fail, as the related issues describe. However, when I test this scenario again on the master branch now, apisix is not affected by the stopped etcd node and runs normally. I checked the recent commits, and none of them seems to have changed this behavior.

Could someone help me reproduce the problem again?

I tested by running apisix and etcd in Kubernetes and deleting one etcd node: apisix runs normally on the master branch. So I don't know yet whether anything is actually wrong.

Yiyiyimu · 2021-05-10T21:31:52Z

Waiting for lua-resty-etcd new release: api7/lua-resty-etcd#129

Yiyiyimu · 2021-06-21T23:19:25Z

Waiting for api7/lua-resty-etcd#131

Done

tokers · 2021-06-29T05:01:14Z

apisix/core/config_etcd.lua

+                if string.find(err, err_etcd_unhealthy_all) then
+                    local reconnected = false
+                    while err and not reconnected do
+                        local backoff_duration, backoff_factor, backoff_step = 1, 2, 10


Is a step of 10 too large? 1024 seconds is too long.

Yiyiyimu · 2021-06-29

I'm not so familiar with the production environment. Do you have a recommendation, like 8, which is around 4 minutes at most? @tokers

tokers · 2021-06-30

Maybe 6 is enough. Also, since we are in a timer, keeping an Nginx timer alive for a long time is not good practice, as it might cause a memory leak.

Yiyiyimu · 2021-06-30

Thanks for the suggestion!
Changed to use an outside counter up to 32 to avoid keeping the nginx timer alive for too long. PTAL @tokers
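To make the numbers in this thread concrete, here is a rough plain-Lua sketch. It assumes the interval starts at 1 second and doubles on every step, as in the quoted diff line; the names `max_interval` and `next_backoff` and the `state` table are hypothetical, and the initial value of 2 in the second part is an assumption rather than the merged implementation.

```lua
-- Largest single wait after `steps` multiplications, starting from 1 second.
local function max_interval(factor, steps)
    local duration = 1
    for _ = 1, steps do
        duration = duration * factor
    end
    return duration
end

-- Step counts discussed in the review: 10, 8 and 6.
for _, steps in ipairs({10, 8, 6}) do
    print(steps, max_interval(2, steps))
end
-- 10 steps give 1024s, 8 steps give 256s (a bit over 4 minutes), 6 steps give 64s

-- Hypothetical "outside counter": the backoff state lives outside the timer
-- callback, so each timer run waits at most once and then returns, instead of
-- looping inside one long-lived timer.
local state = { initial = 2, next_wait = 2, factor = 2, cap = 32 }

local function next_backoff(s)
    local wait = s.next_wait
    s.next_wait = wait * s.factor
    if s.next_wait > s.cap then
        -- start the cycle again once the cap is passed
        s.next_wait = s.initial
    end
    return wait
end

local waits = {}
for i = 1, 6 do
    waits[i] = next_backoff(state)
end
print(table.concat(waits, ", "))
```

The sketch only returns the wait values; in OpenResty the caller would sleep for that long and let the timer callback finish before the next attempt.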

Yiyiyimu added 7 commits May 6, 2021 19:32

fix: enable etcd health-check, so one etcd node failure would not make apisix fail

5b9fc40

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

Merge branch 'master' of https://github.com/apache/incubator-apisix into fix/etcd-retry

83a1a31

fix lint

c42c307

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix: move init health check away from config_etcd init, to make it works in stream mode

eede91f

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix lint

803389f

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

add test

feccb69

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix typo

21fae75

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

Yiyiyimu changed the title ~~fix: enable etcd health-check, so one etcd node failure would not make apisix fail~~ fix: enable etcd health-check May 10, 2021

Yiyiyimu marked this pull request as ready for review May 10, 2021 02:04

Yiyiyimu added 7 commits May 9, 2021 22:25

turn health_check init error log level to warn, so stream init would not been blocked

1c32984

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

compatible with etcd return format

3c03ffe

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix lint

e91d5af

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix privilige

acd2528

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix docker compose file path

78790e1

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix docker compose command

f598b3c

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix docker compose command

4f794de

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

Yiyiyimu requested a review from spacewander May 10, 2021 21:34

spacewander reviewed May 11, 2021

Yiyiyimu mentioned this pull request May 17, 2021

Billboard: all chaos test to do ( welcome new ideas!😆) #3449

Closed

10 tasks

Yiyiyimu added 4 commits June 2, 2021 16:42

Merge branch 'master' of https://github.com/apache/incubator-apisix into fix/etcd-retry

3cf74c0

forcely generate new etcd client when needed

d3954fc

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

test: change to kill etcd-1 since new etcd client would use it as first etcd endpoint

5bc39f6

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

add more logs

30d9146

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

tokers reviewed Jun 4, 2021

spacewander reviewed Jun 4, 2021

spacewander mentioned this pull request Jun 11, 2021

bug: etcd cluster setting,one node down,cannot connect to other healthy one #4414

Closed

no need to get new etcd client

742c543

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

move one-node-down retry to lua-resty-etcd

f1790f0

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

Yiyiyimu added 4 commits June 21, 2021 19:27

fix lint

14f6006

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

update etcd to v1.5.3

49b2121

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

make etcd health check timeout configurable

36561c1

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

test both the first and second etcd endpoint due to the original endpoint choose problem in etcd (https://github.com/api7/lua-resty-etcd/pull/131\#discussion_r655804238)

aefe2d6

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

spacewander reviewed Jun 25, 2021

fix test

bb577f6

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

spacewander changed the title ~~fix: enable etcd health-check~~ feat: enable etcd health-check Jun 29, 2021

spacewander approved these changes Jun 29, 2021

tokers reviewed Jun 29, 2021

remove loop increment

043eafc

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

tokers reviewed Jun 29, 2021

avoid keep nginx timer too long time

091b194

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

tokers approved these changes Jun 30, 2021

spacewander merged commit 994f020 into apache:master Jun 30, 2021

Yiyiyimu linked an issue Jun 30, 2021 that may be closed by this pull request

bug: etcd cluster setting,one node down,cannot connect to other healthy one #4414

Closed

Yiyiyimu removed a link to an issue Jun 30, 2021

bug: etcd cluster setting,one node down,cannot connect to other healthy one #4414

Closed
