Panic on varnishadm backend.set_health of dynamic backend #4183

delthas · 2024-09-16T16:08:06Z

Expected Behavior

No panic on the VTC.

Current Behavior

Panic:

Assert error in VBE_connwait_signal_all(), cache/cache_backend.c line 156:
Condition((bp)->magic == 0x64c4c7c6) not true.
cls_exec -> mgt_launch_child -> child_main -> CLI_Run -> VCLS_Poll -> cls_exec -> cli_backend_set_health -> VCL_IterDirector -> vcl_iterdir -> do_set_health -> VBE_connwait_signal_all -> VAS_Fail -> pan_ic

varnishtest "panic"

server s1 {
} -start

varnish v1 -vcl {
vcl 4.1;
import std;
import directors;
import dynamic;

backend foo none;

sub vcl_init {
    new cluster_default = directors.fallback();
    new main = dynamic.director(port = {"${s1_port}"});
    cluster_default.add_backend(main.backend(host={"${s1_addr}"}, port={"${s1_port}"}));
}
} -start

# Panic:
# Assert error in VBE_connwait_signal_all(), cache/cache_backend.c line 156:
#  Condition((bp)->magic == 0x64c4c7c6) not true.
# cls_exec -> mgt_launch_child -> child_main -> CLI_Run -> VCLS_Poll -> cls_exec -> cli_backend_set_health -> VCL_IterDirector -> vcl_iterdir -> do_set_health -> VBE_connwait_signal_all -> VAS_Fail -> pan_ic
varnish v1 -cliok "backend.set_health main(*) sick"

Even though my VTC involes libvmod-dynamic, this is likely an issue in Varnish, because this broke between Varnish 7.5.0 and Varnish 7.6.0

I ran a git bisect and the first failing commit is: e46f972 ("unsurprisingly" this is the commit adding the function which has the assert).

Context

I'm using dynamic backends inside directors.

The text was updated successfully, but these errors were encountered:

nigoroll · 2024-09-16T16:35:27Z

Thank you for the test case, this is reproducible immediately.

The problem here is that VBE_connwait_signal_all() gets called on a dynamic domain director.

looking... 👀

4  0x000055fa0d0376ec in pan_ic (func=0x55fa0d1299ec "VBE_connwait_signal_all", file=0x55fa0d1299af "cache/cache_backend.c", line=156, cond=0x55fa0d1299d2 "(bp)->magic == 0x64c4c7c6", kind=VAS_ASSERT) at cache/cache_panic.c:778
#5  0x000055fa0d0fead5 in VAS_Fail (func=0x55fa0d1299ec "VBE_connwait_signal_all", file=0x55fa0d1299af "cache/cache_backend.c", line=156, cond=0x55fa0d1299d2 "(bp)->magic == 0x64c4c7c6", kind=VAS_ASSERT) at vas.c:67
#6  0x000055fa0cff5340 in VBE_connwait_signal_all (bp=0x7ffa789d7000) at cache/cache_backend.c:156
#7  0x000055fa0d00e101 in do_set_health (cli=0x7ffa788d49d0, d=0x7ffa7881df30, priv=0x7ffcedc1dc90) at cache/cache_director.c:495
#8  0x000055fa0d04ef07 in vcl_iterdir (cli=0x7ffa788d49d0, pat=0x7ffa788202c0 "vcl1.main(*)", vcl=0x7ffa788471e0, func=0x55fa0d00def0 <do_set_health>, priv=0x7ffcedc1dc90) at cache/cache_vcl.c:385
#9  0x000055fa0d04ed4a in VCL_IterDirector (cli=0x7ffa788d49d0, pat=0x7ffa788283c8 "main(*)", func=0x55fa0d00def0 <do_set_health>, priv=0x7ffcedc1dc90) at cache/cache_vcl.c:421
#10 0x000055fa0d00d3fa in cli_backend_set_health (cli=0x7ffa788d49d0, av=0x7ffa7882c580, priv=0x0) at cache/cache_director.c:519
#11 0x000055fa0d103872 in cls_dispatch (cli=0x7ffa788d49d0, cs=0x7ffa78822d60, av=0x7ffa7882c580, ac=3) at vcli_serve.c:273
#12 0x000055fa0d10335d in cls_exec (cfd=0x7ffa788d49a0, av=0x7ffa7882c580, ac=3) at vcli_serve.c:324
#13 0x000055fa0d1021f6 in cls_feed (cfd=0x7ffa788d49a0, p=0x7ffcedc1de30 "\n54.947472/vgc.so 1auto\nhy", e=0x7ffcedc1de31 "54.947472/vgc.so 1auto\nhy") at vcli_serve.c:412
#14 0x000055fa0d101cba in VCLS_Poll (cs=0x7ffa78822d60, cli=0x7ffa788d49d0, timeout=-1) at vcli_serve.c:617
#15 0x000055fa0d0060b2 in CLI_Run () at cache/cache_cli.c:110
#16 0x000055fa0d031925 in child_main (sigmagic=1, altstksz=81920) at cache/cache_main.c:453
...
(gdb) fr 6
#6  0x000055fa0cff5340 in VBE_connwait_signal_all (bp=0x7ffa789d7000) at cache/cache_backend.c:156
156		CHECK_OBJ_NOTNULL(bp, BACKEND_MAGIC);
(gdb) p /x bp->magic
$2 = 0x1bfe1345

$ git grep 0x1bfe1345
src/vmod_dynamic.h:#define DYNAMIC_DOMAIN_MAGIC         0x1bfe1345

nigoroll · 2024-09-16T16:40:53Z

I pointed out a layer violation at the time, and overlooked this one. The director code should really not call into VBE code directly...

dridi · 2024-09-16T17:02:56Z

This was bothering me but I couldn't put my finger on it at the time. I also found another wrong assumption (#4134 (review)) and I'm not entirely convinced by VBE_is_ah_auto().

the idea here is to use the backend's healthy callback to signal to it a health change from outside. this fixes the regression from varnishcache#4183, but makes v74.vtc fail. And also I am not happy about the locking involved and need to spend more time...

Fixes varnishcache#4183

Pondering solutions for varnishcache#4183 it became apparent that we currently have no way for core code to notify directors of changes, in particular changes of the admin_health. I looked into re-using the healthy callback, but it is unsuitable because VRT_Healthy() does not even call it for an "auto" admin_health. So the cleanest option is another director callback, which is cheap and also helps us maintain clean layering.

Using the new facility, we can now selectively notify interested backend types (so far: VBE only) when the admin health changes. This fixes the layer violation introduced with e46f972, where a director private pointer was used with a VBE specific function. Fixes varnishcache#4183

Pondering solutions for varnishcache#4183 it became apparent that we currently have no way for core code to notify directors of changes, in particular changes of the admin_health. I looked into re-using the healthy callback, but it is unsuitable because VRT_Healthy() does not even call it for an "auto" admin_health. So the cleanest option is another director callback, which is cheap and also helps us maintain clean layering.

Using the new facility, we can now selectively notify interested backend types (so far: VBE only) when the admin health changes. This fixes the layer violation introduced with e46f972, where a director private pointer was used with a VBE specific function. Fixes varnishcache#4183

from varnishcache/varnish-cache#4183 contributed by delthas

Refs: * varnishcache/varnish-cache#4183 * varnishcache/varnish-cache#4186