Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Still happens: Dynamic Backends can only be added to warm VCLs #117

Open
nigoroll opened this issue May 27, 2024 · 6 comments
Open

Still happens: Dynamic Backends can only be added to warm VCLs #117

nigoroll opened this issue May 27, 2024 · 6 comments

Comments

@nigoroll
Copy link
Owner

nigoroll commented May 27, 2024

I'm getting the same assert again, but in a slightly different way (even with the fix).

The stack is still:

Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x5f26789d8ebe sp=0x70bc775fe870 <pan_ic+0x34e>
  ip=0x5f2678aa5985 sp=0x70bc775fe9e0 <VAS_Fail+0x55>
  ip=0x5f2678a0ee86 sp=0x70bc775fea30 <VRT_AddDirector+0x956>
  ip=0x5f267899527d sp=0x70bc775feb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x5f2678995d22 sp=0x70bc775fec80 <VRT_new_backend+0x132>
  ip=0x70bc82abfe61 sp=0x70bc775fecb0 <ref_add+0x660>
  ip=0x70bc82ac0acc sp=0x70bc775ff1d0 <dom_update+0xb69>
  ip=0x70bc82ac1395 sp=0x70bc775ff340 <dom_lookup_thread+0x485>
  ip=0x70bc99399ded sp=0x70bc775ff450
  ip=0x70bc9941d0dc sp=0x70bc775ff500

I'm stressing Varnish + libvmod-dynamic with a lot of VCL reloads. Everything usually works fine for a while (sometimes hours), until Varnish starts to loop on this error, every ~15s to ~1 minute. So, no issues for hours and then once I get this panic, it repeats every ~15s to ~1min: child crashes, child restarts, child crashes, ...

Here's an excerpt of the full logs:

Debug: Child (606349) Started
Child launched OK
Info: Child (606349) said Child starts
Error: Child (606349) not responding to CLI, killed it.
Error: Child (606349) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error (hdr)
Error: Child (606349) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (606349) died signal=3 (core dumped)
Debug: Child cleanup complete
Debug: Child (645026) Started
Child launched OK
Info: Child (645026) said Child starts
Error: Child (645026) not responding to CLI, killed it.
Error: Child (645026) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (645026) died signal=6 (core dumped)
Error: Child (645026) Panic at: Fri, 24 May 2024 14:58:32 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x6281a32f6ebe sp=0x76e6579fe870 <pan_ic+0x34e>
  ip=0x6281a33c3985 sp=0x76e6579fe9e0 <VAS_Fail+0x55>
  ip=0x6281a332ce86 sp=0x76e6579fea30 <VRT_AddDirector+0x956>
  ip=0x6281a32b327d sp=0x76e6579feb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x6281a32b3d22 sp=0x76e6579fec80 <VRT_new_backend+0x132>
  ip=0x76e661960e61 sp=0x76e6579fecb0 <ref_add+0x660>
  ip=0x76e661961acc sp=0x76e6579ff1d0 <dom_update+0xb69>
  ip=0x76e661962395 sp=0x76e6579ff340 <dom_lookup_thread+0x485>
  ip=0x76e676e3aded sp=0x76e6579ff450
  ip=0x76e676ebe0dc sp=0x76e6579ff500
Debug: Child cleanup complete
Debug: Child (649299) Started
Child launched OK
Info: Child (649299) said Child starts
Error: Child (649299) not responding to CLI, killed it.
Error: Child (649299) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (649299) died signal=6 (core dumped)
Error: Child (649299) Panic at: Fri, 24 May 2024 14:58:46 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x6281a32f6ebe sp=0x76e654bfe870 <pan_ic+0x34e>
  ip=0x6281a33c3985 sp=0x76e654bfe9e0 <VAS_Fail+0x55>
  ip=0x6281a332ce86 sp=0x76e654bfea30 <VRT_AddDirector+0x956>
  ip=0x6281a32b327d sp=0x76e654bfeb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x6281a32b3d22 sp=0x76e654bfec80 <VRT_new_backend+0x132>
  ip=0x76e661686e61 sp=0x76e654bfecb0 <ref_add+0x660>
  ip=0x76e661687acc sp=0x76e654bff1d0 <dom_update+0xb69>
  ip=0x76e661688395 sp=0x76e654bff340 <dom_lookup_thread+0x485>
  ip=0x76e676e3aded sp=0x76e654bff450
  ip=0x76e676ebe0dc sp=0x76e654bff500
Debug: Child cleanup complete
Debug: Child (653709) Started
Child launched OK
Info: Child (653709) said Child starts
Error: Child (653709) not responding to CLI, killed it.
Error: Child (653709) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (653709) died signal=6 (core dumped)
Error: Child (653709) Panic at: Fri, 24 May 2024 14:59:01 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x6281a32f6ebe sp=0x76e6523fe870 <pan_ic+0x34e>
  ip=0x6281a33c3985 sp=0x76e6523fe9e0 <VAS_Fail+0x55>
  ip=0x6281a332ce86 sp=0x76e6523fea30 <VRT_AddDirector+0x956>
  ip=0x6281a32b327d sp=0x76e6523feb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x6281a32b3d22 sp=0x76e6523fec80 <VRT_new_backend+0x132>
  ip=0x76e66099fe61 sp=0x76e6523fecb0 <ref_add+0x660>
  ip=0x76e6609a0acc sp=0x76e6523ff1d0 <dom_update+0xb69>
  ip=0x76e6609a1395 sp=0x76e6523ff340 <dom_lookup_thread+0x485>
  ip=0x76e676e3aded sp=0x76e6523ff450
  ip=0x76e676ebe0dc sp=0x76e6523ff500
Debug: Child cleanup complete
Debug: Child (658076) Started
Child launched OK
Info: Child (658076) said Child starts
Error: Child (658076) not responding to CLI, killed it.
Error: Child (658076) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (658076) died signal=6 (core dumped)
Error: Child (658076) Panic at: Fri, 24 May 2024 14:59:21 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x6281a32f6ebe sp=0x76e655ffe870 <pan_ic+0x34e>
  ip=0x6281a33c3985 sp=0x76e655ffe9e0 <VAS_Fail+0x55>
  ip=0x6281a332ce86 sp=0x76e655ffea30 <VRT_AddDirector+0x956>
  ip=0x6281a32b327d sp=0x76e655ffeb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x6281a32b3d22 sp=0x76e655ffec80 <VRT_new_backend+0x132>
  ip=0x76e6617a5e61 sp=0x76e655ffecb0 <ref_add+0x660>
  ip=0x76e6617a6acc sp=0x76e655fff1d0 <dom_update+0xb69>
  ip=0x76e6617a7395 sp=0x76e655fff340 <dom_lookup_thread+0x485>
  ip=0x76e676e3aded sp=0x76e655fff450
  ip=0x76e676ebe0dc sp=0x76e655fff500
Debug: Child cleanup complete
Debug: Child (662441) Started
Child launched OK
Info: Child (662441) said Child starts
Error: Child (662441) not responding to CLI, killed it.
Error: Child (662441) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (662441) died signal=6 (core dumped)
Error: Child (662441) Panic at: Fri, 24 May 2024 14:59:38 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x6281a32f6ebe sp=0x76e6567fe870 <pan_ic+0x34e>
  ip=0x6281a33c3985 sp=0x76e6567fe9e0 <VAS_Fail+0x55>
  ip=0x6281a332ce86 sp=0x76e6567fea30 <VRT_AddDirector+0x956>
  ip=0x6281a32b327d sp=0x76e6567feb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x6281a32b3d22 sp=0x76e6567fec80 <VRT_new_backend+0x132>
  ip=0x76e662eaae61 sp=0x76e6567fecb0 <ref_add+0x660>
  ip=0x76e662eabacc sp=0x76e6567ff1d0 <dom_update+0xb69>
  ip=0x76e662eac395 sp=0x76e6567ff340 <dom_lookup_thread+0x485>
  ip=0x76e676e3aded sp=0x76e6567ff450
  ip=0x76e676ebe0dc sp=0x76e6567ff500
Debug: Child cleanup complete

It continued crashing in a loop indefinitely for multiple days. I managed to attach to the process and the VCL was indeed COOL at the time of the panic.

Originally posted by @delthas in #108 (comment)

@nigoroll nigoroll changed the title I'm getting the same assert again, but in a slightly different way (even with the fix). Still happens: Dynamic Backends can only be added to warm VCLs May 27, 2024
@nigoroll
Copy link
Owner Author

@delthas apologies for having lost track of this issue. If you have time, could you please (re)confirm that it still happens with latest Varnish-Cache? In particular, I would like to exclude the possibility of this being related to varnishcache/varnish-cache#4108, for which the fix went in on Monday.

@pnbecker
Copy link

@nigoroll I can confirm this still happens with Varnish 7.5.0. Varnish 7.5.0 crashed here this morning with:

Wrong turn at cache/cache_vrt_vcl.c:252:
Dynamic Backends can only be added to warm VCLs

Panic.show in varnishadm reported:

varnish> panic.show
200        
Panic at: Fri, 19 Jul 2024 07:36:47 GMT
Wrong turn at cache/cache_vrt_vcl.c:252:
Dynamic Backends can only be added to warm VCLs
version = varnish-7.5.0 revision eef25264e5ca5f96a77129308edb83ccf84cb1b1, vrt api = 19.0
ident = Linux,5.15.111-flatcar,x86_64,-jnone,-smalloc,-sdefault,-hcritbit,epoll
now = 954234.585670 (mono), 1721374607.072795 (real)
Backtrace:
  0x55f72eefecfe: varnishd(+0x5ccfe) [0x55f72eefecfe]
  0x55f72ef7f2c5: varnishd(VAS_Fail+0x45) [0x55f72ef7f2c5]
  0x55f72ef2301b: varnishd(+0x8101b) [0x55f72ef2301b]
  0x55f72eed381d: varnishd(VRT_new_backend_clustered+0x45d) [0x55f72eed381d]
  0x7f976a0cf394: ./vmod_cache/_vmod_dynamic.6626bfa332c102136cbd5ce44844ca052182968667357aa177db213793d9b324(+0x7394) [0x7f976a0cf394]
  0x7f9777b2d134: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x7f9777b2d134]
  0x7f9777baca40: /lib/x86_64-linux-gnu/libc.so.6(__clone+0x40) [0x7f9777baca40]
argv = {
  [0] = \"varnishd\",
  [1] = \"-F\",
  [2] = \"-f\",
  [3] = \"/etc/varnish/default.vcl\",
  [4] = \"-a\",
  [5] = \"http=:80,HTTP\",
  [6] = \"-a\",
  [7] = \"proxy=:8443,PROXY\",
  [8] = \"-p\",
  [9] = \"feature=+http2\",
  [10] = \"-s\",
  [11] = \"malloc,27G\",
  [12] = \"-p\",
  [13] = \"feature=+http2\",
  [14] = \"-p\",
  [15] = \"default_keep=300\",
}
pthread.self = 0x7f9766e326c0
pthread.attr = {
  guard = 4096,
  stack_bottom = 0x7f9766633000,
  stack_top = 0x7f9766e33000,
  stack_size = 8388608,
}
thr.req = NULL
thr.busyobj = NULL
thr.worker = NULL
vmods = {
  std = {0x7f97774d7230, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 0.0},
  cookie = {0x7f97774d72a0, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 0.0},
  dynamic = {0x7f97774d7380, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 19.0},
  directors = {0x7f97774d73f0, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 0.0},
},
pools = {
  pool = 0x7f9770200000 {
    nidle = 96,
    nthr = 100,
    lqueue = 0
  },
  pool = 0x7f9770200640 {
    nidle = 96,
    nthr = 100,
    lqueue = 0
  },
},

The log of the varnish docker container contained more or less the same information:

Debug: Version: varnish-7.5.0 revision eef25264e5ca5f96a77129308edb83ccf84cb1b1,
Debug: Platform: Linux,5.15.111-flatcar,x86_64,-jnone,-smalloc,-sdefault,-hcritbit,
Debug: Child (22) Started,
Child launched OK,
Info: Child (22) said Child starts,
Error: Child (22) not responding to CLI, killed it.,
Error: Unexpected reply from ping: 400 CLI communication error (hdr),
Error: Child (22) died signal=9,
Debug: Child cleanup complete,
Debug: Child (4946) Started,
Child launched OK,
Error: Child (4946) not responding to CLI, killed it.,
Error: Child (4946) Pushing vcls failed:,
CLI communication error (hdr),
Debug: Stopping Child,
Info: Child (4946) said Child starts,
Error: Child (4946) died signal=6 (core dumped),
Error: Child (4946) Panic at: Fri, 19 Jul 2024 07:36:47 GMT,
Wrong turn at cache/cache_vrt_vcl.c:252:,
Dynamic Backends can only be added to warm VCLs,
version = varnish-7.5.0 revision eef25264e5ca5f96a77129308edb83ccf84cb1b1, vrt api = 19.0,
ident = Linux,5.15.111-flatcar,x86_64,-jnone,-smalloc,-sdefault,-hcritbit,epoll,
now = 954234.585670 (mono), 1721374607.072795 (real),
Backtrace:,
  0x55f72eefecfe: varnishd(+0x5ccfe) [0x55f72eefecfe],
  0x55f72ef7f2c5: varnishd(VAS_Fail+0x45) [0x55f72ef7f2c5],
  0x55f72ef2301b: varnishd(+0x8101b) [0x55f72ef2301b],
  0x55f72eed381d: varnishd(VRT_new_backend_clustered+0x45d) [0x55f72eed381d],
  0x7f976a0cf394: ./vmod_cache/_vmod_dynamic.6626bfa332c102136cbd5ce44844ca052182968667357aa177db213793d9b324(+0x7394) [0x7f976a0cf394],
  0x7f9777b2d134: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x7f9777b2d134],
  0x7f9777baca40: /lib/x86_64-linux-gnu/libc.so.6(__clone+0x40) [0x7f9777baca40],
argv = {,
  [0] = \"varnishd\",,
  [1] = \"-F\",,
  [2] = \"-f\",,
  [3] = \"/etc/varnish/default.vcl\",,
  [4] = \"-a\",,
  [5] = \"http=:80,HTTP\",,
  [6] = \"-a\",,
  [7] = \"proxy=:8443,PROXY\",,
  [8] = \"-p\",,
  [9] = \"feature=+http2\",,
  [10] = \"-s\",,
  [11] = \"malloc,27G\",,
  [12] = \"-p\",,
  [13] = \"feature=+http2\",,
  [14] = \"-p\",,
  [15] = \"default_keep=300\",,
},
pthread.self = 0x7f9766e326c0,
pthread.attr = {,
  guard = 4096,,
  stack_bottom = 0x7f9766633000,,
  stack_top = 0x7f9766e33000,,
  stack_size = 8388608,,
},
thr.req = NULL,
thr.busyobj = NULL,
thr.worker = NULL,�,
{,
  std = {0x7f97774d7230, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 0.0},,
  cookie = {0x7f97774d72a0, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 0.0},,
  dynamic = {0x7f97774d7380, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 19.0},,
  directors = {0x7f97774d73f0, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 0.0},,
},,�,
{,
  pool = 0x7f9770200000 {,
    nidle = 96,,
    nthr = 100,,
    lqueue = 0,
  },,
  pool = 0x7f9770200640 {,
    nidle = 96,,
    nthr = 100,,
    lqueue = 0,
  },,
},,
,
,
Debug: Child cleanup complete,

@nigoroll
Copy link
Owner Author

Could anyone who can reproduce this issue please run varnishd with this little patch and report the Panic?

diff --git a/bin/varnishd/cache/cache_vrt_vcl.c b/bin/varnishd/cache/cache_vrt_vcl.c
index 82e6e10a5..a17db5726 100644
--- a/bin/varnishd/cache/cache_vrt_vcl.c
+++ b/bin/varnishd/cache/cache_vrt_vcl.c
@@ -249,7 +249,7 @@ VRT_AddDirector(VRT_CTX, const struct vdi_methods *m, void *priv,
                return (vcldir_surplus(vdir));
 
        if (!temp->is_warm && temp != VCL_TEMP_INIT)
-               WRONG("Dynamic Backends can only be added to warm VCLs");
+               WRONG(temp->name);
 
        return (vdir->dir);
 }

@nigoroll
Copy link
Owner Author

I am trying hard to reproduce this with current code (varnishcache/varnish-cache@7292f92 and vmod_dynamic e1b9307), no success.

Test vcl which I am using

vcl 4.1;

import dynamic;
import directors;

backend proforma none;

sub vcl_init {
	new r1 = dynamic.resolver();
	r1.set_resolution_type(RECURSING); # tried with and without

	new d = dynamic.director(
		resolver = r1.use(),
		ttl_from = dns
	);

	new rr = directors.round_robin();
	rr.add_backend(d.backend("www.uplex.de"));
}
sub vcl_recv {
	return (synth(200));
}
sub vcl_synth {
	set resp.body = "" + rr.backend().resolve() + " " +
		d.backend("uplex.de").resolve() + """
""";
	return (deliver);
}

start:

varnishd -a 127.0.0.1:8080 -f /tmp/t.vcl

vcl load loop:

while : ; do n=v$RANDOM ; varnishadm vcl.load $n /tmp/t.vcl && varnishadm vcl.use $n ; varnishadm vcl.list | awk '/^av/ {print "vcl.discard " $5}'|varnishadm ; varnishadm panic.show ; done

curl loop (in a second terminal):

while : ; do curl localhost:8080 ; done

I just don't get the panic, also not with this addition to deliberately delay resolution:

diff --git a/src/dyn_resolver_getdns.c b/src/dyn_resolver_getdns.c
index c674f1b..f97b761 100644
--- a/src/dyn_resolver_getdns.c
+++ b/src/dyn_resolver_getdns.c
@@ -32,6 +32,7 @@
 //#include <sys/types.h>
 #include <sys/socket.h>
 #include <netdb.h>
+#include <unistd.h>
 
 #ifdef __FreeBSD__
 #include <arpa/inet.h>
@@ -203,6 +204,7 @@ getdns_lookup(struct VPFX(dynamic_resolver) *r,
        ret = getdns_address_sync(c->context, node, NULL, &state->response);
        errchk(ret);
 
+       usleep(1000 * 1000);
        return (getdns_common_lookup_check(state));
 }
 

So, can anyone please help me with a reproducer?

@nigoroll
Copy link
Owner Author

@nigoroll I can confirm this still happens with Varnish 7.5.0. Varnish 7.5.0 crashed here this morning with:

FTR: The container build is https://github.com/varnish/docker-varnish/blob/master/fresh/debian/Dockerfile#L9C26-L9C66 which uses 5dc09f5, which is a minor change on top of 47bfcf5

So if this was root-caused in dynamic, it should still happen.

nigoroll added a commit that referenced this issue Jul 30, 2024
So it turns out, that - still - we can deadlock raching between VCL
events and director deletions (see
varnishcache/varnish-cache#4140). This commit
was originally written to fix #108, but it now enables us to take a
different route again: Do not pthread_join() lookup threads, but rather
detach them and make sure _they_ return the reference before terminating.

The remainder of the commit message is an edit in light of the new
insights:

Our lookup threads may continue for a bit after a VCL_COLD event has
been received: dom_event() sets dom->status to DYNAMIC_ST_DONE while
the lookup thread may be blocking in the resolver or updating a
domain.

So, lookup threads might try to create backends even after the VCL has
gone cold, in which case VRT_AddDirector() would panic with "Dynamic
Backends can only be added to warm VCLs" - _unless_ a reference is
present on the VCL to make the VCL state change go through
VCL_TEMP_COOLING, in which case VRT_AddDirector() just returns NULL.

vmod_dynamic took a VCL reference since
99912bc back in May 2019, but before
the merge of varnishcache/varnish-cache#4037
it did not work correctly, because the VCL reference was returned
before all threads had terminated.

The naive idea to solve this would be to take/release a VCL reference
in each lookup thread (or when starting/stopping it), but references
can only be taken from the CLI context (ASSERT_CLI() in
VRT_VCL_Prevent_Discard). Also, to support use of the .resolve() type
method from vcl_init{}, we needed to depart from the original concept
of running resolver threads only for warm vcls.

We get out of this situation by using a reference on top of the VCL
reference, which we can take even before we have the VCL reference.
The VCL reference is returned once all internal "referef"s are returned.

Ref #108

Probably fixes #117 also
@nigoroll
Copy link
Owner Author

Oops, I did not mean to close this issue before confirmation. (too much github magic, closing issues across projects is, hm, interesting)

@nigoroll nigoroll reopened this Jul 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants