
Multiple processes tripped up by a non-existent vhost leads to node termination #1280

Closed
michaelklishin opened this issue Jun 30, 2017 · 1 comment

michaelklishin commented Jun 30, 2017

Accidentally reproduced while working on an issue in CLI tools:

2017-06-30 10:45:37.649 [error] <0.505.0> Supervisor {<0.505.0>,rabbit_vhost_sup_wrapper} had child rabbit_vhost_sup started with rabbit_vhost_sup_wrapper:start_vhost_sup(<<"test_vhost1">>) at undefined exit with reason {'EXIT',{{badmatch,{error,{no_such_vhost,<<"test_vhost1">>}}},[{rabbit_recovery_terms,start,1,[{file,"src/rabbit_recovery_terms.erl"},{line,51}]},{rabbit_queue_index,start,2,[{file,"src/rabbit_queue_index.erl"},{line,502}]},{rabbit_variable_queue,start,2,[{file,"src/rabbit_variable_queue.erl"},{line,482}]},{rabbit_priority_queue,start,2,[{file,"src/rabbit_priority_queue.erl"},{line,92}]},{rabbit_amqqueue,recover,1,[{file,"src/rabbit_amqqueue.erl"},{line,236}]},{rabbit_vhost,recover,1,[{file,...},...]},...]}} in context start_error
2017-06-30 10:45:37.649 [info] <0.505.0> Making sure data directory '/var/folders/b3/6thx2ls900z6h456z2xb658m0000gn/T/rabbitmq-test-instances/rabbit/mnesia/rabbit/msg_stores/vhosts/BPY12RLD3HXBADXVD0QCEE7DC' for vhost 'test_vhost1' exists
2017-06-30 10:45:37.650 [error] <0.505.0> Supervisor {<0.505.0>,rabbit_vhost_sup_wrapper} had child rabbit_vhost_sup started with rabbit_vhost_sup_wrapper:start_vhost_sup(<<"test_vhost1">>) at undefined exit with reason {'EXIT',{{badmatch,{error,{no_such_vhost,<<"test_vhost1">>}}},[{rabbit_recovery_terms,start,1,[{file,"src/rabbit_recovery_terms.erl"},{line,51}]},{rabbit_queue_index,start,2,[{file,"src/rabbit_queue_index.erl"},{line,502}]},{rabbit_variable_queue,start,2,[{file,"src/rabbit_variable_queue.erl"},{line,482}]},{rabbit_priority_queue,start,2,[{file,"src/rabbit_priority_queue.erl"},{line,92}]},{rabbit_amqqueue,recover,1,[{file,"src/rabbit_amqqueue.erl"},{line,236}]},{rabbit_vhost,recover,1,[{file,...},...]},...]}} in context start_error
2017-06-30 10:45:37.650 [error] <0.505.0> Supervisor {<0.505.0>,rabbit_vhost_sup_wrapper} had child rabbit_vhost_sup started with rabbit_vhost_sup_wrapper:start_vhost_sup(<<"test_vhost1">>) at {restarting,<0.506.0>} exit with reason reached_max_restart_intensity in context shutdown
2017-06-30 10:45:37.650 [error] <0.328.0> Supervisor rabbit_vhost_sup_sup had child rabbit_vhost started with rabbit_vhost_sup_wrapper:start_link(<<"test_vhost1">>) at <0.505.0> exit with reason shutdown in context child_terminated
2017-06-30 10:45:37.650 [error] <0.328.0> Supervisor rabbit_vhost_sup_sup had child rabbit_vhost started with rabbit_vhost_sup_wrapper:start_link(<<"test_vhost1">>) at <0.505.0> exit with reason reached_max_restart_intensity in context shutdown
2017-06-30 10:45:37.650 [info] <0.338.0> Stopping message store for directory '/var/folders/b3/6thx2ls900z6h456z2xb658m0000gn/T/rabbitmq-test-instances/rabbit/mnesia/rabbit/msg_stores/vhosts/9FYW91Z090TQ8GC4PKLCNIK12/msg_store_persistent'
2017-06-30 10:45:37.650 [error] <0.257.0> Supervisor rabbit_sup had child rabbit_vhost_sup_sup started with rabbit_vhost_sup_sup:start_link() at <0.328.0> exit with reason shutdown in context child_terminated
2017-06-30 10:45:37.650 [info] <0.371.0> Stopping message store for directory '/var/folders/b3/6thx2ls900z6h456z2xb658m0000gn/T/rabbitmq-test-instances/rabbit/mnesia/rabbit/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent'
2017-06-30 10:45:37.650 [error] <0.257.0> Supervisor rabbit_sup had child rabbit_vhost_sup_sup started with rabbit_vhost_sup_sup:start_link() at <0.328.0> exit with reason reached_max_restart_intensity in context shutdown

To reproduce this, run an integration test that creates a vhost, does something very quickly
(e.g. in our CLI suite, runs a command that fails validation immediately) and deletes the vhost.
As a result, one of the vhost supervision tree processes can trip up because the record no longer
exists due to a concurrent deletion.
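
A minimal sketch of that sequence from a test, assuming the 3.7-era internal API where rabbit_vhost:add/2 and rabbit_vhost:delete/2 take an acting-user argument (older branches use add/1 and delete/1, so treat the exact calls as an assumption):

repro_vhost_race() ->
    VHost = <<"test_vhost1">>,
    %% create the vhost, which also starts its per-vhost supervision tree
    _ = rabbit_vhost:add(VHost, <<"tests">>),
    %% do something trivial against it, e.g. a CLI command that fails
    %% validation immediately, then delete the vhost right away
    _ = rabbit_vhost:delete(VHost, <<"tests">>),
    %% the vhost supervisor may still be starting here, so recovery can hit
    %% {error, {no_such_vhost, VHost}} and crash as in the log above
    ok.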

Such cases should be logged as errors and otherwise ignored.
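
In code, that means the recovery path should treat a concurrently deleted vhost as a no-op instead of letting the badmatch propagate up the supervision tree; a rough sketch of the intended handling (open_recovery_terms/1 is a hypothetical helper, not the actual rabbit_recovery_terms code):

start(VHost) ->
    case open_recovery_terms(VHost) of            %% hypothetical helper
        {ok, _Tab} ->
            ok;
        {error, {no_such_vhost, _}} = Err ->
            %% the vhost was deleted concurrently: log an error and
            %% return instead of crashing the vhost supervision tree
            rabbit_log:error("Recovery terms: vhost '~s' no longer exists, skipping it",
                             [VHost]),
            Err
    end.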

@michaelklishin michaelklishin added this to the 3.7.0 milestone Jun 30, 2017
@michaelklishin michaelklishin self-assigned this Jun 30, 2017
@michaelklishin michaelklishin changed the title from "rabbit_vhost_sup_wrapper:start_vhost_sup/1 does not handle non-existent vhosts gracefully" to "Recovery terms process tripped up by a non-existent vhost leads to node termination" Jun 30, 2017
@michaelklishin michaelklishin changed the title from "Recovery terms process tripped up by a non-existent vhost leads to node termination" to "Multiple processes tripped up by a non-existent vhost leads to node termination" Jun 30, 2017
@michaelklishin (Member, Author)

Digging deeper and fixing a bunch of terminations here and there (we now log an error and do nothing instead), I can still reproduce node shutdown because the vhost watcher can run into a non-existent vhost. That is by design, see #1158. We will revisit what the default strategy should be.

michaelklishin added a commit that referenced this issue Jul 1, 2017
When a vhost is created and immediately deleted, those operations
can end up executing concurrently. This changes a number of things
to be more resilient:

 * Default vhost restart strategy changes to "continue" (previously "ignore").
   It's a tough default to pick, but we believe a single vhost terminating
   an entire node that may serve many users does more damage than asking
   environments with only one vhost to opt into an alternative strategy
   (a config sketch for overriding the default follows this commit message).
   Note that message store or vhost supervision tree termination should be
   a very rare event.

   Per discussion with @kjnilsson.

 * For vhost [message store] recovery, we now log "no_such_vhost" errors
   in a sensible way instead of a bunch of process terminations.

 * Max restart intensity settings for rabbit_vhost_sup_wrapper are
   changed to accommodate integration test suites that rapidly set up
   temporary vhosts and delete them. In most message store failure scenarios
   the new value is still low enough to lead to vhost supervision tree
   termination when recovery is impossible.

There are also drive-by doc changes.

Fixes #1280.
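
For operators of single-vhost nodes who prefer the old fail-fast behaviour, the strategy can presumably be overridden in advanced.config; a sketch, assuming the value is read from the rabbit application environment under a vhost_restart_strategy key (the key name is an assumption; continue and stop_node are the values mentioned in this thread):

%% advanced.config sketch; 'continue' is the new default described above,
%% 'stop_node' restores the "a failed vhost takes the node down" behaviour.
[
  {rabbit, [
    {vhost_restart_strategy, stop_node}
  ]}
].
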
hairyhum pushed a commit that referenced this issue Jul 5, 2017
An intrinsic child causes the supervisor to stop with the `shutdown`
signal, which will crash the node with the `stop_node` strategy.
This can trigger a node crash in the case of a race condition between
deleting a vhost and a watcher timeout.
Fixes [#1280]
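
For context, RabbitMQ's supervisor2 adds an intrinsic restart type on top of the stock OTP ones: abnormal exits are restarted as with transient, but a clean exit of the child also shuts the supervisor itself down. A sketch of the difference using old-style child specs (the child id and MFA below are illustrative, not the actual watcher spec):

%% 'intrinsic': a clean exit of the child shuts the supervisor down too,
%% which under the stop_node strategy takes the whole node with it.
{vhost_watcher,
 {rabbit_vhost_process, start_link, [VHost]},
 intrinsic, 30000, worker, [rabbit_vhost_process]}.

%% 'transient': a clean exit just means the child is not restarted;
%% the rest of the supervision tree keeps running.
{vhost_watcher,
 {rabbit_vhost_process, start_link, [VHost]},
 transient, 30000, worker, [rabbit_vhost_process]}.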