Accidentally reproduced while working on an issue in CLI tools:
2017-06-30 10:45:37.649 [error] <0.505.0> Supervisor {<0.505.0>,rabbit_vhost_sup_wrapper} had child rabbit_vhost_sup started with rabbit_vhost_sup_wrapper:start_vhost_sup(<<"test_vhost1">>) at undefined exit with reason {'EXIT',{{badmatch,{error,{no_such_vhost,<<"test_vhost1">>}}},[{rabbit_recovery_terms,start,1,[{file,"src/rabbit_recovery_terms.erl"},{line,51}]},{rabbit_queue_index,start,2,[{file,"src/rabbit_queue_index.erl"},{line,502}]},{rabbit_variable_queue,start,2,[{file,"src/rabbit_variable_queue.erl"},{line,482}]},{rabbit_priority_queue,start,2,[{file,"src/rabbit_priority_queue.erl"},{line,92}]},{rabbit_amqqueue,recover,1,[{file,"src/rabbit_amqqueue.erl"},{line,236}]},{rabbit_vhost,recover,1,[{file,...},...]},...]}} in context start_error
2017-06-30 10:45:37.649 [info] <0.505.0> Making sure data directory '/var/folders/b3/6thx2ls900z6h456z2xb658m0000gn/T/rabbitmq-test-instances/rabbit/mnesia/rabbit/msg_stores/vhosts/BPY12RLD3HXBADXVD0QCEE7DC' for vhost 'test_vhost1' exists
2017-06-30 10:45:37.650 [error] <0.505.0> Supervisor {<0.505.0>,rabbit_vhost_sup_wrapper} had child rabbit_vhost_sup started with rabbit_vhost_sup_wrapper:start_vhost_sup(<<"test_vhost1">>) at undefined exit with reason {'EXIT',{{badmatch,{error,{no_such_vhost,<<"test_vhost1">>}}},[{rabbit_recovery_terms,start,1,[{file,"src/rabbit_recovery_terms.erl"},{line,51}]},{rabbit_queue_index,start,2,[{file,"src/rabbit_queue_index.erl"},{line,502}]},{rabbit_variable_queue,start,2,[{file,"src/rabbit_variable_queue.erl"},{line,482}]},{rabbit_priority_queue,start,2,[{file,"src/rabbit_priority_queue.erl"},{line,92}]},{rabbit_amqqueue,recover,1,[{file,"src/rabbit_amqqueue.erl"},{line,236}]},{rabbit_vhost,recover,1,[{file,...},...]},...]}} in context start_error
2017-06-30 10:45:37.650 [error] <0.505.0> Supervisor {<0.505.0>,rabbit_vhost_sup_wrapper} had child rabbit_vhost_sup started with rabbit_vhost_sup_wrapper:start_vhost_sup(<<"test_vhost1">>) at {restarting,<0.506.0>} exit with reason reached_max_restart_intensity in context shutdown
2017-06-30 10:45:37.650 [error] <0.328.0> Supervisor rabbit_vhost_sup_sup had child rabbit_vhost started with rabbit_vhost_sup_wrapper:start_link(<<"test_vhost1">>) at <0.505.0> exit with reason shutdown in context child_terminated
2017-06-30 10:45:37.650 [error] <0.328.0> Supervisor rabbit_vhost_sup_sup had child rabbit_vhost started with rabbit_vhost_sup_wrapper:start_link(<<"test_vhost1">>) at <0.505.0> exit with reason reached_max_restart_intensity in context shutdown
2017-06-30 10:45:37.650 [info] <0.338.0> Stopping message store for directory '/var/folders/b3/6thx2ls900z6h456z2xb658m0000gn/T/rabbitmq-test-instances/rabbit/mnesia/rabbit/msg_stores/vhosts/9FYW91Z090TQ8GC4PKLCNIK12/msg_store_persistent'
2017-06-30 10:45:37.650 [error] <0.257.0> Supervisor rabbit_sup had child rabbit_vhost_sup_sup started with rabbit_vhost_sup_sup:start_link() at <0.328.0> exit with reason shutdown in context child_terminated
2017-06-30 10:45:37.650 [info] <0.371.0> Stopping message store for directory '/var/folders/b3/6thx2ls900z6h456z2xb658m0000gn/T/rabbitmq-test-instances/rabbit/mnesia/rabbit/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent'
2017-06-30 10:45:37.650 [error] <0.257.0> Supervisor rabbit_sup had child rabbit_vhost_sup_sup started with rabbit_vhost_sup_sup:start_link() at <0.328.0> exit with reason reached_max_restart_intensity in context shutdown
To reproduce this, run an integration test that creates a vhost, does something very quickly
(e.g. in our CLI suite, runs a command that fails validation immediately), and then deletes the vhost.
As a result, one of the vhost supervision tree processes can trip up because the vhost record no longer
exists due to the concurrent deletion.
Such cases should be logged as errors but otherwise ignored.
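A minimal sketch of that "log and ignore" handling, using existing broker functions (`rabbit_vhost:exists/1`, `rabbit_vhost:recover/1`, `rabbit_log:error/2`) wrapped in a purely illustrative helper; this is not the actual fix:

```erlang
-module(vhost_recovery_sketch).
-export([maybe_recover/1]).

%% Illustrative only: skip recovery of a vhost that was deleted concurrently
%% instead of letting {error, {no_such_vhost, _}} blow up in a badmatch and
%% take down the vhost supervision tree.
maybe_recover(VHost) ->
    case rabbit_vhost:exists(VHost) of
        true ->
            rabbit_vhost:recover(VHost);
        false ->
            %% The vhost record is gone: log an error and carry on.
            rabbit_log:error(
              "Skipping recovery of vhost '~s': it no longer exists~n",
              [VHost]),
            ok
    end.
```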
michaelklishin changed the title from "rabbit_vhost_sup_wrapper:start_vhost_sup/1 does not handle non-existent vhosts gracefully" to "Recovery terms process tripped up by a non-existent vhost leads to node termination" on Jun 30, 2017.
michaelklishin changed the title from "Recovery terms process tripped up by a non-existent vhost leads to node termination" to "Multiple processes tripped up by a non-existent vhost leads to node termination" on Jun 30, 2017.
Digging deeper and fixing a bunch of terminations here and there (instead, we now log an error and do nothing), I can still reproduce node shutdown because the vhost watcher can run into a non-existent vhost. That is by design; see #1158. We will revisit what the default strategy should be.
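For reference, a node-level override of that strategy could be expressed as classic Erlang config terms. The sketch below assumes a `vhost_restart_strategy` key under the `rabbit` application environment with `continue`/`stop_node` values; treat both the key name and the accepted values as assumptions to verify against the broker version in use:

```erlang
%% advanced.config sketch; the vhost_restart_strategy key and its values are
%% assumptions here, not confirmed by this thread.
[
  {rabbit, [
    %% 'continue' keeps the node running if a single vhost's supervision tree
    %% gives up; 'stop_node' would take the whole node down instead.
    {vhost_restart_strategy, continue}
  ]}
].
```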
In the case of a vhost creation followed by an immediate deletion, those
operations can end up executing concurrently. This changes
a number of things to be more resilient:
* Default vhost restart strategy changes to "continue" (previously "ignore").
It's a tough default to pick, but we believe the damage from a single vhost terminating
an entire node that may serve many users is greater than the cost of setting up an alternative
strategy in environments where only one vhost is used. Note that
message store or vhost supervision tree termination should be a very rare event.
Per discussion with @kjnilsson.
* For vhost [message store] recovery, we now log "no_such_vhost" errors
in a sensible way instead of producing a cascade of process terminations.
* Max restart intensity settings for rabbit_vhost_sup_wrapper are
changed to accommodate integration test suites that rapidly set up
temporary vhosts and delete them (see the supervisor sketch below). In most message store failure scenarios
the new value is still low enough to lead to vhost supervision tree
termination when recovery is impossible.
There are also drive-by doc changes.
Fixes #1280.
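As background for the restart intensity point above, here is roughly what restart intensity looks like in an OTP supervisor callback. The module name, restart strategy, intensity and period values, and the child restart/shutdown settings are illustrative, not the values actually chosen in rabbit_vhost_sup_wrapper; only the child start MFA is taken from the log excerpt above:

```erlang
-module(vhost_sup_wrapper_sketch).
-behaviour(supervisor).
-export([start_link/1, init/1]).

start_link(VHost) ->
    supervisor:start_link(?MODULE, [VHost]).

%% If more than MaxR child restarts happen within MaxT seconds, the
%% supervisor itself terminates -- that is the "reached_max_restart_intensity"
%% seen in the log excerpt above.
init([VHost]) ->
    MaxR = 5,   %% maximum number of restarts tolerated...
    MaxT = 10,  %% ...within this many seconds
    VhostSup = {rabbit_vhost_sup,
                {rabbit_vhost_sup_wrapper, start_vhost_sup, [VHost]},
                transient, infinity, supervisor, [rabbit_vhost_sup]},
    {ok, {{one_for_all, MaxR, MaxT}, [VhostSup]}}.
```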
An intrinsic child causes the supervisor to stop with the `shutdown`
signal, which will crash the node under the `stop_node` strategy.
This can trigger a node crash in the case of a race condition between
deleting a vhost and a watcher timeout.
Fixes [#1280]
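For context, RabbitMQ's supervisor2 extends the standard OTP restart types with `intrinsic`: when an intrinsic child terminates, the supervisor shuts down as well. The sketch below contrasts that with `transient` using a made-up watcher module name and shutdown timeout; it is not the actual diff applied in the fix:

```erlang
-module(vhost_child_spec_sketch).
-export([child_spec/2]).

%% Illustrative only; 'vhost_watcher' is a hypothetical module name standing
%% in for the real watcher process, and the 30000 ms shutdown is made up.
child_spec(VHost, intrinsic) ->
    %% With supervisor2's 'intrinsic' restart type, even a normal exit of the
    %% child shuts the supervisor down; under a 'stop_node' strategy that
    %% shutdown can propagate all the way up and stop the node.
    {vhost_watcher, {vhost_watcher, start_link, [VHost]},
     intrinsic, 30000, worker, [vhost_watcher]};
child_spec(VHost, transient) ->
    %% With 'transient', a clean exit of the child leaves the supervisor
    %% (and therefore the node) running.
    {vhost_watcher, {vhost_watcher, start_link, [VHost]},
     transient, 30000, worker, [vhost_watcher]}.
```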