
Multiple processes tripped up by a non-existent vhost leads to node termination #1280

Closed
michaelklishin opened this issue Jun 30, 2017 · 1 comment

michaelklishin commented Jun 30, 2017

Accidentally reproduced while working on an issue in CLI tools:

2017-06-30 10:45:37.649 [error] <0.505.0> Supervisor {<0.505.0>,rabbit_vhost_sup_wrapper} had child rabbit_vhost_sup started with rabbit_vhost_sup_wrapper:start_vhost_sup(<<"test_vhost1">>) at undefined exit with reason {'EXIT',{{badmatch,{error,{no_such_vhost,<<"test_vhost1">>}}},[{rabbit_recovery_terms,start,1,[{file,"src/rabbit_recovery_terms.erl"},{line,51}]},{rabbit_queue_index,start,2,[{file,"src/rabbit_queue_index.erl"},{line,502}]},{rabbit_variable_queue,start,2,[{file,"src/rabbit_variable_queue.erl"},{line,482}]},{rabbit_priority_queue,start,2,[{file,"src/rabbit_priority_queue.erl"},{line,92}]},{rabbit_amqqueue,recover,1,[{file,"src/rabbit_amqqueue.erl"},{line,236}]},{rabbit_vhost,recover,1,[{file,...},...]},...]}} in context start_error
2017-06-30 10:45:37.649 [info] <0.505.0> Making sure data directory '/var/folders/b3/6thx2ls900z6h456z2xb658m0000gn/T/rabbitmq-test-instances/rabbit/mnesia/rabbit/msg_stores/vhosts/BPY12RLD3HXBADXVD0QCEE7DC' for vhost 'test_vhost1' exists
2017-06-30 10:45:37.650 [error] <0.505.0> Supervisor {<0.505.0>,rabbit_vhost_sup_wrapper} had child rabbit_vhost_sup started with rabbit_vhost_sup_wrapper:start_vhost_sup(<<"test_vhost1">>) at undefined exit with reason {'EXIT',{{badmatch,{error,{no_such_vhost,<<"test_vhost1">>}}},[{rabbit_recovery_terms,start,1,[{file,"src/rabbit_recovery_terms.erl"},{line,51}]},{rabbit_queue_index,start,2,[{file,"src/rabbit_queue_index.erl"},{line,502}]},{rabbit_variable_queue,start,2,[{file,"src/rabbit_variable_queue.erl"},{line,482}]},{rabbit_priority_queue,start,2,[{file,"src/rabbit_priority_queue.erl"},{line,92}]},{rabbit_amqqueue,recover,1,[{file,"src/rabbit_amqqueue.erl"},{line,236}]},{rabbit_vhost,recover,1,[{file,...},...]},...]}} in context start_error
2017-06-30 10:45:37.650 [error] <0.505.0> Supervisor {<0.505.0>,rabbit_vhost_sup_wrapper} had child rabbit_vhost_sup started with rabbit_vhost_sup_wrapper:start_vhost_sup(<<"test_vhost1">>) at {restarting,<0.506.0>} exit with reason reached_max_restart_intensity in context shutdown
2017-06-30 10:45:37.650 [error] <0.328.0> Supervisor rabbit_vhost_sup_sup had child rabbit_vhost started with rabbit_vhost_sup_wrapper:start_link(<<"test_vhost1">>) at <0.505.0> exit with reason shutdown in context child_terminated
2017-06-30 10:45:37.650 [error] <0.328.0> Supervisor rabbit_vhost_sup_sup had child rabbit_vhost started with rabbit_vhost_sup_wrapper:start_link(<<"test_vhost1">>) at <0.505.0> exit with reason reached_max_restart_intensity in context shutdown
2017-06-30 10:45:37.650 [info] <0.338.0> Stopping message store for directory '/var/folders/b3/6thx2ls900z6h456z2xb658m0000gn/T/rabbitmq-test-instances/rabbit/mnesia/rabbit/msg_stores/vhosts/9FYW91Z090TQ8GC4PKLCNIK12/msg_store_persistent'
2017-06-30 10:45:37.650 [error] <0.257.0> Supervisor rabbit_sup had child rabbit_vhost_sup_sup started with rabbit_vhost_sup_sup:start_link() at <0.328.0> exit with reason shutdown in context child_terminated
2017-06-30 10:45:37.650 [info] <0.371.0> Stopping message store for directory '/var/folders/b3/6thx2ls900z6h456z2xb658m0000gn/T/rabbitmq-test-instances/rabbit/mnesia/rabbit/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent'
2017-06-30 10:45:37.650 [error] <0.257.0> Supervisor rabbit_sup had child rabbit_vhost_sup_sup started with rabbit_vhost_sup_sup:start_link() at <0.328.0> exit with reason reached_max_restart_intensity in context shutdown

To reproduce this, run an integration test that creates a vhost, does something very quickly
(e.g. in our CLI suite, runs a command that fails validation immediately) and deletes the vhost.
As a result, one of the vhost supervision tree processes can trip up because the record no longer
exists due to a concurrent deletion.
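
A minimal sketch of that sequence from a test, assuming the 3.7-era internal API where rabbit_vhost:add/2 and rabbit_vhost:delete/2 take an acting-user argument (older branches use add/1 and delete/1, so treat the exact calls as an assumption):

repro_vhost_race() ->
    VHost = <<"test_vhost1">>,
    %% create the vhost, which also starts its per-vhost supervision tree
    _ = rabbit_vhost:add(VHost, <<"tests">>),
    %% do something trivial against it, e.g. a CLI command that fails
    %% validation immediately, then delete the vhost right away
    _ = rabbit_vhost:delete(VHost, <<"tests">>),
    %% the vhost supervisor may still be starting here, so recovery can hit
    %% {error, {no_such_vhost, VHost}} and crash as in the log above
    ok.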

Such cases should be logged as errors and otherwise ignored.
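
In code, that means the recovery path should treat a concurrently deleted vhost as a no-op instead of letting the badmatch propagate up the supervision tree; a rough sketch of the intended handling (open_recovery_terms/1 is a hypothetical helper, not the actual rabbit_recovery_terms code):

start(VHost) ->
    case open_recovery_terms(VHost) of            %% hypothetical helper
        {ok, _Tab} ->
            ok;
        {error, {no_such_vhost, _}} = Err ->
            %% the vhost was deleted concurrently: log an error and
            %% return instead of crashing the vhost supervision tree
            rabbit_log:error("Recovery terms: vhost '~s' no longer exists, skipping it",
                             [VHost]),
            Err
    end.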

@michaelklishin michaelklishin added this to the 3.7.0 milestone Jun 30, 2017
@michaelklishin michaelklishin self-assigned this Jun 30, 2017
@michaelklishin michaelklishin changed the title from "rabbit_vhost_sup_wrapper:start_vhost_sup/1 does not handle non-existent vhosts gracefully" to "Recovery terms process tripped up by a non-existent vhost leads to node termination" Jun 30, 2017
@michaelklishin michaelklishin changed the title from "Recovery terms process tripped up by a non-existent vhost leads to node termination" to "Multiple processes tripped up by a non-existent vhost leads to node termination" Jun 30, 2017
@michaelklishin (Member, Author)

Digging deeper and fixing a bunch of terminations here and there (we now log an error and do nothing instead), I can still reproduce node shutdown because the vhost watcher can run into a non-existent vhost. That is by design, see #1158. We will revisit what the default strategy should be.

michaelklishin added a commit that referenced this issue Jul 1, 2017
When a vhost is created and immediately deleted, those operations
can end up executing concurrently. This changes a number of things
to be more resilient:

 * Default vhost restart strategy changes to "continue" (previously "ignore").
   It's a tough default to pick, but we believe a single vhost terminating
   an entire node that may serve many users does more damage than asking
   environments with only one vhost to opt into an alternative strategy
   (a config sketch for overriding the default follows this commit message).
   Note that message store or vhost supervision tree termination should be
   a very rare event.

   Per discussion with @kjnilsson.

 * For vhost [message store] recovery, we now log "no_such_vhost" errors
   in a sensible way instead of a bunch of process terminations.

 * Max restart intensity settings for rabbit_vhost_sup_wrapper are
   changed to accommodate integration test suites that rapidly set up
   temporary vhosts and delete them. In most message store failure scenarios
   the new value is still low enough to lead to vhost supervision tree
   termination when recovery is impossible.

There are also drive-by doc changes.

Fixes #1280.
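
For operators of single-vhost nodes who prefer the old fail-fast behaviour, the strategy can presumably be overridden in advanced.config; a sketch, assuming the value is read from the rabbit application environment under a vhost_restart_strategy key (the key name is an assumption; continue and stop_node are the values mentioned in this thread):

%% advanced.config sketch; 'continue' is the new default described above,
%% 'stop_node' restores the "a failed vhost takes the node down" behaviour.
[
  {rabbit, [
    {vhost_restart_strategy, stop_node}
  ]}
].
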
hairyhum pushed a commit that referenced this issue Jul 5, 2017
An intrinsic child causes the supervisor to stop with the `shutdown`
signal, which will crash the node with the `stop_node` strategy.
This can trigger a node crash in the case of a race condition between
deleting a vhost and a watcher timeout.
Fixes [#1280]
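
For context, RabbitMQ's supervisor2 adds an intrinsic restart type on top of the stock OTP ones: abnormal exits are restarted as with transient, but a clean exit of the child also shuts the supervisor itself down. A sketch of the difference using old-style child specs (the child id and MFA below are illustrative, not the actual watcher spec):

%% 'intrinsic': a clean exit of the child shuts the supervisor down too,
%% which under the stop_node strategy takes the whole node with it.
{vhost_watcher,
 {rabbit_vhost_process, start_link, [VHost]},
 intrinsic, 30000, worker, [rabbit_vhost_process]}.

%% 'transient': a clean exit just means the child is not restarted;
%% the rest of the supervision tree keeps running.
{vhost_watcher,
 {rabbit_vhost_process, start_link, [VHost]},
 transient, 30000, worker, [rabbit_vhost_process]}.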