Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only call vhost status for fully started nodes #1482

Merged
merged 4 commits into from
Jan 25, 2018

Conversation

hairyhum
Copy link
Contributor

If a node is restarting, vhost status can throw an error because an ETS table for vhosts is not created yet.
This PR makes sure vhosts status only called for fully started nodes.

Make it return only nodes which are fully started
instead of what mnesia thinks is started.
In all places we use them, we assume they should
be started.

[#154569776]
@hairyhum hairyhum added this to the 3.7.3 milestone Jan 23, 2018
@michaelklishin michaelklishin changed the title Rabbitmq server vhost sup sup badarg Only call vhost status for fully started nodes Jan 24, 2018
Copy link
Member

@michaelklishin michaelklishin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests hang unless 85dcb49 is reverted. The smallest example to try: gmake ct-channel_interceptor.

@michaelklishin
Copy link
Member

Here's one thing that definitely looks suspicious in the node log:

2018-01-24 15:38:39.112 [info] <0.198.0> Adding vhost '/'
2018-01-24 15:38:39.167 [info] <0.198.0> Creating user 'guest'
2018-01-24 15:38:39.174 [info] <0.198.0> Setting user tags for user 'guest' to [administrator]
2018-01-24 15:38:39.181 [info] <0.198.0> Setting permissions for 'guest' in '/' to '.*', '.*', '.*'
2018-01-24 15:38:39.199 [info] <0.462.0> started TCP Listener on [::]:21048
2018-01-24 15:38:39.204 [info] <0.478.0> started SSL Listener on [::]:21049
2018-01-24 15:38:39.217 [info] <0.198.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rmq-ct-register_interceptor-1-21048@localhost'
2018-01-24 15:38:39.228 [info] <0.198.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rmq-ct-register_interceptor-1-21048@localhost'
2018-01-24 15:38:39.229 [info] <0.33.0> Application rabbit started on node 'rmq-ct-register_interceptor-1-21048@localhost'
2018-01-24 15:38:39.229 [info] <0.33.0> Application rabbitmq_ct_helpers started on node 'rmq-ct-register_interceptor-1-21048@localhost'
2018-01-24 15:38:39.230 [info] <0.33.0> Application rabbitmq_ct_client_helpers started on node 'rmq-ct-register_interceptor-1-21048@localhost'
2018-01-24 15:38:39.230 [notice] <0.89.0> Changed loghwm of /Users/antares/Development/RabbitMQ/umbrella.git/deps/rabbit/logs/ct_run.ct_rabbit@warp10.2018-01-24_15.37.49/deps.rabbit.channel_interceptor_SUITE.logs/run.2018-01-24_15.37.50/log_private/rmq-ct-register_interceptor-1-21048@localhost/log/rmq-ct-register_interceptor-1-21048@localhost.log to 50
2018-01-24 15:38:39.293 [info] <0.5.0> Server startup complete; 0 plugins started.
2018-01-24 15:38:40.651 [info] <0.407.0> node ct_rabbit@warp10 up
2018-01-24 15:38:41.238 [info] <0.523.0> accepting AMQP connection <0.523.0> (127.0.0.1:51527 -> 127.0.0.1:21048)
2018-01-24 15:38:41.287 [error] <0.523.0> Error on AMQP connection <0.523.0> (127.0.0.1:51527 -> 127.0.0.1:21048, vhost: 'none', user: 'guest', state: opening), channel 0:
 {handshake_error,opening,
                 {amqp_error,internal_error,
                             "access to vhost '/' refused for user 'guest': vhost '/' is down",
                             'connection.open'}}
2018-01-24 15:38:41.291 [info] <0.523.0> closing AMQP connection <0.523.0> (127.0.0.1:51527 -> 127.0.0.1:21048, vhost: 'none', user: 'guest')
2018-01-24 15:52:34.032 [info] <0.407.0> node ct_rabbit@warp10 down: connection_closed
2018-01-24 15:53:05.802 [info] <0.53.0> SIGTERM received - shutting down
2018-01-24 15:53:05.802 [info] <0.198.0> Peer discovery backend rabbit_peer_discovery_classic_config does not support registration, skipping unregistration.
2018-01-24 15:53:05.803 [info] <0.478.0> stopped SSL Listener on [::]:21049

@michaelklishin
Copy link
Member

@hairyhum and I have a decent hypothesis about what's going on: a newly booting node will try to create the default vhost on all running nodes but with the new all_nodes/0 implementation there will be none (since the node hasn't finished booting yet). So it's a chicken and egg problem. More importantly, all_nodes/0 doesn't have to change for this PR to address its original issue.

@hairyhum
Copy link
Contributor Author

With all_running/0 changes reverted a broker can now report that a vhost is failed on some nodes in the management UI when a node is restarting. This is still better than returning 500 and polluting logs.

@Ayanda-D
Copy link
Contributor

Interesting. Although a stricter all_running clause which verifies both from mnesia, and is_running per acquired node would be good, for instances like these where we expect the vhost's supervision tree to already been setup. i,e. rabbit running already, not just the node. As it stands, context of running in rabbit_nodes module doesn't make sense. Something for the future 👍

@Ayanda-D
Copy link
Contributor

& in context of info calls such as that from the UI or CTL doing rabbit_nodes:all_running/0 only without checking the application as well. Now that doesn't make sense.

@hairyhum
Copy link
Contributor Author

hairyhum commented Jan 24, 2018

I think I've made a mistake when created a separate ETS tables for vhost Pid registry. It should have went to the vhost table in mnesia. It's not that mnesia is easier to manage, but at least it's consistent with itself.

@michaelklishin what do you think of moving vhost Pids to mnesia? Of course we cannot do that in 3.7 anymore, but can be something for 3.8 and above. So that will be a separate PR.

@michaelklishin
Copy link
Member

A node-local table makes sense to me. I don't have an opinion on whether it would be meaningfully easier to maintain than the ETS version.

@michaelklishin michaelklishin merged commit 7429f0c into v3.7.x Jan 25, 2018
@lukebakken lukebakken deleted the rabbitmq-server-vhost-sup-sup-badarg branch January 25, 2018 00:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants