Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #1175: Add synchronization of disconnect procedure for server con… #1179

Merged
merged 6 commits into from
Feb 14, 2019

Conversation

aleksostapenko
Copy link
Contributor

…nections.

Copy link
Contributor

@krizhanovsky krizhanovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug scenarios in #1175 and #1080 are quite sophisticated - good analysing job! I have only several questions/objections.

I think we need a better architecture #687 (comment)

@@ -126,6 +126,13 @@ static const unsigned long tfw_srv_tmo_vals[] = { 1, 10, 100, 250, 500, 1000 };
TFW_WARN_MOD_ADDR(sock_srv, check, addr, TFW_NO_PORT, fmt, \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the file copyright

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -143,6 +150,7 @@ tfw_sock_srv_connect_try(TfwSrvConn *srv_conn)
&sk);
if (r) {
TFW_ERR("Unable to create server socket\n");
tfw_srv_conn_stop(srv_conn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function may be called multiple times on the timer and each time if can fail with the socket creation and each time you put the server - this seems not right. Why do we need to affect server refcounter here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we exit here on error from ss_sock_create(), there will be no more calls to tfw_sock_srv_connect_try() since the timer is expired (and, consequently, deactivated) and not rearmed (as e.g. below - after ss_connect() error). So, in fact - the connection will be stopped here and we need to set appropriate bit and put the server. This behavior provides determination of all states in which connection is allowed to be, and support consistency with the waiting cycle in tfw_sock_srv_disconnect() (as described in comment).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the error may happen at Nth reconnect attempt, and connection will never be connected again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added reconnecting on OOM.

@@ -229,10 +237,6 @@ tfw_sock_srv_connect_try_later(TfwSrvConn *srv_conn)
{
unsigned long timeout;

/* Don't rearm the reconnection timer if we're about to shutdown. */
if (unlikely(!ss_active()))
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should we reconnect if we're about to shutdown? This may lead new reconnection/shutdown races.

Copy link
Contributor Author

@aleksostapenko aleksostapenko Feb 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check doesn't save us from reconnections on shutdown: failovering thread can pass this check before we set false for __ss_active during shutdown process, and we can get into tfw_sock_srv_disconnect() during live reconnection attempt. To prevent all these races - the waiting cycle had been introduced in tfw_sock_srv_disconnect(); so now if we missed connection during shutdown and it has time to begin tfw_sock_srv_connect_try_later(), it will still be stopped before reconnection attempt - due to successful timer deactivation in cycle here; in the rare worst case (which we cannot avoid in any way, in current architecture), when the timer handler is already started, and del_timer_sync() failed and return 0 (after handler finished), then three variants are possible:

  1. Connection will be stopped here (see comment above); update: or connection destructor will be called (in case of OOM error);
  2. Connection destructor will be called here and connection will be definitely stopped in it (since TFW_CONN_B_DEL is already set);
  3. Connection is successfully restored, its reference count will became not equal TFW_CONN_DEATHCNT - then connection will be stopped in usual way, via tfw_connection_close() here.

In all these cases the race conditions are avoided, since we are waiting in tfw_sock_srv_disconnect() until connection will be stopped consistently before cleaning any of its resources.

@@ -307,18 +311,11 @@ tfw_srv_conn_release(TfwSrvConn *srv_conn)
* in deferred context after a short pause (in a timer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, tfw_srv_conn_release() is called only in this file, so please make it static and remove it's declaration from server.h.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

1. Resolve implicit declaration;
2. Remove 'tfw_connection_close()' call from 'tfw_sock_srv_connect_try()';
this call looked obsolete: sock's state is not established during this call,
so there are no usefull actions performed.
@@ -434,33 +430,76 @@ static const SsHooks tfw_sock_srv_ss_hooks = {
* If a server connection is closed at the time this function runs, then
* it had been closed by a backend before the shutdown, and the connection
* is still in failover (not re-connected yet). The resources attached to
* the connection had not been released, and it has to be done forcefully.
* the connection may had not been released, and it has to be done forcefully.
*/
static int
tfw_sock_srv_disconnect(TfwConn *conn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function always returns 0, so it can be void. Frankly, the non-success return code from disconnect() is quite a strange thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be one case when the function can return non-zero value: if tfw_connection_close() failed, so we cannot stop connection in this case, and should return error code.

@@ -143,6 +150,7 @@ tfw_sock_srv_connect_try(TfwSrvConn *srv_conn)
&sk);
if (r) {
TFW_ERR("Unable to create server socket\n");
tfw_srv_conn_stop(srv_conn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the error may happen at Nth reconnect attempt, and connection will never be connected again.

Copy link
Contributor

@vankoven vankoven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants