-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #1175: Add synchronization of disconnect procedure for server con… #1179
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bug scenarios in #1175 and #1080 are quite sophisticated - good analysing job! I have only several questions/objections.
I think we need a better architecture #687 (comment)
@@ -126,6 +126,13 @@ static const unsigned long tfw_srv_tmo_vals[] = { 1, 10, 100, 250, 500, 1000 }; | |||
TFW_WARN_MOD_ADDR(sock_srv, check, addr, TFW_NO_PORT, fmt, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the file copyright
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
tempesta_fw/sock_srv.c
Outdated
@@ -143,6 +150,7 @@ tfw_sock_srv_connect_try(TfwSrvConn *srv_conn) | |||
&sk); | |||
if (r) { | |||
TFW_ERR("Unable to create server socket\n"); | |||
tfw_srv_conn_stop(srv_conn); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function may be called multiple times on the timer and each time if can fail with the socket creation and each time you put the server - this seems not right. Why do we need to affect server refcounter here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we exit here on error from ss_sock_create()
, there will be no more calls to tfw_sock_srv_connect_try()
since the timer is expired (and, consequently, deactivated) and not rearmed (as e.g. below - after ss_connect()
error). So, in fact - the connection will be stopped here and we need to set appropriate bit and put the server. This behavior provides determination of all states in which connection is allowed to be, and support consistency with the waiting cycle in tfw_sock_srv_disconnect()
(as described in comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the error may happen at Nth reconnect attempt, and connection will never be connected again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added reconnecting on OOM.
@@ -229,10 +237,6 @@ tfw_sock_srv_connect_try_later(TfwSrvConn *srv_conn) | |||
{ | |||
unsigned long timeout; | |||
|
|||
/* Don't rearm the reconnection timer if we're about to shutdown. */ | |||
if (unlikely(!ss_active())) | |||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should we reconnect if we're about to shutdown? This may lead new reconnection/shutdown races.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check doesn't save us from reconnections on shutdown: failovering thread can pass this check before we set false
for __ss_active
during shutdown process, and we can get into tfw_sock_srv_disconnect()
during live reconnection attempt. To prevent all these races - the waiting cycle had been introduced in tfw_sock_srv_disconnect()
; so now if we missed connection during shutdown and it has time to begin tfw_sock_srv_connect_try_later()
, it will still be stopped before reconnection attempt - due to successful timer deactivation in cycle here; in the rare worst case (which we cannot avoid in any way, in current architecture), when the timer handler is already started, and del_timer_sync()
failed and return 0 (after handler finished), then three variants are possible:
- Connection will be stopped here (see comment above); update: or connection destructor will be called (in case of OOM error);
- Connection destructor will be called here and connection will be definitely stopped in it (since
TFW_CONN_B_DEL
is already set); - Connection is successfully restored, its reference count will became not equal
TFW_CONN_DEATHCNT
- then connection will be stopped in usual way, viatfw_connection_close()
here.
In all these cases the race conditions are avoided, since we are waiting in tfw_sock_srv_disconnect()
until connection will be stopped consistently before cleaning any of its resources.
tempesta_fw/sock_srv.c
Outdated
@@ -307,18 +311,11 @@ tfw_srv_conn_release(TfwSrvConn *srv_conn) | |||
* in deferred context after a short pause (in a timer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, tfw_srv_conn_release()
is called only in this file, so please make it static and remove it's declaration from server.h
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
1. Resolve implicit declaration; 2. Remove 'tfw_connection_close()' call from 'tfw_sock_srv_connect_try()'; this call looked obsolete: sock's state is not established during this call, so there are no usefull actions performed.
@@ -434,33 +430,76 @@ static const SsHooks tfw_sock_srv_ss_hooks = { | |||
* If a server connection is closed at the time this function runs, then | |||
* it had been closed by a backend before the shutdown, and the connection | |||
* is still in failover (not re-connected yet). The resources attached to | |||
* the connection had not been released, and it has to be done forcefully. | |||
* the connection may had not been released, and it has to be done forcefully. | |||
*/ | |||
static int | |||
tfw_sock_srv_disconnect(TfwConn *conn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function always returns 0
, so it can be void
. Frankly, the non-success return code from disconnect()
is quite a strange thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be one case when the function can return non-zero value: if tfw_connection_close()
failed, so we cannot stop connection in this case, and should return error code.
tempesta_fw/sock_srv.c
Outdated
@@ -143,6 +150,7 @@ tfw_sock_srv_connect_try(TfwSrvConn *srv_conn) | |||
&sk); | |||
if (r) { | |||
TFW_ERR("Unable to create server socket\n"); | |||
tfw_srv_conn_stop(srv_conn); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the error may happen at Nth reconnect attempt, and connection will never be connected again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
…nections.