
Refactor the replication connections #280

Merged
merged 6 commits into master from rr_f on Aug 30, 2017

Conversation

@datoug (Contributor) commented Aug 21, 2017

Problem to solve

  1. FD leak caused by the client side not draining the read stream (related fixes: #265 "fix replicator FD leak: client side of websocket needs to drain all messages before returning otherwise the underlying connection will not be closed", #266 "Replicator FD leak: make sure client side keeps reading until it gets an error").
  2. Bug on the shutdown path that causes the sealed message to never reach the store.

Design philosophies

The read pump reads from the stream; the write pump writes to the stream.
The read pump communicates with the write pump through an internal channel: the read pump writes to that channel, and the write pump reads from it.

Graceful shutdown sequence (a sketch follows the list):

  1. The read pump gets a read error (the remote side shuts down the connection).
  2. The read pump closes the internal channel, then exits.
  3. The write pump detects that the internal channel is closed.
  4. The write pump calls stream.Done() and exits.
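
For illustration, here is a minimal sketch of this pattern in Go; the identifiers (pumpConn, msgCh, msgStream, and so on) are made up for the sketch and are not the actual cherami-server names:

```go
// Minimal sketch of the read-pump/write-pump pattern and the graceful
// shutdown sequence described above. All identifiers are illustrative.
package pumps

import "log"

type message struct{ payload []byte }

// msgStream stands in for the underlying websocket/thrift stream; only the
// calls used by the sketch are modeled.
type msgStream interface {
	Read() (*message, error) // blocks until a message or a read error arrives
	Write(*message) error
	Done() error // tells the stream this side is finished
}

type pumpConn struct {
	in    msgStream     // stream the read pump reads from
	out   msgStream     // stream the write pump writes to
	msgCh chan *message // internal channel between the two pumps
}

// readPump reads from the stream and forwards messages to the internal channel.
func (c *pumpConn) readPump() {
	// Step 2: closing the internal channel is how the write pump is told to stop.
	defer close(c.msgCh)
	for {
		msg, err := c.in.Read()
		if err != nil {
			// Step 1: the remote side shut down the connection (or another read error).
			log.Printf("read pump exiting: %v", err)
			return
		}
		c.msgCh <- msg
	}
}

// writePump drains the internal channel and writes to the stream.
func (c *pumpConn) writePump() {
	// Step 4: release the stream and exit.
	defer c.out.Done()
	// Step 3: the range loop ends once the read pump closes msgCh.
	for msg := range c.msgCh {
		if err := c.out.Write(msg); err != nil {
			log.Printf("write pump exiting on write error: %v", err)
			return
		}
	}
	log.Print("write pump exiting: internal channel closed")
}
```

The sketch covers only the graceful path listed above; the review threads below discuss additional handling (for example, a shutdown channel) for error paths on the write side.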

Other changes
Removed the error-handling cases in OutConn.go, since the same error handling is also done in the store. The replicator now serves as a proxy only.

@datoug requested review from thuningxu and kirg on August 21, 2017 at 21:53
@coveralls
Coverage Status: Coverage decreased (-0.05%) to 67.267% when pulling dea872b on rr_f into 2a2575a on master.

@coveralls
Coverage Status: Coverage increased (+0.2%) to 67.497% when pulling dea872b on rr_f into 2a2575a on master.

@datoug (Contributor, Author) commented Aug 24, 2017

@kirg For the empty credit case (https://github.com/uber/cherami-server/pull/280/files#diff-f8daca02f7ff8d4e7df757fe00f19bc4L148), I think we can rely on the idle timeout (creditFlowTimeout) to eventually close the pumps, instead of introducing another notification channel.
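
For illustration, a minimal sketch of relying on the idle timeout; creditFlowTimeout, creditsCh, and the callbacks are assumed names for the sketch, not the actual replicator code:

```go
package pumps

import "time"

const creditFlowTimeout = time.Minute // assumed value for the sketch

// creditPump sketches a pump that exits either when the other pump closes
// creditsCh or when the credit flow stays idle past creditFlowTimeout, with
// no dedicated notification channel for the empty-credit case.
func creditPump(creditsCh <-chan int32, expired func() bool, send func(int32) error) {
	for {
		select {
		case credit, ok := <-creditsCh:
			if !ok {
				return // channel closed by the other pump
			}
			if err := send(credit); err != nil {
				return
			}
		case <-time.After(creditFlowTimeout):
			// Idle for creditFlowTimeout: if the credit flow has expired, just exit.
			if expired() {
				return
			}
		}
	}
}
```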

@coveralls commented Aug 24, 2017
Coverage Status: Coverage increased (+0.2%) to 67.515% when pulling 152bd6b on rr_f into 2a2575a on master.

}
conn.logger.Info("in connection closed")
func (conn *inConnection) shutdown() {
close(conn.shutdownCh)
Contributor:

You might want to put a log line here .. so you know the reasons that the go-routines have gone, etc.
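
For reference, a sketch of the suggestion: the quoted shutdown() with a log line added; the wording of the message is illustrative, not from the PR:

```go
// Log the trigger so the logs explain why the go-routines went away.
func (conn *inConnection) shutdown() {
	conn.logger.Info("in connection shutdown requested; closing shutdownCh")
	close(conn.shutdownCh)
}
```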

}
}
}

func (conn *inConnection) writeMsgsStream() {
defer conn.stream.Done()
defer conn.wg.Done()
Contributor:

you should probably have "defer conn.wg.Done()" before "conn.stream.Done()" ..

Contributor (Author):

Good catch
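
For context, deferred calls run in last-in-first-out order, so deferring conn.wg.Done() first means it runs last; anything blocked in WaitUntilDone() is then guaranteed that stream.Done() has already run. A sketch of the suggested ordering, using the identifiers from the quoted snippet:

```go
// Defers run last-in-first-out: with wg.Done() deferred first, it runs last,
// i.e. only after stream.Done() has completed.
func (conn *inConnection) writeMsgsStream() {
	defer conn.wg.Done()     // runs second: WaitGroup released only after...
	defer conn.stream.Done() // ...this runs first: the stream is marked done
	// ... write pump loop ...
}
```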

conn.extentCreditExpiration()
localCredits += credit
case <-time.After(creditFlowTimeout):
conn.logger.Warn("credit flow timeout")
if conn.isCreditFlowExpired() {
conn.logger.Warn("credit flow expired")
go conn.close()
return
Contributor:

wait (and quit) on the "shutdownCh" here ..
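
A sketch of the suggestion, reusing the identifiers from the quoted snippet; conn.creditsCh and the surrounding loop shape are assumptions:

```go
// Quoted select with the suggested shutdown case added (sketch only).
for {
	select {
	case credit := <-conn.creditsCh: // assumed channel name
		conn.extentCreditExpiration()
		localCredits += credit
	case <-time.After(creditFlowTimeout):
		conn.logger.Warn("credit flow timeout")
		if conn.isCreditFlowExpired() {
			conn.logger.Warn("credit flow expired")
			go conn.close()
			return
		}
	case <-conn.shutdownCh:
		// Exit promptly when shutdown() closes shutdownCh, instead of
		// waiting for the next creditFlowTimeout tick.
		return
	}
}
```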

conn.extentCreditExpiration()
localCredits += credit
case <-flushTicker.C:
if err := conn.stream.Flush(); err != nil {
conn.logger.Error(`flush msg failed`)
go conn.close()
return
Contributor:

also wait (and return) on the shutdownCh here ..

}

func (conn *outConnection) writeCreditsStream() {
defer conn.stream.Done()
defer conn.wg.Done()
Contributor:

as with inconn, move this up one line ..

continue readloop
conn.m3Client.IncCounter(conn.metricsScope, metrics.ReplicatorOutConnMsgRead)
if rmc.GetType() == store.ReadMessageContentType_SEALED {
conn.logger.WithField(`SequenceNumber`, rmc.GetSealed().GetSequenceNumber()).Info(`extent sealed`)
Contributor:

for debug/assertion purposes .. do you want to remember that you saw a "sealed" message .. and in case you don't get an error from the next read, log an error or something?

Contributor (Author):

I removed that logic since we want the replicator to be a proxy only now?

Contributor:

I understand .. but do you think it might add value in case we are debugging issues, etc. I suspect we might never hit it though .. so I'll leave it to you.
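
For reference, one possible reading of the suggestion as a sketch (this bookkeeping did not go into the diff quoted above); readMsg and the loop shape are stand-ins, the other identifiers come from the quoted snippet:

```go
// Debug-only idea: remember that a SEALED message was seen; once the extent
// is sealed the stream is expected to end, so any further message is worth flagging.
sawSealed := false
readloop:
for {
	rmc, err := readMsg() // stand-in for the actual stream read call
	if err != nil {
		break readloop // expected once the extent has been sealed
	}
	if sawSealed {
		conn.logger.Error(`received another message after extent was sealed`)
	}
	if rmc.GetType() == store.ReadMessageContentType_SEALED {
		sawSealed = true
		conn.logger.WithField(`SequenceNumber`, rmc.GetSealed().GetSequenceNumber()).Info(`extent sealed`)
	}
	// ... forward the message to the write pump ...
}
```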

conn.logger.WithField(`Message`, msgErr.GetMessage()).Error(`received error from reading msg`)
go conn.close()
continue readloop
conn.m3Client.IncCounter(conn.metricsScope, metrics.ReplicatorOutConnMsgRead)
Contributor:

this will count "SEALED" also as a message .. I think you should not; that way tallying message counts across services will be easier.
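
A sketch of the suggested counting, reusing the identifiers from the quoted snippet; the exact placement inside the read loop is an assumption:

```go
if rmc.GetType() == store.ReadMessageContentType_SEALED {
	conn.logger.WithField(`SequenceNumber`, rmc.GetSealed().GetSequenceNumber()).Info(`extent sealed`)
} else {
	// Count only regular messages, so message tallies line up across services.
	conn.m3Client.IncCounter(conn.metricsScope, metrics.ReplicatorOutConnMsgRead)
}
```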

<-outConn.closeChannel

outConn.WaitUntilDone()
inConn.WaitUntilDone()
Contributor:

i might suggest setting up a waitgroup within this function .. and passing a pointer to the waitgroup to inconn/outconn, etc that they increment/decrement .. and you can just do a wg.Wait() here that will automatically wait for both, etc.

Contributor (Author):

seems like a lack of encapsulation if we do it that way? (although I saw some code in our codebase that does that).
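
For reference, a minimal, self-contained sketch of the waitgroup approach being suggested; all names are illustrative, not the actual replicator code:

```go
package pumps

import "sync"

// Sketch of the reviewer's suggestion: the caller owns a single sync.WaitGroup,
// hands a pointer to both connections, and waits once for both to finish.
type pumpedConn struct {
	wg *sync.WaitGroup
}

func (c *pumpedConn) open() {
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		// ... run this connection's pumps until they exit ...
	}()
}

func runReplication(inConn, outConn *pumpedConn) {
	var wg sync.WaitGroup
	inConn.wg, outConn.wg = &wg, &wg
	inConn.open()
	outConn.open()
	wg.Wait() // waits for both connections, replacing the two WaitUntilDone() calls
}
```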

<-outConn.closeChannel

outConn.WaitUntilDone()
inConn.WaitUntilDone()
Contributor:

same as previous comment ..

@coveralls commented Aug 29, 2017
Coverage Status: Coverage increased (+0.3%) to 67.652% when pulling 646aba3 on rr_f into 2a2575a on master.

@coveralls commented Aug 29, 2017
Coverage Status: Coverage increased (+0.1%) to 67.466% when pulling 646aba3 on rr_f into 2a2575a on master.


@datoug merged commit 7486ebf into master on Aug 30, 2017
@datoug deleted the rr_f branch on August 30, 2017 at 22:08
datoug added a commit that referenced this pull request Aug 30, 2017
* Replicator: refactor the stream closure mechanism

* update

* refactor replicator connections

* outConn close msg channel in read pump so that inConn can be notified and close the write pump

* address comments

* revert change on glide.lock
3 participants