
Refactor the replication connections #280

Merged
merged 6 commits into master from rr_f on Aug 30, 2017

Conversation

@datoug (Contributor) commented Aug 21, 2017

Problem to solve

  1. FD leak caused by the client side not draining the read stream (related fixes: #265 "fix replicator FD leak: client side of websocket needs to drain all messages before returning otherwise the underlying connection will not be closed", #266 "Replicator FD leak: make sure client side keeps reading until it gets an error").
  2. Bug on the shutdown path that causes the sealed message to never reach the store.

Design philosophies

The read pump reads from the stream; the write pump writes to the stream.
The read pump communicates with the write pump through an internal channel: the read pump writes to that channel, and the write pump reads from it.

Graceful shutdown sequence (a sketch follows the list):

  1. The read pump gets a read error (the remote side shuts down the connection).
  2. The read pump closes the internal channel, then exits.
  3. The write pump detects that the internal channel is closed.
  4. The write pump calls stream.Done() and exits.
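
For illustration, here is a minimal sketch of this pattern in Go; the identifiers (pumpConn, msgCh, msgStream, and so on) are made up for the sketch and are not the actual cherami-server names:

```go
// Minimal sketch of the read-pump/write-pump pattern and the graceful
// shutdown sequence described above. All identifiers are illustrative.
package pumps

import "log"

type message struct{ payload []byte }

// msgStream stands in for the underlying websocket/thrift stream; only the
// calls used by the sketch are modeled.
type msgStream interface {
	Read() (*message, error) // blocks until a message or a read error arrives
	Write(*message) error
	Done() error // tells the stream this side is finished
}

type pumpConn struct {
	in    msgStream     // stream the read pump reads from
	out   msgStream     // stream the write pump writes to
	msgCh chan *message // internal channel between the two pumps
}

// readPump reads from the stream and forwards messages to the internal channel.
func (c *pumpConn) readPump() {
	// Step 2: closing the internal channel is how the write pump is told to stop.
	defer close(c.msgCh)
	for {
		msg, err := c.in.Read()
		if err != nil {
			// Step 1: the remote side shut down the connection (or another read error).
			log.Printf("read pump exiting: %v", err)
			return
		}
		c.msgCh <- msg
	}
}

// writePump drains the internal channel and writes to the stream.
func (c *pumpConn) writePump() {
	// Step 4: release the stream and exit.
	defer c.out.Done()
	// Step 3: the range loop ends once the read pump closes msgCh.
	for msg := range c.msgCh {
		if err := c.out.Write(msg); err != nil {
			log.Printf("write pump exiting on write error: %v", err)
			return
		}
	}
	log.Print("write pump exiting: internal channel closed")
}
```

The sketch covers only the graceful path listed above; the review threads below discuss additional handling (for example, a shutdown channel) for error paths on the write side.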

Other changes
Removed the error-handling cases in OutConn.go, since the same error handling is also done in the store. The replicator now serves as a proxy only.

@datoug requested review from thuningxu and kirg on August 21, 2017 at 21:53
@coveralls
Coverage Status: Coverage decreased (-0.05%) to 67.267% when pulling dea872b on rr_f into 2a2575a on master.

@coveralls
Coverage Status: Coverage increased (+0.2%) to 67.497% when pulling dea872b on rr_f into 2a2575a on master.

@datoug (Contributor, Author) commented Aug 24, 2017

@kirg For the empty credit case (https://github.com/uber/cherami-server/pull/280/files#diff-f8daca02f7ff8d4e7df757fe00f19bc4L148), I think we can rely on the idle timeout (creditFlowTimeout) to eventually close the pumps, instead of introducing another notification channel.
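
For illustration, a minimal sketch of relying on the idle timeout; creditFlowTimeout, creditsCh, and the callbacks are assumed names for the sketch, not the actual replicator code:

```go
package pumps

import "time"

const creditFlowTimeout = time.Minute // assumed value for the sketch

// creditPump sketches a pump that exits either when the other pump closes
// creditsCh or when the credit flow stays idle past creditFlowTimeout, with
// no dedicated notification channel for the empty-credit case.
func creditPump(creditsCh <-chan int32, expired func() bool, send func(int32) error) {
	for {
		select {
		case credit, ok := <-creditsCh:
			if !ok {
				return // channel closed by the other pump
			}
			if err := send(credit); err != nil {
				return
			}
		case <-time.After(creditFlowTimeout):
			// Idle for creditFlowTimeout: if the credit flow has expired, just exit.
			if expired() {
				return
			}
		}
	}
}
```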

@coveralls commented Aug 24, 2017
Coverage Status: Coverage increased (+0.2%) to 67.515% when pulling 152bd6b on rr_f into 2a2575a on master.

}
conn.logger.Info("in connection closed")
func (conn *inConnection) shutdown() {
close(conn.shutdownCh)
Contributor:

You might want to put a log line here .. so you know the reasons that the go-routines have gone, etc.
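
For reference, a sketch of the suggestion: the quoted shutdown() with a log line added; the wording of the message is illustrative, not from the PR:

```go
// Log the trigger so the logs explain why the go-routines went away.
func (conn *inConnection) shutdown() {
	conn.logger.Info("in connection shutdown requested; closing shutdownCh")
	close(conn.shutdownCh)
}
```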

}
}
}

func (conn *inConnection) writeMsgsStream() {
defer conn.stream.Done()
defer conn.wg.Done()
Contributor:

you should probably have "defer conn.wg.Done()" before "conn.stream.Done()" ..

Contributor (Author):

Good catch
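
For context, deferred calls run in last-in-first-out order, so deferring conn.wg.Done() first means it runs last; anything blocked in WaitUntilDone() is then guaranteed that stream.Done() has already run. A sketch of the suggested ordering, using the identifiers from the quoted snippet:

```go
// Defers run last-in-first-out: with wg.Done() deferred first, it runs last,
// i.e. only after stream.Done() has completed.
func (conn *inConnection) writeMsgsStream() {
	defer conn.wg.Done()     // runs second: WaitGroup released only after...
	defer conn.stream.Done() // ...this runs first: the stream is marked done
	// ... write pump loop ...
}
```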

conn.extentCreditExpiration()
localCredits += credit
case <-time.After(creditFlowTimeout):
conn.logger.Warn("credit flow timeout")
if conn.isCreditFlowExpired() {
conn.logger.Warn("credit flow expired")
go conn.close()
return
Contributor:

wait (and quit) on the "shutdownCh" here ..
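
A sketch of the suggestion, reusing the identifiers from the quoted snippet; conn.creditsCh and the surrounding loop shape are assumptions:

```go
// Quoted select with the suggested shutdown case added (sketch only).
for {
	select {
	case credit := <-conn.creditsCh: // assumed channel name
		conn.extentCreditExpiration()
		localCredits += credit
	case <-time.After(creditFlowTimeout):
		conn.logger.Warn("credit flow timeout")
		if conn.isCreditFlowExpired() {
			conn.logger.Warn("credit flow expired")
			go conn.close()
			return
		}
	case <-conn.shutdownCh:
		// Exit promptly when shutdown() closes shutdownCh, instead of
		// waiting for the next creditFlowTimeout tick.
		return
	}
}
```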

conn.extentCreditExpiration()
localCredits += credit
case <-flushTicker.C:
if err := conn.stream.Flush(); err != nil {
conn.logger.Error(`flush msg failed`)
go conn.close()
return
Contributor:

also wait (and return) on the shutdownCh here ..

}

func (conn *outConnection) writeCreditsStream() {
defer conn.stream.Done()
defer conn.wg.Done()
Contributor:

as with inconn, move this up one line ..

continue readloop
conn.m3Client.IncCounter(conn.metricsScope, metrics.ReplicatorOutConnMsgRead)
if rmc.GetType() == store.ReadMessageContentType_SEALED {
conn.logger.WithField(`SequenceNumber`, rmc.GetSealed().GetSequenceNumber()).Info(`extent sealed`)
Contributor:

for debug/assertion purposes .. do you want to remember that you saw a "sealed" message .. and in case you don't get an error from the next read, log an error or something?

Contributor (Author):

I removed that logic since we want the replicator to be a proxy only now?

Contributor:

I understand .. but do you think it might add value in case we are debugging issues, etc. I suspect we might never hit it though .. so I'll leave it to you.
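
For reference, one possible reading of the suggestion as a sketch (this bookkeeping did not go into the diff quoted above); readMsg and the loop shape are stand-ins, the other identifiers come from the quoted snippet:

```go
// Debug-only idea: remember that a SEALED message was seen; once the extent
// is sealed the stream is expected to end, so any further message is worth flagging.
sawSealed := false
readloop:
for {
	rmc, err := readMsg() // stand-in for the actual stream read call
	if err != nil {
		break readloop // expected once the extent has been sealed
	}
	if sawSealed {
		conn.logger.Error(`received another message after extent was sealed`)
	}
	if rmc.GetType() == store.ReadMessageContentType_SEALED {
		sawSealed = true
		conn.logger.WithField(`SequenceNumber`, rmc.GetSealed().GetSequenceNumber()).Info(`extent sealed`)
	}
	// ... forward the message to the write pump ...
}
```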

conn.logger.WithField(`Message`, msgErr.GetMessage()).Error(`received error from reading msg`)
go conn.close()
continue readloop
conn.m3Client.IncCounter(conn.metricsScope, metrics.ReplicatorOutConnMsgRead)
Contributor:

this will count "SEALED" also as a message .. I think you should not; that way tallying message counts across services will be easier.
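
A sketch of the suggested counting, reusing the identifiers from the quoted snippet; the exact placement inside the read loop is an assumption:

```go
if rmc.GetType() == store.ReadMessageContentType_SEALED {
	conn.logger.WithField(`SequenceNumber`, rmc.GetSealed().GetSequenceNumber()).Info(`extent sealed`)
} else {
	// Count only regular messages, so message tallies line up across services.
	conn.m3Client.IncCounter(conn.metricsScope, metrics.ReplicatorOutConnMsgRead)
}
```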

<-outConn.closeChannel

outConn.WaitUntilDone()
inConn.WaitUntilDone()
Contributor:

i might suggest setting up a waitgroup within this function .. and passing a pointer to the waitgroup to inconn/outconn, etc that they increment/decrement .. and you can just do a wg.Wait() here that will automatically wait for both, etc.

Contributor (Author):

seems like a lack of encapsulation if we do it that way? (although I saw some code in our codebase that does that).
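
For reference, a minimal, self-contained sketch of the waitgroup approach being suggested; all names are illustrative, not the actual replicator code:

```go
package pumps

import "sync"

// Sketch of the reviewer's suggestion: the caller owns a single sync.WaitGroup,
// hands a pointer to both connections, and waits once for both to finish.
type pumpedConn struct {
	wg *sync.WaitGroup
}

func (c *pumpedConn) open() {
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		// ... run this connection's pumps until they exit ...
	}()
}

func runReplication(inConn, outConn *pumpedConn) {
	var wg sync.WaitGroup
	inConn.wg, outConn.wg = &wg, &wg
	inConn.open()
	outConn.open()
	wg.Wait() // waits for both connections, replacing the two WaitUntilDone() calls
}
```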

<-outConn.closeChannel

outConn.WaitUntilDone()
inConn.WaitUntilDone()
Contributor:

same as previous comment ..

@coveralls commented Aug 29, 2017
Coverage Status: Coverage increased (+0.3%) to 67.652% when pulling 646aba3 on rr_f into 2a2575a on master.

@coveralls commented Aug 29, 2017
Coverage Status: Coverage increased (+0.1%) to 67.466% when pulling 646aba3 on rr_f into 2a2575a on master.


@datoug merged commit 7486ebf into master on Aug 30, 2017
@datoug deleted the rr_f branch on August 30, 2017 at 22:08
datoug added a commit that referenced this pull request Aug 30, 2017
* Replicator: refactor the stream closure mechanism

* update

* refactor replicator connections

* outConn close msg channel in read pump so that inConn can be notified and close the write pump

* address comments

* revert change on glide.lock
3 participants