
clientv3: fix balancer and upgrade gRPC v1.7.x #8828

Closed · gyuho opened this issue Nov 7, 2017 · 16 comments

gyuho (Contributor) commented Nov 7, 2017

clientv3: change balancer for gRPC v1.7.x

v1.6.0

  1. cluster of endpoints A and B
  2. pin A
  3. updateNotifyLoop case upc == nil b.notifyCh <- A
  4. A becomes blackholed
  5. PUT request context times out on A
  6. b.notifyCh <- [] in retry.go to drain connection A
  7. grpc.tearDown(errConnDrain) on A
  8. down(errConnDrain) on A
  9. unpin A
  10. pin B
  11. updateNotifyLoop case upc == nil b.notifyCh <- B
  12. following PUT (retry) request succeeds

v1.7.x

No more grpc.tearDown(errConnDrain); there is only subConnection.down(errConnClosing). So the timing has changed: the custom balancer needs to wait until errConnClosing to unpin an endpoint.

  1. cluster of endpoints A and B
  2. pin A
  3. updateNotifyLoop case upc == nil b.notifyCh <- A
  4. A becomes blackholed
  5. PUT request context times out on A
  6. b.notifyCh <- [] in retry.go to drain connection A
  7. (ac *addrConn).tearDown(errConnDrain) in clientconn.go
  8. handleSubConnStateChange(A,SHUTDOWN)
  9. A connection state changes from READY to SHUTDOWN
  10. updateNotifyLoop case upc == nil b.notifyCh <- A <=== need fix!
  11. down(errConnClosing) on A
  12. unpin A
  13. pin B
  14. lbWatcher RemoveSubConn(B) from step 10 <=== need fix!
  15. updateNotifyLoop case upc == nil b.notifyCh <- B
  16. handleSubConnStateChange(B,SHUTDOWN)
  17. down(errConnClosing) on B
  18. unpin B

So with gRPC v1.7.x, we are notifying A while A is blackholed, after the context timeout but before A gets unpinned. While A is still pinned, if A is notified again, lbWatcher removes the B sub-connection, even though B is the next endpoint to be pinned.

The balancer needs a fix; options include:

  1. make the notify logic aware of endpoint health status (do not send an unhealthy pinned endpoint)
  2. fix the racy notify logic between gRPC and the balancer's Up/down functions
  3. make the drain operation synchronous (when sending [] to the notify channel)

This gRPC change has been making TestBalancerUnderBlackholeNoKeepAlive* fail.
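For context on the terms above (pin, unpin, notify, Up/down), here is a minimal sketch of a single-pinned-endpoint balancer built on the pre-v1.8 grpc.Balancer interface. The pinBalancer type, its fields, and its pin/unpin logic are illustrative only, not the actual clientv3 code:

```go
package sketch

import (
	"context"
	"sync"

	"google.golang.org/grpc"
)

// pinBalancer is a stripped-down illustration of a "pin one endpoint"
// balancer built on the pre-v1.8 grpc.Balancer interface. gRPC dials the
// addresses it reads from Notify(), calls Up when a connection becomes
// READY, and calls the returned down func when that connection goes away
// (tearDown(errConnDrain) in v1.6, SHUTDOWN/errConnClosing in v1.7.x).
type pinBalancer struct {
	mu       sync.Mutex
	eps      []grpc.Address      // all known endpoints (A, B, ...)
	pinned   string              // currently pinned endpoint, "" if none
	notifyCh chan []grpc.Address // address updates consumed by gRPC
}

func newPinBalancer(eps []string) *pinBalancer {
	addrs := make([]grpc.Address, len(eps))
	for i, ep := range eps {
		addrs[i] = grpc.Address{Addr: ep}
	}
	b := &pinBalancer{eps: addrs, notifyCh: make(chan []grpc.Address, 1)}
	b.notifyCh <- addrs // initial notification: let gRPC dial every endpoint
	return b
}

func (b *pinBalancer) Start(target string, cfg grpc.BalancerConfig) error { return nil }

// Up is invoked by gRPC once a connection to addr is READY.
func (b *pinBalancer) Up(addr grpc.Address) func(error) {
	b.mu.Lock()
	if b.pinned == "" {
		b.pinned = addr.Addr // pin the first endpoint that comes up
	}
	b.mu.Unlock()
	return func(err error) { // down: the connection to addr went away
		b.mu.Lock()
		if b.pinned == addr.Addr {
			b.pinned = "" // unpin; a later Up pins another endpoint
		}
		b.mu.Unlock()
	}
}

// Get returns the pinned endpoint for an RPC (the real balancer blocks here
// until an endpoint is pinned; this sketch does not).
func (b *pinBalancer) Get(ctx context.Context, opts grpc.BalancerGetOptions) (grpc.Address, func(), error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return grpc.Address{Addr: b.pinned}, func() {}, nil
}

func (b *pinBalancer) Notify() <-chan []grpc.Address { return b.notifyCh }
func (b *pinBalancer) Close() error                  { return nil }
```

gRPC dials whatever addresses Notify() yields, reports READY connections through Up, and reports lost connections through the returned down func; the v1.6 vs v1.7.x difference above is about when that down func fires.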

gyuho added this to the v3.3.0 milestone Nov 7, 2017
xiang90 (Contributor) commented Nov 7, 2017

It seems like a bug on the gRPC side, or our analysis has an issue.

Basically, after step 15 in 1.7.x, steps 16, 17, and 18 are a side effect of the previous notification of A. But B should eventually be up again, at least.

So what will happen after 18?

gyuho (Contributor, Author) commented Nov 7, 2017

@xiang90

So what will happen after 18?

It starts over by notifying A and B, and pins the blackholed endpoint A.

It seems like a bug on the gRPC side, or our analysis has an issue.

With gRPC v1.6.0, the balancer unpins on tearDown(errConnDrain), which drains the blackholed endpoint right after it is marked unhealthy from context errors.

With gRPC v1.7.2, the balancer waits until the state change is propagated to it and then calls down(errConnClosing). So there's a timing difference (grpc/grpc-go#1649 (comment)). The current balancer implementation does not unpin the blackholed endpoint until errConnClosing, so it's possible that:

  1. blackholed A is pinned
  2. select case upc == nil:
  3. PUT times out
  4. c.balancer.next()
  5. in select case upc == nil:, select case msg := <-b.updateAddrsC
  6. b.notifyAddrs(notifyNext)
  7. b.notifyCh <- []
  8. b.notifyCh <- A,B
  9. again, select case upc == nil: since A is still pinned
  10. b.notifyCh <- A
  11. lbWatcher RemoveSubConn(B) while A is blackholed, and B is the one we want to pin

I will investigate more.
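To make the b.notifyCh <- [] / b.notifyCh <- A,B notation above concrete, the drain boils down to two sends on the notify channel, roughly as below (extending the pinBalancer sketch from the issue description; not the actual retry.go/clientv3 code):

```go
// notifyAddrs is what "b.notifyCh <- []" followed by "b.notifyCh <- A,B"
// refers to above: the empty list makes gRPC tear down every current
// sub-connection, and the full list makes it dial all endpoints again.
func (b *pinBalancer) notifyAddrs() {
	b.notifyCh <- []grpc.Address{} // drain: remove all current sub-connections
	b.mu.Lock()
	addrs := make([]grpc.Address, len(b.eps))
	copy(addrs, b.eps)
	b.mu.Unlock()
	b.notifyCh <- addrs // re-add every endpoint so gRPC can re-dial A and B
}
```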

xiang90 (Contributor) commented Nov 7, 2017

Eventually A should be unpinned due to the drain, and then B should be pinned. We should figure out why this is not happening instead of "fixing" the race.

gyuho (Contributor, Author) commented Nov 7, 2017

eventually A should be unpinned due to the drain

Correct, A gets unpinned.

then B should be pinned

Correct, B gets pinned after A gets unpinned.

The problem is that, afterwards, B gets unpinned because we notify the blackholed A before A is unpinned, and that notification removes sub-connection B. The removal lands after B has been pinned, because there's no synchronization between gRPC's lbWatcher and our updateNotifyLoop. So, B gets pinned but unpinned right away.
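For context, gRPC treats each list it reads from Notify() as the complete set of addresses to keep, and removes any existing sub-connection that is missing from it. A simplified sketch of that reconciliation (not the actual grpc-go lbWatcher code) shows why a stale notification containing only A tears down the freshly pinned B:

```go
// diffNotified mimics how gRPC reconciles a notified address list with the
// sub-connections it already has: new addresses get dialed, and existing
// sub-connections missing from the list get removed. If the balancer sends
// [A] while B is the endpoint that just got pinned, B ends up in toRemove.
func diffNotified(existing map[string]bool, notified []string) (toAdd, toRemove []string) {
	keep := make(map[string]bool, len(notified))
	for _, addr := range notified {
		keep[addr] = true
		if !existing[addr] {
			toAdd = append(toAdd, addr) // dial a new sub-connection
		}
	}
	for addr := range existing {
		if !keep[addr] {
			toRemove = append(toRemove, addr) // RemoveSubConn / tearDown
		}
	}
	return toAdd, toRemove
}

// Example: existing = {A: true, B: true}, notified = [A]  =>  toRemove = [B].
```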

xiang90 (Contributor) commented Nov 7, 2017

So, B gets pinned but unpinned right away.

Will it be pinned again?

gyuho (Contributor, Author) commented Nov 7, 2017

So, B gets pinned but unpinned right away.

Will it be pinned again?

Yes, but not always. The balancer could pin A instead, while it's blackholed (because both A and B are marked unhealthy, so it picks whichever comes up first). And then the test times out.

xiang90 (Contributor) commented Nov 7, 2017

The balancer could pin A instead, while it's blackholed (because both A and B are marked unhealthy, so it picks whichever comes up first). And then the test times out.

But then B should be pinned again, and things should stabilize, right?

gyuho (Contributor, Author) commented Nov 7, 2017

but then B should be pinned again

It was pinning A instead, and TestBalancerUnderBlackholeNoKeepAlive* was failing because of that: the endpoint switch didn't happen within the timeout.

Since this is after B gets unpinned (so marked as unhealthy), there's no guarantee that B will be re-pinned within the timeout.

xiang90 (Contributor) commented Nov 7, 2017

Since this is after B gets unpinned (so marked as unhealthy), there's no guarantee that B will be re-pinned within the timeout.

I am more curious whether B will eventually be pinned regardless of the timeout. If that is the case, then all we need to fight is the timing issue.

gyuho (Contributor, Author) commented Nov 7, 2017

B will eventually be pinned regardless of the timeout

I believe we cannot guarantee that, because both A and B are unhealthy for the same reason (errConnClosing) from the balancer's viewpoint. We notify both A and B to lbWatcher in notifyAddrs. Since there's no prioritization between A and B, in theory it is possible that:

  1. notify A, B
  2. Up A
  3. pin A
  4. A times out
  5. notify []
  6. notify A, B
  7. notify A, since A is still pinned
  8. unpin A
  9. Up B
  10. pin B
  11. tearDown B from step 7
  12. unpin B
  13. repeat

xiang90 (Contributor) commented Nov 7, 2017

@gyuho OK. Let us stop notifying gRPC of the endpoints that are in the unhealthy list. It should solve this issue, and it is in line with what we are going to do in the future too. Sounds reasonable?
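Something along these lines, as a rough sketch extending the pinBalancer example from the issue description (the unhealthy map and the time import are assumptions for illustration, not the actual clientv3 code):

```go
// notifyHealthy sends only the endpoints that are not currently marked
// unhealthy, so gRPC never re-dials (or keeps) a blackholed endpoint.
// unhealthy is assumed to map an endpoint to the time it was marked.
func (b *pinBalancer) notifyHealthy(unhealthy map[string]time.Time) {
	b.mu.Lock()
	var healthy []grpc.Address
	for _, ep := range b.eps {
		if _, bad := unhealthy[ep.Addr]; !bad {
			healthy = append(healthy, ep)
		}
	}
	b.mu.Unlock()
	b.notifyCh <- healthy
}
```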

gyuho (Contributor, Author) commented Nov 9, 2017

@xiang90

Here's the problem when notifying only healthy addresses (even with gRPC v1.6.x):

  1. cluster of A, B, C, where A is the leader
  2. balancer with A, B, C
  3. notify A, B, C (pinned=="")
  4. balancer pins A
  5. stop B(follower)
  6. stop A(leader)
  7. balancer marks A as unhealthy
  8. balancer unpins A
  9. notify B, C (pinned=="") <=== problem!
  10. pin C
  11. linearizable get on C
  12. fails due to "etcdserver: request timed out" (A, B stopped)
  13. balancer marks C as unhealthy
  14. restart A (now A, C are live, B is stopped)
  15. notify [] to drain all current connections (from step 13)
  16. balancer marks C as unhealthy
  17. unpin C
  18. notify B (since A, C are marked as unhealthy) <=== problem!
  19. linearizable get times out

The problem with step 9 is that B was never pinned, so the balancer thinks it's healthy. Even if B had been marked unhealthy, it might have been removed from the unhealthy list after a few seconds.

The problem with step 18 is that it's still notifying B, rather than A and C (since A and C are still marked as unhealthy).
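The "removed from the unhealthy list after a few seconds" part refers to the expiry on unhealthy markings; a rough sketch of such a check (the map, names, and expiry parameter are illustrative, not the actual health_balancer.go code):

```go
// isHealthy treats an endpoint as healthy if it was never marked unhealthy,
// or if its unhealthy mark is older than the expiry window. This is why B,
// which was never pinned (so never marked), looks healthy in step 9, and why
// A and C drop out of the unhealthy set again after a few seconds.
func isHealthy(unhealthy map[string]time.Time, ep string, expiry time.Duration) bool {
	markedAt, bad := unhealthy[ep]
	if !bad {
		return true // never marked unhealthy (e.g. never pinned, like B)
	}
	return time.Since(markedAt) > expiry // old marks expire
}
```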

xiang90 (Contributor) commented Nov 9, 2017

@gyuho this is actually OK. Retrying the linearizable get, or enabling keepalive, should fix it.

gyuho (Contributor, Author) commented Nov 9, 2017

@xiang90

I believe even a retry wouldn't help. In the case above, the balancer is stuck at step 18 (when notifying B). Even after A and C are removed from the unhealthy list, it is still stuck waiting for a new endpoint to come up.

https://github.com/coreos/etcd/blob/05e5b3b62da78d9fddfb89bf852586049b4bae76/clientv3/balancer.go#L228-L236

And I'm not sure how keepalive would help when there's no endpoint to ping.

xiang90 (Contributor) commented Nov 9, 2017

@gyuho well, maybe you accidentally removed the notify loop at https://github.com/coreos/etcd/blob/master/clientv3/health_balancer.go#L153? It is very important: when A and C are back, it makes sure A and C are sent to gRPC again.
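The idea of that loop, as a rough sketch on top of the pinBalancer example from the issue description (the ticker interval and names are illustrative, not the actual health_balancer.go code): while nothing is pinned, keep re-sending the full endpoint list so that once A and C are reachable again, gRPC dials them and one of them gets pinned.

```go
// renotifyLoop keeps offering every endpoint to gRPC while no endpoint is
// pinned, so the client can recover once some endpoint becomes reachable.
func (b *pinBalancer) renotifyLoop(stopc <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stopc:
			return
		case <-ticker.C:
			b.mu.Lock()
			var addrs []grpc.Address
			if b.pinned == "" { // nothing pinned: offer every endpoint again
				addrs = append(addrs, b.eps...)
			}
			b.mu.Unlock()
			if addrs != nil {
				b.notifyCh <- addrs
			}
		}
	}
}
```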

gyuho (Contributor, Author) commented Nov 9, 2017

@xiang90 Good catch!

Indeed, my patch was handling the empty pinned address incorrectly in that code path. Now the test passes. I will run more tests and push a PR.
