Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the inconsistency logic in method updateRenewalThreshold(). #1071

Merged
merged 4 commits into from
Jun 25, 2018

Conversation

neoremind
Copy link
Contributor

Fix the inconsistency logic in method updateRenewalThreshold(). Because it does not reflect the expected intention based on the comment.
updateRenewalThreshold will be qualified to run if renewal count * 2 is bigger than

serverConfig.getRenewalPercentThreshold() * expectedNumberOfRenewsPerMin

NOT

serverConfig.getRenewalPercentThreshold() * numberOfRenewsPerMinThreshold

…intention based on the comment in the method of updateRenewalThreshold().
@holy12345
Copy link
Contributor

@neoremind
Hi First of all I think you logic is right : )
In fact this Issues i have already submit a PR, but the author think i should add some test code. But I not find a best way to test this code , so I close the PR(#1020) If you find the best way to test the code, please tell me.

thanks

@neoremind
Copy link
Contributor Author

@holy12345 thanks for your reply. I am working on adding test case for the scenario, but before that I got one important question to confirm #1073. Feel free to add comments.

@neoremind
Copy link
Contributor Author

neoremind commented May 16, 2018

@holy12345 @qiangdavidliu
I have added test case for this scenario, please review my code. Many thanks.

I will clarify briefly about what I have done. I create a new testcase named TimeConsumingInstanceRegistryTest.java which will run a bunch of sequential events in 2 minutes. I have added detailed comments about the steps that will be executed.
Here I will paste them out for you to take a look.

Verify the following behaviors, the test case will run for 2 minutes.
1. Registration of new instances.
2. Lease expiration.
3. NumOfRenewsPerMinThreshold will be updated. Since this threshold will be updated according to EurekaServerConfig.getRenewalThresholdUpdateIntervalMs(), and discovery client will try to get applications count from peer or remote Eureka servers, the count number will be used to update threshold.
Below shows the time line of a bunch of events in 120 seconds. Here the following setting are configured during registry startup: eureka.renewalThresholdUpdateIntervalMs=5000, eureka.evictionIntervalTimerInMs=10000
       TimeInSecs 0          15         30    40   45         60        75   80      90         105       120
                  |----------|----------|------|----|----------|---------|----|-------|----------|---------|
       Events    (1)        (2)               (3)  (4)        (5)       (6)  (7)                (8)       (9)
       
(1). Remote server started on random port, local registry started as well. 50 instances will be registered to local registry with application name of LOCAL_REGION_APP_NAME and lease duration set to 30 seconds. At this time isLeaseExpirationEnabled=false, getNumOfRenewsPerMinThreshold= (50*2 + 1(initial value))*85%=86
(2). 45 out of the 50 instances send heartbeats to local registry.
(3). Check registry status, isLeaseExpirationEnabled=false, getNumOfRenewsInLastMin=0, getNumOfRenewsPerMinThreshold=86, registeredInstancesNumberOfMYLOCALAPP=50
(4). 45 out of the 50 instances send heartbeats to local registry.
(5). Accumulate one minutes data, and from now on, isLeaseExpirationEnabled=true, getNumOfRenewsInLastMin=90. Because lease expiration is enabled, and lease for 5 instance are expired, so evict 5 instances.
(6). 45 out of the 50 instances send heartbeats to local registry.
(7). Check registry status, isLeaseExpirationEnabled=true, getNumOfRenewsInLastMin=90, getNumOfRenewsPerMinThreshold=86, registeredInstancesNumberOfMYLOCALAPP=45
Remote region add another 150 instances to application of LOCAL_REGION_APP_NAME. This will make PeerAwareInstanceRegistryImpl.updateRenewalThreshold() to refresh AbstractInstanceRegistry.numberOfRenewsPerMinThreshold.
(8). 45 out of the 50 instances send heartbeats to local registry.
(9). Check registry status, isLeaseExpirationEnabled=false, getNumOfRenewsInLastMin=90, getNumOfRenewsPerMinThreshold=256, registeredInstancesNumberOfMYLOCALAPP=45
Note that there is a thread retrieving and printing out registry status for debugging purpose.

After reviewing the history of method updateRenewalThreshold(). I assume the logic is for the following scenario.
Assuming there are 2 servers, server-A and server-B, each has total 100 heartbeats expected, and expiration threshold is 85. After sometime, server-A becomes an isolated island, losting connection with all clients and peer server-B. Then server-A enables preservation mode to avoid evict many instances. After a while when connection comes back to normal, it finds there are total 200 heartbeats expected from server-B which satisfy the condition of updateRenewalThreshold(), if the count is bigger than 85% * numberOfRenewsPerMinThreshold, then server-A will update its renewal threshold to 170 as what server-B does. And after a while with new heartbeats to server-B and replicating to server-A, server-A will catch up and eventually the two servers are consistent.
@qiangdavidliu could you help me review the case? or if you have more insights. Thanks!

@neoremind
Copy link
Contributor Author

@qiangdavidliu would you please review this PR? I suppose you are busy working, but still really appreciate if you could take a look. many thanks. :-)

@qiangdavidliu
Copy link
Contributor

Thanks @neoremind . I'm a little busy right now, but I'll try to take a look at this when I can.

@neoremind
Copy link
Contributor Author

@qiangdavidliu Thanks very much! Take your time.

@holy12345
Copy link
Contributor

I want to more talk about this issue. In my mind, if client has not send heartbeat,eureka server will remove it. but during remove it , Eureka server why not update those number .(Eureka server use a background task do update two numbers).

I understand this is Eureka architect, but can you explain more why design like this.

I will appreciate you answer.

Thanks

@neoremind
Copy link
Contributor Author

neoremind commented May 25, 2018

@holy12345 As far as I know, Eureka will update numberOfRenewsPerMinThreshold in 4 cases:

  • Register new instance.
  • Cancel one instance proactively.
  • Eureka server opening for traffic and this value is the number of instances replicated from peer.
  • In updateRenewalThreshold() method where it will be updated in 15mins in background.

But in expired cases, the value will not be updated since self-preservation relies on the value to determine if heartbeats drop below 85% of the expected number, so Eureka should not update this value when removing expired instances.

@holy12345
Copy link
Contributor

Thanks for you answer, but in my mind evict method just remove client information, but not update those two numbers Immediately. Update is on the background task.
Why not update immediately when evict method remove client information.

@holy12345
Copy link
Contributor

OK I get it.

But in expired cases, the value will not be updated since self-preservation relies on the value to determine if heartbeats drop below 85% of the expected number, so Eureka should not update this value when removing expired instances.

is the reason.

@neoremind
Copy link
Contributor Author

@holy12345 you got it :)

@HikariShine
Copy link

Oh good. Thank you!
Three days ago i find this logic, but i cann't understand it. I think it is a bug, because it seems all situdation will trigger this update, but i have no confidence.
Then i use one day to read the relative logic and one day to search in baidu.
At the third day, i found you pull request, it's a big help for me to understand the logic, thank you!

@neoremind
Copy link
Contributor Author

@qiangdavidliu Sorry to bring you back, would you please review this PR and test case? I think this could help a lot on the core logic verification. Many thanks!

@holy12345
Copy link
Contributor

All guys
I want to know more about this, there is issues also in Netflix production? I agree with @neoremind (its a core logic). so cloud you please tell more things about Netflix how to use eureka in production

Great thanks @qiangdavidliu

@qiangdavidliu
Copy link
Contributor

@neoremind @holy12345 apologies on not being able to take a look at this, we have been swamped and have not had the bandwidth to work on this.

To answer your question about how we are using this, we are using the existing core logic as is, and have not had any issues with it as far as we can remember.

@mattnelson
Copy link
Contributor

I've seen a situation similar to this in which we had the renewal threshold at 85%. We took down an AZ which resulted in roughly 33% of the registrations disappearing, since this was higher than the threshold, the servers entered self preservation mode and were not able to exit this mode since the update threshold would not update as the services were rescheduled on the remaining AZs.

@neoremind
Copy link
Contributor Author

@mattnelson In normal case, updateRenewalThreshold will only be updated when adding or dropping instances, if self perseveration is enabled, then if the heartbeats from instances drop below 85% of the expected value, eureka server stops to expire instance and will not quit unless the lost instances are back online to register/renew with eureka server.
Eureka will also try to get applications from peers, if the count is bigger than the threshold, it will update the updateRenewalThreshold of itself, this is what this PR works on to adjust and clarify.
In my opinion, if Eureka enters self preservation mode, and you deliberately offline some servers, Eureka will not quit.

@neoremind
Copy link
Contributor Author

neoremind commented Jun 23, 2018

@qiangdavidliu Understand the situation that you are overwhelmed, Netflix grows so fast 👍 If this logic seems unnecessary, how about only adding the test case to eureka?

@qiangdavidliu
Copy link
Contributor

@neoremind have looked over your changes over the weekend, these look good, and I very much appreciate the test cases attached.

I'll merge this and cut a new release.

@qiangdavidliu qiangdavidliu merged commit 07876a7 into Netflix:master Jun 25, 2018
@neoremind
Copy link
Contributor Author

@qiangdavidliu many thanks! Appreciate your help! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants