Some responses for /apps and /apps/delta contain inconsistent data #1120
We consider invalid responses for apps/delta to be a bug, so we would very much appreciate some feedback :)
Hi @mgtriffid, thanks for the issue and the experiment! The algorithm to compute the full/delta hash code is pretty ancient and I would not be surprised if it has some false signals. In our internal setup (i.e. in the Java EurekaClient) the client just falls back to a full fetch (refresh global state) when it sees a hash mismatch, and this is acceptable to us as we don't look at each individual fetch request, but rather at whether the data is consistent over (a relatively short period of) time. Related, we also run a side app internally that continuously calculates the eventual data consistency between the servers.
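For reference, the fallback described above can be sketched as follows. This is a minimal, hypothetical illustration of the control flow, not the actual client code (the real logic compares the hash the client computes after applying a delta against the hash the server sent with that delta, and falls back to a full registry fetch on mismatch):

```java
// Sketch of the delta-reconcile fallback described above.
// Names here are illustrative; in the real Java client the
// comparison involves Applications.getReconcileHashCode() and
// the hash code carried in the delta response.
public class DeltaReconcile {
    /** Returns true when the local and server hashes disagree. */
    public static boolean needsFullFetch(String localHash, String serverHash) {
        return !localHash.equals(serverHash);
    }

    /** On mismatch, refresh global state with a full fetch instead of trusting the delta. */
    public static String fetchStrategy(String localHash, String serverHash) {
        return needsFullFetch(localHash, serverHash) ? "FULL_FETCH" : "DELTA_OK";
    }
}
```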
Hi @qiangdavidliu, thank you for your response! Could you kindly tell us more about that side app, if that doesn't violate your NDA? :)
PRs are definitely welcome. Internally, we have partitioned our eureka usage into a read/write set that allows the read cluster to autoscale. As for the side server that does data diffing, it is just a simple app that periodically gets data from each eureka write server and publishes some metrics after comparing the data. We use it to alert on data inconsistencies. Nothing fancy.
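A consistency-diffing job like the one described above can be as small as comparing the instance-id sets fetched from each write server and publishing the size of the disagreement as a metric. A dependency-free sketch (all names here are hypothetical; the internal app is not open source):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the data-diffing side app described above:
// pull the set of registered instance ids from two eureka write
// servers and report how many instances the servers disagree on
// (the size of the symmetric difference). That number can then be
// published as a metric and alerted on.
public class RegistryDiff {
    public static int inconsistencyCount(Set<String> serverA, Set<String> serverB) {
        Set<String> onlyA = new HashSet<>(serverA);
        onlyA.removeAll(serverB);        // instances only server A knows about
        Set<String> onlyB = new HashSet<>(serverB);
        onlyB.removeAll(serverA);        // instances only server B knows about
        return onlyA.size() + onlyB.size();
    }
}
```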
@qiangdavidliu We have found that when the Eureka system gets into a "registry inconsistency mode" and many clients start executing full registry fetches, all these service instances exhibit high GC pauses for that time period, probably due to downloading the full registries. This behaviour is definitely scary and we would like to know whether we are operating at the limit of a Eureka system (at least one based on the open-sourced version) or not. We have around 1500 service instances registered in eureka now, and the full registry fetch (along with the metadata we attach) is maybe 100MB in size. Do you feel that this may be a problematic setup? Do you have any information about scaling you can provide?
Hi @pparth, we run eureka at much larger scales than 1500 instances (by an order of magnitude). There are a few things we do differently than the defaults in open source:
@qiangdavidliu Can you provide more details on this? Is this using Eureka2, which introduced separate read/write clusters but never progressed past a release candidate 3+ years ago?
Hi @mattnelson, the readonly server is not related to the Eureka2 work on the 2.x branch. That effort was abandoned for various reasons, one being dependencies on projects that were never fully realised. We do something pretty simple internally: basically we take the same eureka REST resources, put them in front of a guava cache, which in turn fronts a regular eureka client. This app then just registers with eureka as a regular app. (This app's source code is not open source, but it is pretty straightforward.) The eureka transport client config contains some configuration to support such a model. For example, set the following configs:
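The readonly-proxy pattern described above (a cache in front of REST resources, which in turn fronts a regular client) can be sketched without the Guava dependency. Everything below is an illustrative stand-in, not an Eureka API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Dependency-free sketch of the read-only proxy pattern described
// above. Internally a Guava cache fronts a regular eureka client;
// here a TTL map plays the cache and a Supplier plays the client.
public class ReadOnlyRegistryProxy {
    private static class CachedValue {
        final String payload;
        final long fetchedAt;
        CachedValue(String payload, long fetchedAt) {
            this.payload = payload;
            this.fetchedAt = fetchedAt;
        }
    }

    private final Supplier<String> delegate;   // stands in for the backing eureka client
    private final long ttlMillis;
    private final Map<String, CachedValue> cache = new ConcurrentHashMap<>();

    public ReadOnlyRegistryProxy(Supplier<String> delegate, long ttlMillis) {
        this.delegate = delegate;
        this.ttlMillis = ttlMillis;
    }

    /** Serve from cache while fresh; otherwise refresh from the delegate client. */
    public String get(String path) {
        CachedValue v = cache.get(path);
        long now = System.currentTimeMillis();
        if (v == null || now - v.fetchedAt > ttlMillis) {
            v = new CachedValue(delegate.get(), now);
            cache.put(path, v);
        }
        return v.payload;
    }
}
```

Because the proxy only reads, it can be replicated and autoscaled freely while the write cluster stays small.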
@qiangdavidliu Are you using Eureka to store service instance metadata? Do you think that this is a bad practice?
Submitted PR #1126 for this. I was able to easily reproduce inconsistent responses without this fix, and cannot reproduce them at all with it. Please review.
Thanks @mgtriffid. We will take a look.
Closing this issue as the fix has been merged in. @mgtriffid will cut a release as soon as the current round of PRs are resolved. |
Environment:
Eureka 1.9.4
16 clients are talking to a cluster of two Eureka nodes: 8 talk to one node, 8 to the other.
Network is stable and fast.
What I do:
Using a separate program, in an endless loop I do the following:
What I expect:
Clients fetch the delta periodically, and every time they do, the hash string they calculate after updateDelta matches the actual body of the response to the delta request.
What I see instead:
From time to time (quite rarely), all clients talking to one of the nodes log lines like this:
The Reconcile hashcodes do not match, client : STARTING_7_UP_11_, server : STARTING_5_UP_13_. Getting the full registry
I was able to intercept the response for apps/delta (not for the log line above, but at a different moment), and noticed that the numbers in the hash string in that response were actually wrong. The hash string said there were only 9 instances in OUT_OF_SERVICE, but the rest of the response clearly returned 10 instances in OUT_OF_SERVICE.
Have you guys ever noticed such weird behavior?
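For readers unfamiliar with the hash format quoted in the log line above: a reconcile hash string like `STARTING_7_UP_11_` is built from per-status instance counts, sorted by status name and concatenated as `STATUS_count_`. The sketch below reproduces that format for illustration; treat the actual Eureka implementation as the source of truth:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of how a reconcile hash string like "STARTING_7_UP_11_"
// is assembled: per-status instance counts, iterated in sorted
// status order, appended as STATUS_count_. Illustrative only.
public class ReconcileHash {
    public static String hashOf(Map<String, Integer> countsByStatus) {
        StringBuilder sb = new StringBuilder();
        // TreeMap gives a stable, alphabetically sorted iteration order.
        for (Map.Entry<String, Integer> e : new TreeMap<>(countsByStatus).entrySet()) {
            sb.append(e.getKey()).append('_').append(e.getValue()).append('_');
        }
        return sb.toString();
    }
}
```

This explains the mismatch quoted above: `STARTING_7_UP_11_` vs `STARTING_5_UP_13_` means client and server disagree on how many instances are in STARTING and UP.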
Thoughts:
We noticed that AbstractInstanceRegistry uses a ReentrantReadWriteLock (RRWL) to make the apps/delta response as accurate as possible. But it looks like the write lock is released as soon as we return an object of type Applications to ResponseCacheImpl, and that object holds references to mutable instances of InstanceInfo. Is it possible that a mutation occurs after we return Applications to the method that actually calculates the payload?
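If that hypothesis is right, the usual remedy is to snapshot (deep-copy) each mutable record while still holding the lock, so later mutations cannot leak into the serialized payload. A minimal illustration of the defensive-copy idea, with hypothetical stand-in types rather than Eureka's InstanceInfo:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the defensive-copy idea: copy mutable instance
// records while the registry lock is held, so that mutations made
// after the lock is released cannot change the payload that gets
// serialized. Types here are illustrative stand-ins only.
public class SnapshotExample {
    static class Instance {
        String status;            // mutable, like the fields of InstanceInfo
        Instance(String status) { this.status = status; }
        Instance copy() { return new Instance(status); }
    }

    /** Deep-copy the live list; the returned snapshot is immune to later mutations. */
    public static List<Instance> snapshot(List<Instance> live) {
        List<Instance> out = new ArrayList<>();
        for (Instance i : live) {
            out.add(i.copy());    // copy while "holding the lock"
        }
        return out;
    }
}
```

Without the copy, the payload calculation would observe whatever state the shared InstanceInfo objects happen to be in at serialization time, which matches the off-by-one counts observed above.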