Some responses for /apps and /apps/delta contain inconsistent data #1120

Closed

mgtriffid opened this issue Sep 4, 2018 · 12 comments
@mgtriffid (Contributor) commented Sep 4, 2018

Environment:
Eureka 1.9.4
16 clients are talking to a cluster of two Eureka nodes: 8 talk to one node, 8 to the other.
Network is stable and fast.
What I do:
Using a separate program, I do the following in an endless loop:

  1. For each service set status to UP via REST API
  2. For each service set status to STARTING via REST API
  3. For each service set status to OUT_OF_SERVICE via REST API

What I expect:
Clients fetch the delta periodically, and every time they do, the hash string they calculate after updateDelta matches the actual body of the delta response.
What I see instead:
From time to time (quite rarely), all clients talking to one of the nodes log lines like this:
The Reconcile hashcodes do not match, client : STARTING_7_UP_11_, server : STARTING_5_UP_13_. Getting the full registry
I was able to intercept a response for apps/delta (not the one behind the log line above, but at a different moment) and noticed that the numbers in that response's hash string were actually wrong. The hash string said there were only 9 instances in OUT_OF_SERVICE, but the rest of the response clearly contained 10 instances in OUT_OF_SERVICE.
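
For context, the hash string in that log line is just the per-status instance counts concatenated in a fixed order; a rough sketch of the idea (not the actual Eureka implementation) looks like this:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Rough sketch of how a reconcile-style hash such as "STARTING_7_UP_11_" is
// derived: count instances per status and concatenate "<STATUS>_<count>_"
// pairs in a stable (sorted) order. Not the actual Eureka implementation.
public class ReconcileHashSketch {

    static String hashOf(List<String> instanceStatuses) {
        Map<String, Integer> countsByStatus = new TreeMap<>();
        for (String status : instanceStatuses) {
            countsByStatus.merge(status, 1, Integer::sum);
        }
        StringBuilder hash = new StringBuilder();
        for (Map.Entry<String, Integer> e : countsByStatus.entrySet()) {
            hash.append(e.getKey()).append('_').append(e.getValue()).append('_');
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        List<String> statuses = new ArrayList<>();
        statuses.addAll(Collections.nCopies(7, "STARTING"));
        statuses.addAll(Collections.nCopies(11, "UP"));
        System.out.println(hashOf(statuses)); // prints STARTING_7_UP_11_
    }
}

A mismatch then just means the string the client recomputes from its local registry after applying the delta differs from the one the server put in the response.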

Have you guys ever noticed such weird behavior?

Thoughts:
We noticed that AbstractInstanceRegistry uses a ReentrantReadWriteLock to make the apps/delta response as accurate as possible. But it looks like the write lock is released as soon as an object of type Applications is returned to ResponseCacheImpl, and that object holds references to mutable InstanceInfo instances. Is it possible that a mutation occurs after Applications is returned but before the method that actually calculates the payload runs?
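
To make the suspicion concrete, here is a hypothetical sketch (not the real AbstractInstanceRegistry) of how returning live references from under the lock can let the hash and the body disagree:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the suspected race. The snapshot is built under the
// write lock, but it still references mutable instance objects, so a status
// change that happens after unlock() can land between "compute hash string"
// and "serialize response body".
class MutableInstance {
    volatile String status; // e.g. "UP", "STARTING", "OUT_OF_SERVICE"
}

class RegistrySketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<MutableInstance> instances = new ArrayList<>();

    List<MutableInstance> getDeltaSnapshot() {
        lock.writeLock().lock();
        try {
            return new ArrayList<>(instances); // copies the list, not the instances
        } finally {
            lock.writeLock().unlock(); // <-- after this, the instances may still mutate
        }
    }
}

If the response cache computes the hash string and serializes the body from such a snapshot outside the lock, a single concurrent status change is enough to make them disagree, which would explain the "9 vs 10 OUT_OF_SERVICE" observation above.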

@mgtriffid (Contributor, Author)

We consider invalid responses for apps/delta to be a bug, so we would very much appreciate some feedback :)

@qiangdavidliu (Contributor) commented Sep 10, 2018

Hi @mgtriffid, thanks for the issue and the experiment! The algorithm that computes the full/delta hash code is pretty ancient and I would not be surprised if it gives some false signals. In our internal setup (i.e. in the Java EurekaClient), the client just falls back to a full fetch (refreshing global state) when it sees a hash mismatch, and this is acceptable to us since we don't look at each individual fetch request, but rather at whether the data is consistent over a (relatively short) period of time.

Related, we also run a side app internally that continuously calculates the eventual data consistency between the servers.

@mgtriffid (Contributor, Author)

Hi @qiangdavidliu, thank you for your response!
We are also using Eureka's Java client, and the fallback to a full fetch works just fine. The problem is that when Eureka's response contradicts itself (the hash string doesn't match the delta), then, because of Eureka's response caching, many clients receive that same response, fall back to a full fetch, and put significant load on the Eureka server. So we'll try moving the serialization of /apps and /apps/delta under the write lock, see if it helps, and provide a PR.
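
Roughly the direction we have in mind (hypothetical names, just the shape of the idea, not the eventual PR diff): produce the hash string and the serialized body while still holding the lock, so both come from one consistent snapshot.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative only: hash and body are both derived inside the locked region,
// so later mutations of instance state cannot make them contradict each other.
class DeltaPayloadSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> instanceStatuses = new ArrayList<>();

    byte[] serializeDeltaUnderLock() {
        lock.writeLock().lock();
        try {
            String hash = hashOf(instanceStatuses);           // e.g. "STARTING_7_UP_11_"
            String body = String.join(",", instanceStatuses); // stands in for the JSON/XML encoding
            return (hash + "\n" + body).getBytes(StandardCharsets.UTF_8);
        } finally {
            lock.writeLock().unlock();
        }
    }

    private static String hashOf(List<String> statuses) {
        Map<String, Integer> counts = new TreeMap<>();
        statuses.forEach(s -> counts.merge(s, 1, Integer::sum));
        StringBuilder sb = new StringBuilder();
        counts.forEach((s, c) -> sb.append(s).append('_').append(c).append('_'));
        return sb.toString();
    }
}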

Could you kindly tell us more about that side app, if that doesn't violate your NDA? :)

@qiangdavidliu (Contributor)

PRs are definitely welcome. Internally, we have partitioned our eureka usage into a read/write split that allows the read cluster to autoscale.

As for the side server that does the data diffing, it is just a simple app that periodically gets the data from each eureka write server and publishes some metrics after comparing it. We use it to alert on data inconsistencies. Nothing fancy.

@pparth commented Sep 26, 2018

@qiangdavidliu We have found that when the Eureka system gets into a "registry inconsistency mode" and many clients start executing full registry fetches, all those service instances exhibit high GC pauses for that period, probably due to downloading the full registry. This behaviour is definitely scary, and we would like to know whether we are operating at the limit of a Eureka system (at least one based on the open-sourced version). We have around 1500 service instances registered in eureka now, and a full registry fetch (along with the metadata we attach) is maybe 100 MB in size. Do you feel this may be a problematic setup? Do you have any information about scaling you can provide?

@qiangdavidliu (Contributor)

Hi @pparth, we run eureka at a much larger scale than 1500 instances (an order of magnitude more). There are a few things we do differently from the open-source defaults:

  1. we use G1GC, which has given us much better GC characteristics than CMS (an illustrative set of flags is shown after this list)
  2. we run an autoscalable read-only tier in front of our fixed-host write cluster that takes all the read traffic.
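
For reference, switching to G1 is a single JVM flag; the pause target and heap sizes below are placeholders for illustration, not our actual settings:

-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xms8g -Xmx8g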

@mattnelson (Contributor)

2. we run an autoscalable read-only tier in front of our fixed-host write cluster that takes all the read traffic.

@qiangdavidliu Can you provide more details on this? Is this using Eureka 2, which introduced separate read/write clusters but never progressed past a release candidate 3+ years ago?

@qiangdavidliu (Contributor) commented Sep 27, 2018

Hi @mattnelson, the readonly server is not related to the Eureka 2 work on the 2.x branch. That effort was abandoned for various reasons, one being dependencies on projects that were never fully realised.

We do something pretty simple internally: we take the same eureka REST resources, put them in front of a guava cache, which in turn fronts a regular eureka client. This app then just registers with eureka as a regular app. (Its source code is not open source, but it is pretty straightforward.)
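
A bare-bones sketch of that shape, assuming Guava and the standard EurekaClient interface (class names and wiring here are made up for illustration, not our internal code):

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.netflix.discovery.EurekaClient;
import com.netflix.discovery.shared.Applications;
import java.util.concurrent.TimeUnit;

// Read-only tier sketch: registry reads are served from a Guava cache that is
// populated by a regular EurekaClient polling the write cluster. The REST/JAX-RS
// endpoint wiring is omitted.
public class ReadOnlyRegistryFacade {

    private static final String FULL_REGISTRY_KEY = "ALL_APPS";

    private final LoadingCache<String, Applications> cache;

    public ReadOnlyRegistryFacade(EurekaClient eurekaClient) {
        this.cache = CacheBuilder.newBuilder()
                .expireAfterWrite(30, TimeUnit.SECONDS) // staleness bound is a guess
                .build(new CacheLoader<String, Applications>() {
                    @Override
                    public Applications load(String key) {
                        // delegate to the regular eureka client, which polls the
                        // write cluster like any other client
                        return eurekaClient.getApplications();
                    }
                });
    }

    /** What a GET /eureka/v2/apps handler on the read-only tier would serve. */
    public Applications fullRegistry() {
        return cache.getUnchecked(FULL_REGISTRY_KEY);
    }
}

Clients are then pointed at this tier for queries via the transport config below.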

The eureka transport client config contains some configuration to support such a model.

For example, set the following configs:

# false == talk to readonly for query
eureka.transport.useBootstrapResolverForQuery=false

# the vipAddress for the readonly cluster
eureka.transport.readClusterVip={some vip you've defined for the read only servers}

# the vipAddress for the write cluster
eureka.transport.writeClusterVip={the vip address the write cluster register with}

@pparth commented Sep 27, 2018

@qiangdavidliu Are you using Eureka to store service instance metadata? Do you think that is a bad practice?
If the size of your registry is orders of magnitude larger, how do you handle full registry fetches? When such fetches happen, they must download tons of data, no?

@mgtriffid (Contributor, Author)

Submitted PR #1126 for this. I was able to easily reproduce the inconsistent responses without the fix, and cannot reproduce them at all with it. Please review.

@qiangdavidliu (Contributor)

Thanks @mgtriffid . We will take a look.

@qiangdavidliu (Contributor)

Closing this issue as the fix has been merged in. @mgtriffid, a release will be cut as soon as the current round of PRs is resolved.
