Some responses for /apps and /apps/delta contain inconsistent data #1120

Closed

mgtriffid opened this issue Sep 4, 2018 · 12 comments
@mgtriffid (Contributor) commented Sep 4, 2018

Environment:
Eureka 1.9.4
16 clients are talking to a cluster of two Eureka nodes: 8 talk to one node, 8 to the other.
Network is stable and fast.
What I do:
Using a separate program, I do the following in an endless loop:

  1. For each service set status to UP via REST API
  2. For each service set status to STARTING via REST API
  3. For each service set status to OUT_OF_SERVICE via REST API

What I expect:
Clients fetch the delta periodically, and every time they do, the hash string they calculate after updateDelta matches the actual body of the delta response.
What I see instead:
From time to time (quite rarely), all clients talking to one of the nodes log lines like this:
The Reconcile hashcodes do not match, client : STARTING_7_UP_11_, server : STARTING_5_UP_13_. Getting the full registry
I was able to intercept a response for apps/delta (not the one behind the log line above, but at a different moment) and noticed that the numbers in that response's hash string were actually wrong. The hash string said there were only 9 instances in OUT_OF_SERVICE, but the rest of the response clearly contained 10 instances in OUT_OF_SERVICE.
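
For context, the hash string in that log line is just the per-status instance counts concatenated in a fixed order; a rough sketch of the idea (not the actual Eureka implementation) looks like this:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Rough sketch of how a reconcile-style hash such as "STARTING_7_UP_11_" is
// derived: count instances per status and concatenate "<STATUS>_<count>_"
// pairs in a stable (sorted) order. Not the actual Eureka implementation.
public class ReconcileHashSketch {

    static String hashOf(List<String> instanceStatuses) {
        Map<String, Integer> countsByStatus = new TreeMap<>();
        for (String status : instanceStatuses) {
            countsByStatus.merge(status, 1, Integer::sum);
        }
        StringBuilder hash = new StringBuilder();
        for (Map.Entry<String, Integer> e : countsByStatus.entrySet()) {
            hash.append(e.getKey()).append('_').append(e.getValue()).append('_');
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        List<String> statuses = new ArrayList<>();
        statuses.addAll(Collections.nCopies(7, "STARTING"));
        statuses.addAll(Collections.nCopies(11, "UP"));
        System.out.println(hashOf(statuses)); // prints STARTING_7_UP_11_
    }
}

A mismatch then just means the string the client recomputes from its local registry after applying the delta differs from the one the server put in the response.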

Have you guys ever noticed such weird behavior?

Thoughts:
We noticed that AbstractInstanceRegistry uses a ReentrantReadWriteLock to make the apps/delta response as accurate as possible. But it looks like the write lock is released as soon as an object of type Applications is returned to ResponseCacheImpl, and that object holds references to mutable InstanceInfo instances. Is it possible that a mutation occurs after Applications is returned but before the method that actually calculates the payload runs?
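
To make the suspicion concrete, here is a hypothetical sketch (not the real AbstractInstanceRegistry) of how returning live references from under the lock can let the hash and the body disagree:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the suspected race. The snapshot is built under the
// write lock, but it still references mutable instance objects, so a status
// change that happens after unlock() can land between "compute hash string"
// and "serialize response body".
class MutableInstance {
    volatile String status; // e.g. "UP", "STARTING", "OUT_OF_SERVICE"
}

class RegistrySketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<MutableInstance> instances = new ArrayList<>();

    List<MutableInstance> getDeltaSnapshot() {
        lock.writeLock().lock();
        try {
            return new ArrayList<>(instances); // copies the list, not the instances
        } finally {
            lock.writeLock().unlock(); // <-- after this, the instances may still mutate
        }
    }
}

If the response cache computes the hash string and serializes the body from such a snapshot outside the lock, a single concurrent status change is enough to make them disagree, which would explain the "9 vs 10 OUT_OF_SERVICE" observation above.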

@mgtriffid (Contributor, Author)

We consider invalid responses for apps/delta to be a bug, so we would very much appreciate some feedback :)

@qiangdavidliu (Contributor) commented Sep 10, 2018

Hi @mgtriffid, thanks for the issue and the experiment! The algorithm that computes the full/delta hash code is pretty ancient and I would not be surprised if it gives some false signals. In our internal setup (i.e. in the Java EurekaClient), the client just falls back to a full fetch (refreshing global state) when it sees a hash mismatch, and this is acceptable to us since we don't look at each individual fetch request, but rather at whether the data is consistent over a (relatively short) period of time.

Related, we also run a side app internally that continuously calculates the eventual data consistency between the servers.

@mgtriffid (Contributor, Author)

Hi @qiangdavidliu, thank you for your response!
We are also using Eureka's Java client, and the fallback to a full fetch works just fine. The problem is that when Eureka's response contradicts itself (the hash string doesn't match the delta), then, because of Eureka's response caching, many clients receive that same response, fall back to a full fetch, and put significant load on the Eureka server. So we'll try moving the serialization of /apps and /apps/delta under the write lock, see if it helps, and provide a PR.
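
Roughly the direction we have in mind (hypothetical names, just the shape of the idea, not the eventual PR diff): produce the hash string and the serialized body while still holding the lock, so both come from one consistent snapshot.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative only: hash and body are both derived inside the locked region,
// so later mutations of instance state cannot make them contradict each other.
class DeltaPayloadSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> instanceStatuses = new ArrayList<>();

    byte[] serializeDeltaUnderLock() {
        lock.writeLock().lock();
        try {
            String hash = hashOf(instanceStatuses);           // e.g. "STARTING_7_UP_11_"
            String body = String.join(",", instanceStatuses); // stands in for the JSON/XML encoding
            return (hash + "\n" + body).getBytes(StandardCharsets.UTF_8);
        } finally {
            lock.writeLock().unlock();
        }
    }

    private static String hashOf(List<String> statuses) {
        Map<String, Integer> counts = new TreeMap<>();
        statuses.forEach(s -> counts.merge(s, 1, Integer::sum));
        StringBuilder sb = new StringBuilder();
        counts.forEach((s, c) -> sb.append(s).append('_').append(c).append('_'));
        return sb.toString();
    }
}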

Could you kindly tell us more about that side app, if that doesn't violate your NDA? :)

@qiangdavidliu (Contributor)

PRs are definitely welcome. Internally, we have partitioned our eureka usage into a read/write split that allows the read cluster to autoscale.

As for the side server that does the data diffing, it is just a simple app that periodically gets the data from each eureka write server and publishes some metrics after comparing it. We use it to alert on data inconsistencies. Nothing fancy.

@pparth commented Sep 26, 2018

@qiangdavidliu We have found that when the Eureka system gets into a "registry inconsistency mode" and many clients start executing full registry fetches, all those service instances exhibit high GC pauses for that period, probably due to downloading the full registry. This behaviour is definitely scary, and we would like to know whether we are operating at the limit of a Eureka system (at least one based on the open-sourced version). We have around 1500 service instances registered in eureka now, and a full registry fetch (along with the metadata we attach) is maybe 100 MB in size. Do you feel this may be a problematic setup? Do you have any information about scaling you can provide?

@qiangdavidliu (Contributor)

Hi @pparth, we run eureka at a much larger scale than 1500 instances (an order of magnitude more). There are a few things we do differently from the open-source defaults:

  1. we use G1GC, which has given us much better GC characteristics than CMS (an illustrative set of flags is shown after this list)
  2. we run an autoscalable read-only tier in front of our fixed-host write cluster that takes all the read traffic.
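
For reference, switching to G1 is a single JVM flag; the pause target and heap sizes below are placeholders for illustration, not our actual settings:

-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xms8g -Xmx8g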

@mattnelson (Contributor)

2. we run an autoscalable read-only tier in front of our fixed-host write cluster that takes all the read traffic.

@qiangdavidliu Can you provide more details on this? Is this using Eureka 2, which introduced separate read/write clusters but never progressed past a release candidate 3+ years ago?

@qiangdavidliu (Contributor) commented Sep 27, 2018

Hi @mattnelson, the readonly server is not related to the Eureka 2 work on the 2.x branch. That effort was abandoned for various reasons, one being dependencies on projects that were never fully realised.

We do something pretty simple internally: we take the same eureka REST resources, put them in front of a guava cache, which in turn fronts a regular eureka client. This app then just registers with eureka as a regular app. (Its source code is not open source, but it is pretty straightforward.)
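
A bare-bones sketch of that shape, assuming Guava and the standard EurekaClient interface (class names and wiring here are made up for illustration, not our internal code):

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.netflix.discovery.EurekaClient;
import com.netflix.discovery.shared.Applications;
import java.util.concurrent.TimeUnit;

// Read-only tier sketch: registry reads are served from a Guava cache that is
// populated by a regular EurekaClient polling the write cluster. The REST/JAX-RS
// endpoint wiring is omitted.
public class ReadOnlyRegistryFacade {

    private static final String FULL_REGISTRY_KEY = "ALL_APPS";

    private final LoadingCache<String, Applications> cache;

    public ReadOnlyRegistryFacade(EurekaClient eurekaClient) {
        this.cache = CacheBuilder.newBuilder()
                .expireAfterWrite(30, TimeUnit.SECONDS) // staleness bound is a guess
                .build(new CacheLoader<String, Applications>() {
                    @Override
                    public Applications load(String key) {
                        // delegate to the regular eureka client, which polls the
                        // write cluster like any other client
                        return eurekaClient.getApplications();
                    }
                });
    }

    /** What a GET /eureka/v2/apps handler on the read-only tier would serve. */
    public Applications fullRegistry() {
        return cache.getUnchecked(FULL_REGISTRY_KEY);
    }
}

Clients are then pointed at this tier for queries via the transport config below.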

The eureka transport client config contains some configuration to support such a model.

For example, set the following configs:

# false == talk to readonly for query
eureka.transport.useBootstrapResolverForQuery=false

# the vipAddress for the readonly cluster
eureka.transport.readClusterVip={some vip you've defined for the read only servers}

# the vipAddress for the write cluster
eureka.transport.writeClusterVip={the vip address the write cluster register with}

@pparth commented Sep 27, 2018

@qiangdavidliu Are you using Eureka to store service instance metadata? Do you think that is a bad practice?
If the size of your registry is orders of magnitude larger, how do you handle full registry fetches? When such fetches happen, they must download tons of data, no?

@mgtriffid (Contributor, Author)

Submitted PR #1126 for this. I was able to easily reproduce the inconsistent responses without the fix, and cannot reproduce them at all with it. Please review.

@qiangdavidliu (Contributor)

Thanks @mgtriffid . We will take a look.

@qiangdavidliu (Contributor)

Closing this issue as the fix has been merged in. @mgtriffid, a release will be cut as soon as the current round of PRs is resolved.
