
sd/eureka: tests are incredibly flaky on CI #545

Closed · peterbourgon opened this issue Jun 6, 2017 · 18 comments

@peterbourgon (Member)

Cmd-F for FAIL: on each of these:

I can't reproduce the failures locally. I suspect it's a nondeterminism/timing issue related to the underlying fargo library. It started happening after we merged the new SD PR, but I'm not sure that was the cause so much as the trigger. Could someone with more domain knowledge take a look?

@yurishkuro @martinbaillie @seh

@yurishkuro (Contributor)

I'll take a look. Interestingly, it's the unit test that fails, not the integration tests, though the latter could use some improvement too, since they rely on unconditional sleep().

@seh (Contributor) commented Jun 6, 2017

I finally got CircleCI to display the failure, so I'm looking into it now too.

@seh (Contributor) commented Jun 6, 2017

The most mystifying part I've found so far is why this unit test expects only one instance to be present. It looks to me like expecting two is more reasonable, given that the application has two instances.

@seh (Contributor) commented Jun 6, 2017

@yurishkuro, it looks like the expectation of one instance is a holdover from commit 432e292. The test could pass mistakenly if it retrieves the cache's state before the goroutine feeding updates to the cache gets a chance to run for the first time. That's why there's an "await" parameter for fargo.ScheduleAppUpdates: to allow for coordination like this.
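
For illustration, here's a minimal sketch of the coordination that await enables, assuming fargo's ScheduleAppUpdates(name string, await bool, done <-chan struct{}) <-chan fargo.AppUpdate signature; conn and appName are stand-ins for the test fixtures:

```go
// Sketch only; "conn" and "appName" are invented stand-ins.
done := make(chan struct{})
defer close(done) // stops the update schedule when we're finished

// With await=true, fargo delivers the first update eagerly rather than
// waiting for the first polling interval to elapse.
updates := conn.ScheduleAppUpdates(appName, true, done)

// Receiving once here is the coordination point: after this, the cache
// can be primed before anyone probes it.
first := <-updates
if first.Err != nil {
	// the initial fetch failed; handle or report it
}
_ = first.App // the application's current set of instances
```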

@yurishkuro (Contributor)

@seh Unless there is an error retrieving the instances in NewInstancer(), the state should be initialized before NewInstancer returns.

@yurishkuro (Contributor)

But the following update could add two instances instead of the one given to the connection at construction, and that's certainly a race condition: the instancer could observe either of those two states.

@seh (Contributor) commented Jun 6, 2017

eureka.NewInstancer starts a goroutine here. That goroutine may not get a chance to run before NewInstancer completes and the test goes on to probe the cache.
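
To make the interleaving concrete, here's a stripped-down, runnable illustration; the types and names are invented and only stand in for the real Instancer and connection:

```go
package main

import (
	"fmt"
	"sync"
)

// instancer and newInstancer are invented stand-ins that illustrate
// the interleaving; they are not go-kit's actual code.
type instancer struct {
	mu    sync.Mutex
	state int
}

func newInstancer(fetch func() int) *instancer {
	i := &instancer{state: fetch()} // synchronous initial fetch
	go func() {
		// This goroutine may or may not run before the caller's next line.
		next := fetch()
		i.mu.Lock()
		i.state = next
		i.mu.Unlock()
	}()
	return i
}

func main() {
	var (
		mu    sync.Mutex
		calls int
	)
	fetch := func() int {
		mu.Lock()
		defer mu.Unlock()
		calls++
		return calls // 1 instance on the first call, 2 on the second
	}
	i := newInstancer(fetch)
	i.mu.Lock()
	fmt.Println(i.state) // prints 1 or 2 depending on scheduling
	i.mu.Unlock()
}
```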

@seh (Contributor) commented Jun 6, 2017

I could see adding a means for eureka.testConnection to wait until its first update has been delivered via (*testConnection).ScheduleAppUpdates. We'd have to add a mutex and a boolean field.
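
That could look something like the following sketch; the field and method names are hypothetical, and a channel closed under the mutex-guarded boolean gives waiters something to block on:

```go
import "sync"

// Hypothetical additions to the test double; none of these names exist
// in the package today.
type testConnection struct {
	mu        sync.Mutex
	delivered bool          // guarded by mu; true once the first update is sent
	firstc    chan struct{} // closed exactly once, when delivered flips to true
}

func newTestConnection() *testConnection {
	return &testConnection{firstc: make(chan struct{})}
}

// markFirstDelivery would be called from (*testConnection).ScheduleAppUpdates
// after it sends its first update.
func (c *testConnection) markFirstDelivery() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.delivered {
		c.delivered = true
		close(c.firstc)
	}
}

// awaitFirstDelivery blocks the caller until the first update has been delivered.
func (c *testConnection) awaitFirstDelivery() {
	<-c.firstc
}
```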

@seh (Contributor) commented Jun 6, 2017

If you're amenable, I can submit a proposal, but I know that this is mostly your code, so I understand if you'd like to fix it yourself. Please let me know how to proceed.

@yurishkuro (Contributor)

Yeah. Strictly speaking, the main test should be checking two conditions: that the Instancer gets the initial state when it's constructed, and that it gets an update once something changes in the discovery system.
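
In test form, those two conditions might look roughly like this; the helpers (newTestConnection, addInstance, awaitNextDelivery) and the state() accessor are hypothetical:

```go
// Hypothetical shape of the two-phase test; helper names are invented.
func TestInstancerInitialStateAndUpdate(t *testing.T) {
	conn := newTestConnection() // registers one instance up front
	inst := NewInstancer(conn, appName, log.NewNopLogger())
	defer inst.Stop()

	// Condition 1: the initial discovery state is visible as soon as
	// the constructor returns.
	if want, have := 1, len(inst.state().Instances); want != have {
		t.Fatalf("initial state: want %d instance(s), have %d", want, have)
	}

	// Condition 2: a later change in discovery shows up, using a
	// deterministic wait instead of an unconditional sleep.
	conn.addInstance("instance-2")
	conn.awaitNextDelivery()
	if want, have := 2, len(inst.state().Instances); want != have {
		t.Fatalf("after update: want %d instance(s), have %d", want, have)
	}
}
```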

@yurishkuro (Contributor)

Please feel free to submit a PR. It's not really my code; I only adapted it to the change in the discovery API.

@seh (Contributor) commented Jun 6, 2017

Can we change the signature of eureka.NewInstancer, or must we keep it as is to retain backward compatibility?

I ask because I can change a few things to allow the tests to work, but I suspect that callers of eureka.NewInstancer might also want a way to wait until at least one update has arrived before proceeding.

@yurishkuro (Contributor)

I am not sure it makes sense to have that kind of control in the constructor. Under normal circumstances the discovery system is already running somewhere by itself, and you have no guarantee as to when it would send an update. That's why the constructor synchronously calls getInstances(): so that once the Instancer is created, it has the current state of the discovery system.

@seh (Contributor) commented Jun 6, 2017

The test eureka.TestInstancerScheduleUpdates is also racy, but it gets by because sleeping for 50 milliseconds is apparently enough to first beat the goroutine updating the cache and then fall in behind its first update.

@seh (Contributor) commented Jun 6, 2017

From the caller's perspective, there's no way to tell the difference between these event sequences:

  • Slow start
    • NewInstancer retrieves instances synchronously
    • NewInstancer returns
    • NewInstancer's goroutine retrieves more instances
  • Fast start
    • NewInstancer retrieves instances synchronously
    • NewInstancer's goroutine retrieves more instances
    • NewInstancer returns

Right now, one unit test assumes the former behavior, and another the latter.

The initial synchronous request doesn't need to be there, because you get the same effect by waiting for the first update to arrive from fargo.ScheduleAppUpdates. It just occurs on a different goroutine. I just figured that I could give callers some control over that delay.

@yurishkuro (Contributor)

But can't we instead control the sequence of events in the mock connection?

Alternatively, the NewInstancer() constructor could be changed to not call getInstances and instead execute conn.ScheduleAppUpdates() right there (rather than in a separate goroutine), as long as there is a way to tell fargo to do the first update immediately, not after a delay. If we go in this direction, then I don't think NewInstancer() needs any additional control; it should simply block until the first update is received to initialize the cache.
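
A rough sketch of that constructor shape, assuming await=true makes fargo deliver the first update immediately; consume, loop, and the exact field set are illustrative, not the actual code:

```go
// Rough sketch of the proposed constructor; details are illustrative.
func NewInstancer(conn fargoConnection, app string, logger log.Logger) *Instancer {
	s := &Instancer{
		conn:   conn,
		app:    app,
		logger: logger,
		quitc:  make(chan chan struct{}),
	}

	done := make(chan struct{})
	// await=true: fargo performs the first fetch eagerly, so this channel
	// carries an update right away instead of after a polling interval.
	updatec := conn.ScheduleAppUpdates(app, true, done)

	// Block here until the initial state arrives; this replaces the
	// separate synchronous getInstances call.
	s.consume(<-updatec)

	go s.loop(updatec, done)
	return s
}
```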

@seh (Contributor) commented Jun 6, 2017

I started implementing your latter proposal (passing true for fargo.ScheduleAppUpdates's "await" parameter), got it wrong once already, and am considering how to revise it.

@seh (Contributor) commented Jun 7, 2017

Please see #547 for a proposed fix.
