Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

race condition during blocking query? #300

Closed
imkira opened this issue Aug 22, 2014 · 9 comments
Closed

race condition during blocking query? #300

imkira opened this issue Aug 22, 2014 · 9 comments

Comments

@imkira
Copy link
Contributor

imkira commented Aug 22, 2014

I am running a cluster but I am able to reproduce (using just a single node) what appears to be a race condition during a blocking query.

the setup

So, in that node I run a consul agent in client mode, and in the same node I run 2 processes:

  • process1
    • last_index = 0
    • call http://127.0.0.1:8500/v1/health/service/myservice?consistent=&index=<last_index>&wait=1ms
    • if response.index != last_index { last_index = response.index; process(response) }
    • sleep 1ms
    • repeat
  • process2
    • call /v1/agent/service/register
    • repeat until the app receives a signal to quit
      • call /v1/agent/check/pass/
      • sleep 1000ms
    • call /v1/agent/service/deregister/

First I run process1 and leave it running. I never stop it.
Then, I run process2. Then I kill it (gracefully by sending it a signal to quit). Then run it again. I repeat this process until I reproduce the bug which happens after a few runs (few < 10).

the bug

It happens on process1.
Sometimes index doesn't change between responses (yes this is expected) but the response's Checks structure changes. I am observing 2 changes that I get in the response's Checks, specifically the addition of a service check (with the same checkID I am calling on process2 and status="unknown") and the second change being status changing to "passing". I am logging all responses I get from the API so I was able to notice this. Even after these 2 changes, there are times where response.index does not change.

Is this intended? Is this some race condition?
Thank you in advance and sorry for the long problem report.

notes

  • Note1: If I remove "if response.index != last_index" all works perfectly (but since process(response) is a heavy task I don't want to do it).
  • Note2: I am using very tiny wait and sleep times in process1 because I am stress testing the system and checking for any inconsistencies.
  • Note3: Until I reproduce the bug, I kill process2 after up to 2 or 3 seconds have elapsed.
  • Note4: I never call two or more process2 simultaneously and I never skip service/register nor service/deregister APIs.
  • Note5: Does this has anything to do with UI - Number of passing checks is off. #251 ?
@imkira
Copy link
Contributor Author

imkira commented Aug 22, 2014

I just noticed the bug is reproduced even more easily if I set longer wait times (like 5 seconds).

@armon
Copy link
Member

armon commented Aug 22, 2014

This may be a dup of #279. There does seem to be some race with the index calculation it seems.

@armon
Copy link
Member

armon commented Aug 22, 2014

Can you provide any scripts you use to trigger this?

@armon
Copy link
Member

armon commented Aug 22, 2014

Fixed by 3330956!

@armon armon closed this as completed Aug 22, 2014
@imkira
Copy link
Contributor Author

imkira commented Aug 25, 2014

Hey @armon thank you for the prompt support!
I got the latest source and ran consul but it doesn't seem to address the bug I am reporting.

Sorry for not putting it together as a test.go file, but here is a gist that reproduces the bug.

https://gist.github.com/imkira/f40e64d00583160c0c9f

As initially reported, I get two bug patterns:

  • len(Checks) between calls to health/service differs (0 vs 1, 1 vs 2, ...)
  • number of Checks matches but newly registered service status doesn't (unknown vs passing)

Let me know if you can fix this.
Thank you in advance.

@armon
Copy link
Member

armon commented Aug 25, 2014

@imkira I am unable to reproduce the error with you test code running against master. Can you verify you re-built against master? Also a gist of the error output could be useful.

@imkira
Copy link
Contributor Author

imkira commented Aug 26, 2014

Yes, I am running the latest version of consul.

~GOPATH/src/github.com/hashicorp/consul git(master) $ git log | head
commit e08e66367c20e829afbcde1f0210d3fe82ece1a7
Author: Armon Dadgar <armon.dadgar@gmail.com>
Date:   Fri Aug 22 15:54:48 2014 -0700

    website: Document node lifecycle. Fixes #282.

commit e0a815772a08148a7b1e89fd183c00ea9b986f14
Author: Armon Dadgar <armon.dadgar@gmail.com>
Date:   Fri Aug 22 15:08:15 2014 -0700

~GOPATH/src/github.com/hashicorp/consul git(master) $ rm $GOPATH/bin/consul
~GOPATH/src/github.com/hashicorp/consul git(master) $ go install
~GOPATH/src/github.com/hashicorp/consul git(master) $ go build -o consul2

I built the app as shown above and I ran it with $GOPATH/bin/consul -data-dir /var/consul/tmp -ui-dir /var/consul/ui as well as with ./consul2 agent -data-dir /var/consul/tmp -ui-dir /var/consul/ui in order to eliminate any mistake of mine of running an old version of consul.
The bug still manifests itself and without a change in results.
There is one thing I forgot to say. Before I ran the the gist, I joined this consul with a remote consul running in server mode. So yes, the local agent is running in client mode.
Would that cause an inconsistency?

As for the output, I ran the gist program 10 times and posted the results as a comment to the gist

https://gist.github.com/imkira/f40e64d00583160c0c9f

Let me know of any other way I can help you track down this bug.

@imkira
Copy link
Contributor Author

imkira commented Aug 26, 2014

Oh my oh my! My bad!
The devil was in the details. The node I was joining to was using 0.3.1 so yes, it was not fixed!
When I ran the local agent in server mode and stopped joining, the bug you said did not manifest at all!
I believe this is fixed and that it is not just because the local agent is not forwarding the request to another node in the cluster, though I am eager to reconfirm these results again as soon as the next release gets out (for the peace of mind :). I will let you know again when I do it!
Thank you a lot and sorry for the confusion.

@armon
Copy link
Member

armon commented Aug 26, 2014

No worries! Glad it is working.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants