Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't report a cluster as ready to work until node connection protocol has completed #1771

Merged
merged 1 commit into from
Dec 6, 2017

Conversation

jtfmumm
Copy link
Contributor

@jtfmumm jtfmumm commented Dec 5, 2017

Previous to this commit, we were reporting ready to work at
a boundary as soon as the TCP connection was established.
Since we are not actually ready to work until all boundary
connect protocol messages have been received, this changes that.

Closes #1763

@slfritchie
Copy link
Contributor

There are two tests failing, test_restart_pony and test_restart_machida in testing/correctness/tests/restart_without_resilience/restart_without_resilience.py. I can duplicate the failures by running make integration-tests-testing-correctness-tests-restart_without_resilience verbose=true debug=true on my MacBook.

The last 5 lines of the initializer's output in each test are:

Muting TCPSource
Muting TCPSource
Muting DataChannel
The same Initializable reported being ready to work twice
This should never happen: failure in /Users/scott/s/src/wallaroo/lib/wallaroo/core/initialization/local_topology.pony at line 1637

The full stdout from the initializer process for both tests is at https://gist.github.com/slfritchie/a3dc74b579708ba2a68b50403b4145d3

@jtfmumm
Copy link
Contributor Author

jtfmumm commented Dec 5, 2017

Ok I think I understand that error. I'll fix that tomorrow morning. Thanks.

Previous to this commit, we were reporting ready to work at
a boundary as soon as the TCP connection was established.
Since we are not actually ready to work until all boundary
connect protocol messages have been received, this changes that.
@jtfmumm
Copy link
Contributor Author

jtfmumm commented Dec 6, 2017

@slfritchie That error was because I was overreaching in my initial attempt. I've scaled it back to just what we need and tests are passing now.

@jtfmumm
Copy link
Contributor Author

jtfmumm commented Dec 6, 2017

@SeanTAllen This makes it so that we now wait for the boundary connect protocol process to complete before we say we're ready to work.

@SeanTAllen SeanTAllen changed the title Wait for boundary connect protocol to report ready to work Don't report a cluster as ready to work until node connection protocol has completed Dec 6, 2017
@SeanTAllen
Copy link
Contributor

@jtfmumm i updated the title

@jtfmumm
Copy link
Contributor Author

jtfmumm commented Dec 6, 2017

@SeanTAllen Ok

Copy link
Contributor

@slfritchie slfritchie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix works well: it doesn't have any transient failures that I can see in a few hundred attempts. @SeanTAllen any additional review?

@SeanTAllen SeanTAllen merged commit 678a4ae into master Dec 6, 2017
wallaroolabs-wally added a commit that referenced this pull request Dec 6, 2017
@SeanTAllen
Copy link
Contributor

@slfritchie nope. i merged.

@SeanTAllen SeanTAllen deleted the connect-protocol branch December 6, 2017 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants