
Wait robustness #504

Merged · 12 commits from wait-robustness into master · Dec 10, 2017

Conversation

rnorth (Member, PR author) commented on Nov 26, 2017:

We've been noticing occasional random test failures, particularly in the Selenium container test suites, in addition to troubling log messages where the VNC recording container has had to be restarted (within its retry budget of 3), and possibly corrupt/cut-off video files (#466). The latter problem looks to be a race condition: Selenium had started listening OK, but the VNC server was not always available yet.

So, I've done some refactoring. In brief:

  • We now allow a list of multiple startup liveness-check ports to be defined for a container. This helps, e.g., with the browser containers, where it lets us wait for both the Selenium and VNC ports to be listening. I think this will help eliminate the random flapping tests in this area.

  • Also, we now have a WaitAllStrategy that lets more than one wait strategy be combined. Again, for the browser containers we now wait for (a) a log message and (b) the listening ports to be available (see the sketch after this list).

  • For cases where we check running state from within the container, I've added one additional command that can identify listening ports.

  • I've broken out some aspects of the wait strategies/port detection into separate classes and used this to help improve test coverage.
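To make the WaitAllStrategy point concrete, here is a minimal sketch of wiring the composite up for a browser container. The package paths, fluent API, and the log-message regex are assumptions based on the description above, not verbatim from this PR:

```java
import org.testcontainers.containers.wait.strategy.HostPortWaitStrategy;
import org.testcontainers.containers.wait.strategy.LogMessageWaitStrategy;
import org.testcontainers.containers.wait.strategy.WaitAllStrategy;
import org.testcontainers.containers.wait.strategy.WaitStrategy;

// Ready only when BOTH conditions hold: (a) the ready message has been
// logged, and (b) the liveness-check ports are accepting connections.
WaitStrategy browserWait = new WaitAllStrategy()
        .withStrategy(new LogMessageWaitStrategy()
                .withRegEx(".*RemoteWebDriver instances should connect to.*")) // example regex, an assumption
        .withStrategy(new HostPortWaitStrategy());
```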

Review thread on diff hunk @@ -337,6 +342,32 @@ protected Integer getLivenessCheckPort():

    /**
     * @return the port on which to check if the container is ready
     */

Member: should be plural after this change

rnorth (author): 👍

    /**
     * @return the port on which to check if the container is ready
     */
    protected List<Integer> getLivenessCheckPorts() {

Member: since the implementation uses a Set as the collection type, why don't we return a Set here as well? That would make more sense; a List also implies ordering.

rnorth (author): 👍

    Set<Link> links = dockerClient.listContainersCmd()
            .withStatusFilter("running")
            .exec().stream()
            .filter(container -> container.getNames()[0].endsWith(linkableContainer.getContainerName()))
            .map(container -> new Link(container.getNames()[0], alias))

Member: 👍

rnorth (author): Ah yes - I forgot to mention this in the summary. It seems that we had a redundant step here. While we're planning to kill off links, it seemed like a good fix to make all the same.

Before:

    if (null == port) {
        log.debug("Liveness check port of {} is empty. Not waiting.", container.getContainerName());

After:

    final List<Integer> externalLivenessCheckPorts = getLivenessCheckPorts();
    if (null == externalLivenessCheckPorts || externalLivenessCheckPorts.isEmpty()) {

Member: since getLivenessCheckPorts() returns a collection, it would be nice to mark it as @NotNull to avoid such checks and make the contract stricter (a nullable collection doesn't make a lot of sense anyway :) )

rnorth (author): Good idea!
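As a sketch of that stricter contract, using the org.jetbrains.annotations.NotNull annotation that already appears elsewhere in this diff (the method body here is a placeholder):

```java
import org.jetbrains.annotations.NotNull;

import java.util.Collections;
import java.util.Set;

// Annotating the return type makes the contract explicit: "no ports" is an
// empty set, never null, so callers can drop the null check entirely.
@NotNull
protected Set<Integer> getLivenessCheckPorts() {
    return Collections.emptySet(); // placeholder body for illustration
}
```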


    private void tryPort(Integer internalPort) {
        String[][] commands = {
            {"/bin/sh", "-c", format("cat /proc/net/tcp | awk '{print $2}' | grep :%x && echo %s", internalPort, SUCCESS_MARKER)},

Member: 👍
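For context on the grep :%x above: the second column of /proc/net/tcp holds each socket's local address as a hex-encoded IP:port pair, so the decimal port must be rendered in hex before matching. A small illustration (variable names hypothetical; note the kernel prints the hex digits in upper case, so a case-insensitive match is the safer choice):

```java
import static java.lang.String.format;

// A listener on port 4444 shows up in /proc/net/tcp as e.g. "00000000:115C",
// because 4444 decimal is 0x115C - hence the hex conversion before grepping.
int internalPort = 4444;
String hexPort = format("%x", internalPort); // "115c"
String command = format(
        "cat /proc/net/tcp | awk '{print $2}' | grep -i :%s && echo %s",
        hexPort, "SUCCESS_MARKER");
```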


    @Before
    public void setUp() {
        nginx = new GenericContainer<>("nginx:1.9.4");

Member: why not as a rule? :)

rnorth (author): You're right - I thought this avoided a problem, but it doesn't!
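For reference, the reviewer's suggestion would look something like the following - GenericContainer can act as a JUnit 4 rule, so the lifecycle is managed automatically (the withExposedPorts(80) call is an illustrative addition):

```java
import org.junit.Rule;
import org.testcontainers.containers.GenericContainer;

// JUnit 4 starts the container before each test and stops it afterwards;
// no manual @Before/@After bookkeeping is needed.
@Rule
public GenericContainer<?> nginx = new GenericContainer<>("nginx:1.9.4")
        .withExposedPorts(80);
```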

Review thread on diff hunk @@ -149,9 +149,6 @@ protected void optionallyMapResourceParameterAsVolume(@NotNull String paramName,:

    @Override
    protected abstract Integer getLivenessCheckPort();

Member: this breaks binary compatibility

rnorth (author): This method, with the same signature, is in GenericContainer, so I was actually thinking that overriding it here makes no sense. (TBH I didn't realise you could override a concrete method with an abstract one, so this gives me headaches 🤣.) Happy to withdraw if you can see the problem more clearly than I can, though.
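For anyone else surprised by this: re-declaring an inherited concrete method as abstract is indeed legal Java. A self-contained illustration (class names hypothetical):

```java
class Base {
    protected Integer getLivenessCheckPort() {
        return 8080; // concrete default implementation
    }
}

abstract class Intermediate extends Base {
    // Legal: hides the inherited default and forces every concrete
    // subclass of Intermediate to supply its own implementation.
    @Override
    protected abstract Integer getLivenessCheckPort();
}

class Concrete extends Intermediate {
    @Override
    protected Integer getLivenessCheckPort() {
        return 9090; // must be provided again here
    }
}
```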

rnorth (author) commented on Nov 26, 2017:

Hmm - based on the latest tests on Travis, this hasn't improved test stability on CI (it's actually worsened, as I removed the 3x retries for starting VNC containers...).

It looks like we still have the original issue of the VNC recorder connecting too soon, hitting a Connection refused error. So I'm wondering whether I need to dig into how the latest Docker on Linux works, and whether it might be doing the same proxying that led us to add the in-container port checks.

Thinking out loud, maybe we should just do the port checks in-container all the time...

rnorth (author) commented on Nov 26, 2017:

Just to add: I'm running this apparently flaky test locally 100 times (on a Mac) to see whether there is any flakiness here.

rnorth (author) commented on Nov 27, 2017:

I left a single selenium test looping overnight in two different modes:

  • Docker for Mac (internal check): 100% success rate
  • On Linux (inside a container with the Docker socket mounted): ~20% success rate (bear in mind that this is with VNC recorder retries set to zero)

The Travis build on this branch was also not healthy.

So, I suspect that our external check is not good enough, even when run from a Linux host. There's no userland proxy involved, but I'm wondering if there's still a race (I'm not sure of the mechanics of the iptables routing set up between containers).

The latest commit, 9c716a8, makes a change that seems to be solid: regardless of environment, always run the internal check, followed by the external check. So far this works 100% of the time in both of the modes mentioned above.
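A rough sketch of the "always run both, in order" approach described above; the class and field names are illustrative, not the actual Testcontainers types introduced by this PR:

```java
import java.time.Instant;
import java.util.concurrent.Callable;

class SequentialPortWait {
    private final Callable<Boolean> internalCheck; // e.g. the /proc/net/tcp grep inside the container
    private final Callable<Boolean> externalCheck; // e.g. a plain TCP connect from the host

    SequentialPortWait(Callable<Boolean> internalCheck, Callable<Boolean> externalCheck) {
        this.internalCheck = internalCheck;
        this.externalCheck = externalCheck;
    }

    void waitUntilReady(long timeoutSeconds) throws Exception {
        Instant deadline = Instant.now().plusSeconds(timeoutSeconds);
        while (Instant.now().isBefore(deadline)) {
            // Internal first: the server must be bound inside the container;
            // only then verify reachability through Docker's port routing.
            if (internalCheck.call() && externalCheck.call()) {
                return;
            }
            Thread.sleep(100); // brief pause before retrying both checks
        }
        throw new IllegalStateException("Timed out waiting for listening ports");
    }
}
```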

kiview (Member) commented on Nov 29, 2017:

Also keep in mind that there might be some bugs/race conditions in our LogMessageWaitStrategy: #455

Did you notice any problems with the LogMessageWaitStrategy for the Selenium containers?

rnorth (author) commented on Nov 29, 2017 via email.

kiview (Member) commented on Nov 30, 2017:

We've observed this behavior on Linux (Ubuntu and Fedora) as well, so it does not seem to be Windows-specific.

"Wait for all sequentially" is a nice, pragmatic solution; it's just a shame that it papers over the underlying problem.

rnorth force-pushed the wait-robustness branch 2 times, most recently from 496096b to 9c716a8, on December 7, 2017
rnorth changed the title from "WIP: Wait robustness" to "Wait robustness" on Dec 7, 2017
    protected Set<Integer> getLivenessCheckPorts() {
        final Set<Integer> result = new HashSet<>();
        if (exposedPorts.size() > 0) {
            result.addAll(exposedPorts.stream()

Member: Can we put these blocks into private methods? Something like getExposedPortsAsInt and getBoundPortsAsInt.
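A sketch of the extraction being suggested; the helper names come from the comment itself, while the surrounding members (exposedPorts, getMappedPort) and the body of getBoundPortsAsInt are assumptions:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

protected Set<Integer> getLivenessCheckPorts() {
    final Set<Integer> result = new HashSet<>();
    result.addAll(getExposedPortsAsInt());
    result.addAll(getBoundPortsAsInt());
    return result;
}

private Set<Integer> getExposedPortsAsInt() {
    return exposedPorts.stream()
            .map(this::getMappedPort) // container port -> resolved host port (assumed helper)
            .collect(Collectors.toSet());
}

private Set<Integer> getBoundPortsAsInt() {
    // Parsing of the explicitly bound ports is elided; the point is only
    // that each block gets its own well-named helper.
    return Collections.emptySet();
}
```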

Before:

    Set<Link> links = dockerClient.listContainersCmd().exec().stream()

After:

    Set<Link> links = dockerClient.listContainersCmd()
            .withStatusFilter("running")
            .exec().stream()
            .filter(container -> container.getNames()[0].endsWith(linkableContainer.getContainerName()))
            .map(container -> new Link(container.getNames()[0], alias))

Member: I would like this in a private method as well (since I really have to read through the stream to see what's happening here). Maybe findLinksForX (whatever X is)?
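One possible shape for that extraction; the method name and parameter list here are guesses in the spirit of the comment:

```java
// Resolves the running containers whose name matches the linkable container
// into Link objects (requires java.util.stream.Collectors).
private Set<Link> findLinksToLinkableContainer(LinkableContainer linkableContainer, String alias) {
    return dockerClient.listContainersCmd()
            .withStatusFilter("running")
            .exec().stream()
            .filter(container -> container.getNames()[0].endsWith(linkableContainer.getContainerName()))
            .map(container -> new Link(container.getNames()[0], alias))
            .collect(Collectors.toSet());
}
```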


    if (shouldCheckWithCommand()) {
        List<Integer> exposedPorts = container.getExposedPorts();
        final Set<Integer> internalPorts = exposedPorts.stream()

Member: Again, I would prefer to see a private method, getInternalPorts(). Sorry for being so picky about private methods - that's a coding style I prefer, and you don't have to follow it if you aren't a fan. I just think it makes it easier to get a high-level understanding of what's happening, especially for people less familiar with the codebase 😁

rnorth (author): No, that's a good point, so thanks for bringing it up (all three!)

rnorth (author): Will fix :)

rnorth (author): If only we had Extensions...!

kiview (Member) commented on Dec 8, 2017:

@rnorth I just realized that in order to fix #455 we'd need something like a WaitForAnyWaitStrategy composite.
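A hypothetical sketch of such an "any of" composite - not part of this PR, and all names are invented. It completes as soon as the first child strategy stops blocking:

```java
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class WaitForAnyStrategy {
    private final List<Runnable> strategies; // each blocks until its own condition holds

    WaitForAnyStrategy(List<Runnable> strategies) {
        this.strategies = strategies;
    }

    void waitUntilReady() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(strategies.size());
        try {
            CompletionService<Void> done = new ExecutorCompletionService<>(pool);
            strategies.forEach(s -> done.submit(s, null));
            done.take().get(); // unblocks when the first strategy succeeds
        } finally {
            pool.shutdownNow(); // abandon any strategies still waiting
        }
    }
}
```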

rnorth (author) commented on Dec 9, 2017:

@kiview:

> I just realized that in order to fix #455 we'd need something like a WaitForAnyWaitStrategy composite.

Yes, we would. Would it be OK with you if we tackle #455 separately, though?

kiview (Member) commented on Dec 9, 2017:

@rnorth Of course - it's clearly a separate feature and concern. (The formatting in your last post is off, btw ;) )

rnorth merged commit f1ce2b1 into master on Dec 10, 2017
rnorth deleted the wait-robustness branch on December 10, 2017 at 08:40