Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HEALTHCHECK to Linux Docker image. #128

Closed
wants to merge 3 commits into from

Conversation

emetsger
Copy link

@emetsger emetsger commented Apr 22, 2024

Summary

I ran across an issue using testcontainers in golang, specifically when testcontainers are used by multiple test packages in a single project. Go builds a test binary for each package and runs them simultaneously.

Example project tree, where test binaries for packages a, b, c, and d will run simultaneously, attempting to use a testcontainer:

.
├── a
│   └── a_test.go
├── b
│   └── b_test.go
├── c
│   └── c_test.go
├── d
│   └── d_test.go
├── go.mod
└── go.sum

One of the test binaries will win the race to create the ryuk container, and it becomes present in the list of docker containers returned by the Docker API.

The problem is that it the ryuk container can be exposed by the Docker API before the container is ready to accept connections. A testcontainer client can attempt to connect to the container before it is accepting connections.

This results in error messages tests like:

2024/04/21 22:25:10 🔥 Reaper obtained from Docker for this test session 99998b629eea57023f49aefc9eda5406c84ce4c91d0c6005c1b8c00dcbbce0d7
2024/04/21 22:25:10 failed to start container: dial tcp 127.0.0.1:32917: connect: connection refused: Connecting to Ryuk on localhost:32917 failed: connecting to reaper failed: failed to create container

Clone this test case repo and run go test -count=1 -v ./... and observe the errors.

Run go test -count=1 -p 1 -v ./... and tests should pass.

Prior issues

Race conditions (and the connection refused error message) have been the topic of previous issues, c.f.

But this PR addresses the situation described here by adding a simple healthcheck using netstat (already present in the alpine image).

Future Work

Once healthcheck information is exposed by the ryuk container, the lookupReaperContainer func can retry until the status is healthy.

I'm not on a Windows platform, and I'm not familiar with nanoserver and whether it comes with Powershell, and whether a command like netstat -an | Select-String 8080 will work. This PR only adds a HEALTHCHECK for the linux platform.

Unintended Consequences

Adding a healthcheck to ryuk may increase its perceived startup time.

Example Healthcheck Output

Before healthcheck (current state, i.e. main):

        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 25660,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2024-04-22T01:27:19.276032615Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },

After healthcheck (this PR):

        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 1703,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2024-04-22T01:46:01.038259025Z",
            "FinishedAt": "0001-01-01T00:00:00Z",
            "Health": {
                "Status": "healthy",
                "FailingStreak": 0,
                "Log": [
                    {
                        "Start": "2024-04-22T01:46:03.039126401Z",
                        "End": "2024-04-22T01:46:03.102456526Z",
                        "ExitCode": 0,
                        "Output": ""
                    },
                    {
                        "Start": "2024-04-22T01:46:05.107285902Z",
                        "End": "2024-04-22T01:46:05.151179319Z",
                        "ExitCode": 0,
                        "Output": ""
                    },
                    {
                        "Start": "2024-04-22T01:46:07.154951153Z",
                        "End": "2024-04-22T01:46:07.243256403Z",
                        "ExitCode": 0,
                        "Output": ""
                    },
                    {
                        "Start": "2024-04-22T01:46:09.247204821Z",
                        "End": "2024-04-22T01:46:09.319083404Z",
                        "ExitCode": 0,
                        "Output": ""
                    }
                ]
            }
        },

Copy link
Contributor

@HofmeisterAn HofmeisterAn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that it the ryuk container can be exposed by the Docker API before the container is ready to accept connections. A testcontainer client can attempt to connect to the container before it is accepting connections.

Out of curiosity and to understand and compare the behavior (issue) with the .NET implementation: Does Go create a single Ryuk instance for all packages? Does one package connect successfully and the others fail? Does it retry to connect if it fails?

@@ -25,3 +25,4 @@ FROM ${BASE_IMAGE}
COPY --from=build /bin/ryuk /bin/ryuk
CMD ["/bin/ryuk"]
LABEL org.testcontainers.ryuk=true
HEALTHCHECK --start-period=5s --interval=2s --timeout=2s --retries=10 CMD netstat -ltn | grep -q ":8080"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think the CMD will work on Windows (nanoserver).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, yes it won't. Sorry- I was a bit sleep deprived making this PR. I don't have access to a windows system but I can do some research to find a simple command that works in nanoserver!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HofmeisterAn

Since I'm not on windows, and am not familiar with nanoserver, I thought it best to revert the changes on windows/Dockerfile. It would be best for someone with Windows experience to implement and test a Windows health check.

@emetsger
Copy link
Author

emetsger commented Apr 22, 2024

Does Go create a single Ryuk instance for all packages?

Yes, exactly.

Does one package connect successfully and the others fail? Does it retry to connect if it fails?

One package typically connects successfully. With the other packages, it's a race: if they get the container before it's ready to accept connections, they fail.

In my testing, all remaining packages fail. But I can imagine in a larger codebase, some packages would succeed.

How does .NET work: is it a single ryuk instance? For example, can you launch two tests in parallel and re-create the behavior (issue)?

@emetsger
Copy link
Author

With respect to the languages/platforms supported by testcontainers:

If other platforms are affected by this same issue, then they'll need a patch similar to testcontainers/testcontainers-go#2508, which actually uses the results of the healthcheck before returning a (healthy) ryuk container.

@HofmeisterAn
Copy link
Contributor

How does .NET work: is it a single ryuk instance? For example, can you launch two tests in parallel and re-create the behavior (issue)?

Thanks for the quick explanation. .NET behaves different here. It creates a Ryuk instance for each test assembly (test project). Test assemblies are kind of isolated from each other. We do not need to deal with synchronization here.

@emetsger
Copy link
Author

I did encounter some things that I can't explain during my investigation, so I wanted to add them here for full transparency.

TL;DR: I failed to identify which component was denying the connection.


I have an overly-simple mental model of how Docker API+Docker Engine work together with their containers, so these issues could just be my ignorance.

The error message indicates the connection was denied, not a timeout.

So who denied the connection? The Docker engine? It wasn't the ryuk container (see below).

I did not observe the denied TCP connection

When I was troubleshooting, I used tcpdump - on the docker vm itself - to obtain traffic destined for ryuk. I could see successful connections/traffic, but I never saw a three-way handshake attempt that was denied.

@emetsger emetsger changed the title Add HEALTHCHECK to Docker images. Add HEALTHCHECK to Linux Docker image. Apr 23, 2024
@mdelapenya
Copy link
Member

Out of curiosity and to understand and compare the behavior (issue) with the .NET implementation: Does Go create a single Ryuk instance for all packages? Does one package connect successfully and the others fail? Does it retry to connect if it fails?

hey @HofmeisterAn, yes, Go creates one Ryuk container per test session, which implies some synchronisation logic to check if the ryuk container is in the running state before creating a new one.

@emetsger for the repro repository, I cloned it and it works for me with both values of -p. I tried running it with Docker Desktop and Testcontainers Cloud, with the same result:

Using all processors

=== RUN   TestA
2024/05/09 07:32:17 github.com/testcontainers/testcontainers-go - Connected to docker: 
  Server Version: 78+testcontainerscloud (via Testcontainers Desktop 1.11.0)
  API Version: 1.43
  Operating System: Ubuntu 22.04.4 LTS
  Total Memory: 15779 MB
  Resolved Docker Host: tcp://127.0.0.1:62301
  Resolved Docker Socket Path: /var/run/docker.sock
  Test SessionID: af1b8c124505cd4b386767859edf7b742c07386a7d37352650522cfcb2f4e6de
  Test ProcessID: f234fbe5-4793-4cf1-923b-46edffa8668b
2024/05/09 07:32:17 🐳 Creating container for image testcontainers/ryuk:0.7.0
2024/05/09 07:32:17 ✅ Container created: 783b9d4e2406
2024/05/09 07:32:17 🐳 Starting container: 783b9d4e2406
2024/05/09 07:32:18 ✅ Container started: 783b9d4e2406
2024/05/09 07:32:18 🚧 Waiting for container id 783b9d4e2406 image: testcontainers/ryuk:0.7.0. Waiting for: &{Port:8080/tcp timeout:<nil> PollInterval:100ms}
2024/05/09 07:32:18 🔔 Container is ready: 783b9d4e2406
2024/05/09 07:32:18 🐳 Creating container for image docker.io/postgres:16-alpine
2024/05/09 07:32:18 ✅ Container created: e0cb0246b173
2024/05/09 07:32:18 🐳 Starting container: e0cb0246b173
2024/05/09 07:32:19 ✅ Container started: e0cb0246b173
2024/05/09 07:32:19 🚧 Waiting for container id e0cb0246b173 image: docker.io/postgres:16-alpine. Waiting for: &{timeout:<nil> deadline:0x140003615b8 Strategies:[0x1400037d9b0]}
2024/05/09 07:32:21 🔔 Container is ready: e0cb0246b173
2024/05/09 07:32:21 🐳 Terminating container: e0cb0246b173
2024/05/09 07:32:21 🚫 Container terminated: e0cb0246b173
--- PASS: TestA (4.10s)
PASS
ok  	tc-debug/a	4.281s
=== RUN   TestB
2024/05/09 07:32:17 github.com/testcontainers/testcontainers-go - Connected to docker: 
  Server Version: 78+testcontainerscloud (via Testcontainers Desktop 1.11.0)
  API Version: 1.43
  Operating System: Ubuntu 22.04.4 LTS
  Total Memory: 15779 MB
  Resolved Docker Host: tcp://127.0.0.1:62301
  Resolved Docker Socket Path: /var/run/docker.sock
  Test SessionID: af1b8c124505cd4b386767859edf7b742c07386a7d37352650522cfcb2f4e6de
  Test ProcessID: 947a4325-eb63-4695-840c-87c38885c7ba
2024/05/09 07:32:18 🔥 Reaper obtained from Docker for this test session 783b9d4e24063bf41b22795afe79f38754769d21c5852532f25708dc849f72ea
2024/05/09 07:32:18 🐳 Creating container for image docker.io/postgres:16-alpine
2024/05/09 07:32:18 ✅ Container created: 8dba32be8360
2024/05/09 07:32:18 🐳 Starting container: 8dba32be8360
2024/05/09 07:32:18 ✅ Container started: 8dba32be8360
2024/05/09 07:32:18 🚧 Waiting for container id 8dba32be8360 image: docker.io/postgres:16-alpine. Waiting for: &{timeout:<nil> deadline:0x140003d15d0 Strategies:[0x140003ed9e0]}
2024/05/09 07:32:21 🔔 Container is ready: 8dba32be8360
2024/05/09 07:32:21 🐳 Terminating container: 8dba32be8360
2024/05/09 07:32:21 🚫 Container terminated: 8dba32be8360
--- PASS: TestB (3.79s)
PASS
ok  	tc-debug/b	4.237s
=== RUN   TestC
2024/05/09 07:32:17 github.com/testcontainers/testcontainers-go - Connected to docker: 
  Server Version: 78+testcontainerscloud (via Testcontainers Desktop 1.11.0)
  API Version: 1.43
  Operating System: Ubuntu 22.04.4 LTS
  Total Memory: 15779 MB
  Resolved Docker Host: tcp://127.0.0.1:62301
  Resolved Docker Socket Path: /var/run/docker.sock
  Test SessionID: af1b8c124505cd4b386767859edf7b742c07386a7d37352650522cfcb2f4e6de
  Test ProcessID: f38b47e6-f595-404a-8cb0-2534fce5916a
2024/05/09 07:32:18 🔥 Reaper obtained from Docker for this test session 783b9d4e24063bf41b22795afe79f38754769d21c5852532f25708dc849f72ea
2024/05/09 07:32:18 🐳 Creating container for image docker.io/postgres:16-alpine
2024/05/09 07:32:18 ✅ Container created: 614b2727db5c
2024/05/09 07:32:18 🐳 Starting container: 614b2727db5c
2024/05/09 07:32:18 ✅ Container started: 614b2727db5c
2024/05/09 07:32:18 🚧 Waiting for container id 614b2727db5c image: docker.io/postgres:16-alpine. Waiting for: &{timeout:<nil> deadline:0x14000393698 Strategies:[0x140003b39b0]}
2024/05/09 07:32:21 🔔 Container is ready: 614b2727db5c
2024/05/09 07:32:21 🐳 Terminating container: 614b2727db5c
2024/05/09 07:32:21 🚫 Container terminated: 614b2727db5c
--- PASS: TestC (3.88s)
PASS
ok  	tc-debug/c	4.201s
=== RUN   TestD
2024/05/09 07:32:18 github.com/testcontainers/testcontainers-go - Connected to docker: 
  Server Version: 78+testcontainerscloud (via Testcontainers Desktop 1.11.0)
  API Version: 1.43
  Operating System: Ubuntu 22.04.4 LTS
  Total Memory: 15779 MB
  Resolved Docker Host: tcp://127.0.0.1:62301
  Resolved Docker Socket Path: /var/run/docker.sock
  Test SessionID: af1b8c124505cd4b386767859edf7b742c07386a7d37352650522cfcb2f4e6de
  Test ProcessID: 5d324800-b68a-4687-90d1-b1216078c8f9
2024/05/09 07:32:18 🔥 Reaper obtained from Docker for this test session 783b9d4e24063bf41b22795afe79f38754769d21c5852532f25708dc849f72ea
2024/05/09 07:32:18 🐳 Creating container for image docker.io/postgres:16-alpine
2024/05/09 07:32:18 ✅ Container created: a533d8ad9249
2024/05/09 07:32:18 🐳 Starting container: a533d8ad9249
2024/05/09 07:32:18 ✅ Container started: a533d8ad9249
2024/05/09 07:32:18 🚧 Waiting for container id a533d8ad9249 image: docker.io/postgres:16-alpine. Waiting for: &{timeout:<nil> deadline:0x140003f95d0 Strategies:[0x140004659e0]}
2024/05/09 07:32:21 🔔 Container is ready: a533d8ad9249
2024/05/09 07:32:21 🐳 Terminating container: a533d8ad9249
2024/05/09 07:32:21 🚫 Container terminated: a533d8ad9249
--- PASS: TestD (3.83s)
PASS
ok  	tc-debug/d	4.425s

@lachauj
Copy link

lachauj commented May 17, 2024

for the repro repository, I cloned it and it works for me with both values of -p. I tried running it with Docker Desktop and Testcontainers Cloud, with the same result:

@mdelapenya I have the same result with testcontainers-go and Docker Desktop, but when I run Docker with Colima I randomly have the issue when TestContainers wants to connect to Ryuk (95% of the time it triggers the issue)

@mdelapenya
Copy link
Member

Pinging @eddumelendez and @kiview as this is a cross-lang feature.

@kiview
Copy link
Member

kiview commented May 21, 2024

Can we decouple the switch to using a HEALTHCHECK from fixing the race condition? While I am not against using a HEALTHCHECK here per-se, I would prefer to first fix the race condition (this should be possible with the current waiting logic, right?). Also, I am fairly certain, that HEALTHCHECK is not supported by all potential container runtimes (e.g. Podman). While we generally aim for first-class support of Docker, adding a substantial regression for alternative container runtimes might send a bad signal to the community, especially if it can be avoided. And last, we should benchmark the performance impact of switching to HEALTCHECK as well (if we decide to use it).

@HofmeisterAn I guess we didn't fully sync up on the current approach to test session semantics (and technical measures to achieve them) across languages, back when we introduced this in Go (and AFAIK in Node as well). From a technical PoV, Go and .NET should be in a similar situation.

@mdelapenya
Copy link
Member

Closing as per testcontainers/testcontainers-go#2508 (comment)

Thanks!

@mdelapenya mdelapenya closed this Jun 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants