[Bug]: Ryuk might shutdown reused Container while connected to it #2445

Open
mdonkers opened this issue Mar 21, 2024 · 14 comments
Labels
bug An issue with the library

Comments

@mdonkers
Contributor

mdonkers commented Mar 21, 2024

Testcontainers version

0.29.1

Using the latest Testcontainers version?

Yes

Host OS

Linux

Host arch

x86

Go version

1.22

Docker version

unrelated

Docker info

unrelated

What happened?

As I was improving some integration tests in our own project, I sometimes noticed failures after we switched to reusing containers. Since I was improving the run time of the tests, I was executing the ITs several times in a row to make sure code compilation was not included in the timing (while go clean -testcache && make integration-test; do :; done). I had a suspicion that running the tests in quick succession was related to the failures, so I did some further investigation.

From the testcontainers output I saw several times that more than one container was created (despite Reuse: true):

2024/03/21 23:10:58 🐳 Creating container for image testcontainers/ryuk:0.6.0
2024/03/21 23:10:58 ✅ Container created: efa32a26eb2e
2024/03/21 23:10:58 🐳 Starting container: efa32a26eb2e
2024/03/21 23:10:59 ✅ Container started: efa32a26eb2e
2024/03/21 23:10:59 🚧 Waiting for container id efa32a26eb2e image: testcontainers/ryuk:0.6.0. Waiting for: &{Port:8080/tcp timeout:<nil> PollInterval:100ms}
2024/03/21 23:10:59 🔔 Container is ready: efa32a26eb2e
2024/03/21 23:10:59 ✅ Container started: 0c8e757faff1
2024/03/21 23:10:59 🚧 Waiting for container id 0c8e757faff1 image: clickhouse/clickhouse-server:24.2-alpine. Waiting for: &{timeout:0xc000799058 URL:0x13356e0 Driver:clickhouse Port:9000/tcp startupTimeout:60000000000 PollInterval:100ms query:SELECT 1}
2024/03/21 23:10:59 🔔 Container is ready: 0c8e757faff1
2024/03/21 23:10:59 ✅ Container started: 0c8e757faff1
2024/03/21 23:10:59 🚧 Waiting for container id 0c8e757faff1 image: clickhouse/clickhouse-server:24.2-alpine. Waiting for: &{timeout:0xc0005849e8 URL:0x13356e0 Driver:clickhouse Port:9000/tcp startupTimeout:60000000000 PollInterval:100ms query:SELECT 1}
...
{"status":"error","errorType":"bad_data","error":"dial tcp [::1]:32918: connect: connection refused"}
...
2024/03/21 23:11:07 🐳 Creating container for image clickhouse/clickhouse-server:24.2-alpine
2024/03/21 23:11:07 🚧 Waiting for container id 7e037d775014 image: clickhouse/clickhouse-server:24.2-alpine. Waiting for: &{timeout:0xc0046bf638 URL:0x13356e0 Driver:clickhouse Port:9000/tcp startupTimeout:60000000000 PollInterval:100ms query:SELECT 1}
2024/03/21 23:11:08 🔔 Container is ready: 7e037d775014
...

As you can see, in the same test run a new ClickHouse container gets created even though we never call Terminate. Tests start failing because they can no longer connect to the old container (to which they hold a connection based on the mapped port).
My suspicion was that Ryuk was for some reason terminating the 'old' ClickHouse container, still live from a previous run.

Looking at the code, this indeed appears to be the cause:

  1. When starting a new test run, because containers are reused based on their name, testcontainers finds the 'old', still-running ClickHouse:
    c, err := p.findContainerByName(ctx, req.Name)
  2. For Ryuk, however, a new Container is created based on the SessionID:
    r, err := reuseOrCreateReaper(context.WithValue(ctx, core.DockerHostContextKey, p.host), sessionID, p)
  3. The 'old' Ryuk, belonging to the previous test run, won't get a new connection, so after 10s it shuts itself down and takes the still-running ClickHouse container with it (a minimal sketch of the request shape follows this list).
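
For illustration, here is a minimal sketch of the kind of request involved. It is not our exact code; the image, name and wait strategy are placeholders for our setup:

package integration

import (
	"context"

	"github.com/testcontainers/testcontainers-go"
	"github.com/testcontainers/testcontainers-go/wait"
)

// startClickHouse requests a reusable container. With Reuse: true the lookup
// happens by the fixed Name only (step 1 above), while the Reaper is created
// or looked up per SessionID (step 2), so the two can get out of sync.
func startClickHouse(ctx context.Context) (testcontainers.Container, error) {
	req := testcontainers.ContainerRequest{
		Image:        "clickhouse/clickhouse-server:24.2-alpine",
		Name:         "otel-clickhouse",
		ExposedPorts: []string{"9000/tcp"},
		WaitingFor:   wait.ForListeningPort("9000/tcp"),
	}
	return testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
		ContainerRequest: req,
		Started:          true,
		Reuse:            true,
	})
}

On the second run this returns the container created by the first run, while a brand-new Reaper is started for the new session; the Reaper from the first session is the one that eventually removes the container.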

To fix this, my proposal would be to also add the SessionID to the 'reusable' container name, either implicitly, via some flag, or by exposing the SessionID so the user can add it themselves (currently it is part of the internal package and so not reachable).

I'm happy to work on a fix, if I can get any suggestion for a preferred approach.

Relevant log output

docker ps -a output after the failure, showing two reapers and one CH container:

$ docker ps -a
CONTAINER ID   IMAGE                                      COMMAND            CREATED          STATUS          PORTS                                                        NAMES
efa32a26eb2e   testcontainers/ryuk:0.6.0                  "/bin/ryuk"        6 seconds ago    Up 5 seconds    0.0.0.0:32920->8080/tcp, :::32822->8080/tcp                  reaper_39db6ba506d2d713d174270a8ad7aeb95fcdc7e5e13895ae3be33fa70ade946a
0c8e757faff1   clickhouse/clickhouse-server:24.2-alpine   "/entrypoint.sh"   17 seconds ago   Up 16 seconds   9009/tcp, 0.0.0.0:32919->8123/tcp, 0.0.0.0:32918->9000/tcp   otel-clickhouse
cdc2a4e002e6   testcontainers/ryuk:0.6.0                  "/bin/ryuk"        17 seconds ago   Up 16 seconds   0.0.0.0:32917->8080/tcp, :::32821->8080/tcp                  reaper_ae0426fe5786540b6ee4155474bd2ccf2d21bbe9dd1e0134154826996e67fd9b

Additional information

No response

@Alviner
Contributor

Alviner commented Mar 22, 2024

As far as I know, the reaper grabs containers by labels, where the session ID is already stored. Could you please attach the labels for your case?

@mdonkers
Contributor Author

Yes, that is correct. The Reaper will kill the container based on the matching SessionID label.
The problem, however, is that the 'reused' container is selected only by name; the SessionID label is not considered. And so a second Reaper is created, but the previously running 'reused' container is picked up.

Containers created on first test run:

$ docker container ls --format "table {{.ID}}\t{{.Image}}\t{{.Names}}\t{{.Labels}}"
CONTAINER ID   IMAGE                                      NAMES                                                                     LABELS
bb1591f66ec7   clickhouse/clickhouse-server:24.2-alpine   otel-clickhouse                                                           org.testcontainers.version=0.29.1,build-url=https://github.com/ClickHouse/ClickHouse/actions/runs/8292363504,com.clickhouse.build.githash=9293d361e72be9f6ccfd444d504e2137b2e837cf,org.testcontainers=true,org.testcontainers.lang=go,org.testcontainers.sessionId=aff9f8eec6c8b2cb10a7ced14215192ac300148e6eb93d3902805c904b68dec3
824e4c0cc1ec   testcontainers/ryuk:0.6.0                  reaper_aff9f8eec6c8b2cb10a7ced14215192ac300148e6eb93d3902805c904b68dec3   org.testcontainers.sessionId=aff9f8eec6c8b2cb10a7ced14215192ac300148e6eb93d3902805c904b68dec3,org.testcontainers.version=0.29.1,org.testcontainers=true,org.testcontainers.lang=go,org.testcontainers.reaper=true,org.testcontainers.ryuk=true

Now I'm running the tests a second time; see the output, where the ClickHouse container with ID bb1591f66ec7 is initially picked up again:

2024/03/22 23:04:40 🐳 Creating container for image testcontainers/ryuk:0.6.0
2024/03/22 23:04:40 ✅ Container created: f59540ea88b6
2024/03/22 23:04:40 🐳 Starting container: f59540ea88b6
2024/03/22 23:04:41 ✅ Container started: f59540ea88b6
2024/03/22 23:04:41 🚧 Waiting for container id f59540ea88b6 image: testcontainers/ryuk:0.6.0. Waiting for: &{Port:8080/tcp timeout:<nil> PollInterval:100ms}
2024/03/22 23:04:41 🔔 Container is ready: f59540ea88b6
2024/03/22 23:04:41 ✅ Container started: bb1591f66ec7
2024/03/22 23:04:41 🚧 Waiting for container id bb1591f66ec7 image: clickhouse/clickhouse-server:24.2-alpine. Waiting for: &{timeout:0xc000ca6038 URL:0x1335700 Driver:clickhouse Port:9000/tcp startupTimeout:60000000000 PollInterval:100ms query:SELECT 1}
2024/03/22 23:04:41 🔔 Container is ready: bb1591f66ec7
[clickhouse][conn=1][127.0.0.1:33094][handshake] ->  0.0.0

At some point during the test, the previous Reaper kills the existing ClickHouse container, as you can see from the mapped port returning EOF, and a new container is created:

2024/03/22 23:04:42 🔔 Container is ready: bb1591f66ec7
2024/03/22 23:04:42 ✅ Container started: bb1591f66ec7
2024/03/22 23:04:42 🚧 Waiting for container id bb1591f66ec7 image: clickhouse/clickhouse-server:24.2-alpine. Waiting for: &{timeout:0xc00074c0a8 URL:0x1335700 Driver:clickhouse Port:9000/tcp startupTimeout:60000000000 PollInterval:100ms query:SELECT 1}
    testutils.go:74: Post "http://localhost:33095?query=INSERT+INTO+<table>+FORMAT+JSON": EOF
2024/03/22 23:04:42 failed accessing container logs: Error response from daemon: can not get logs from container which is dead or marked for removal
    testcontainerprovider.go:43: unexpected container status "removing": failed to create container
2024/03/22 23:04:42 🐳 Creating container for image clickhouse/clickhouse-server:24.2-alpine
2024/03/22 23:04:42 ✅ Container created: eb9537fc1ef1
2024/03/22 23:04:42 🐳 Starting container: eb9537fc1ef1
2024/03/22 23:04:42 ✅ Container started: eb9537fc1ef1
2024/03/22 23:04:42 🚧 Waiting for container id eb9537fc1ef1 image: clickhouse/clickhouse-server:24.2-alpine. Waiting for: &{timeout:0xc000ca6068 URL:0x1335700 Driver:clickhouse Port:9000/tcp startupTimeout:60000000000 PollInterval:100ms query:SELECT 1}
2024/03/22 23:04:43 🔔 Container is ready: eb9537fc1ef1
[clickhouse][conn=1][127.0.0.1:33097][handshake] ->  0.0.0

A new container is created, but the killed container has already caused some tests to fail.

@codefromthecrypt
Contributor

I'm not sure if it is related, but I ran into an issue when I moved tests into different packages (so a different TestMain to set up testcontainers k3s). After one package finishes, the next one dies unless I disable ryuk:

2024/03/24 20:06:23 🔥 Reaper obtained from Docker for this test session 0c4d803193470acd6fc5019c9a96e32c3a1b34a7719b5d77cad86acc3d3bf9ba
2024/03/24 20:06:23 skipping due to docker error: dial tcp [::1]:32777: connect: connection refused: Connecting to Ryuk on localhost:32777 failed: connecting to reaper failed: failed to create container

@mdonkers
Contributor Author

I'm not sure if it is related, but I ran into an issue when I moved tests into different packages (so a different TestMain to set up testcontainers k3s). After one package finishes, the next one dies unless I disable ryuk:

2024/03/24 20:06:23 🔥 Reaper obtained from Docker for this test session 0c4d803193470acd6fc5019c9a96e32c3a1b34a7719b5d77cad86acc3d3bf9ba
2024/03/24 20:06:23 skipping due to docker error: dial tcp [::1]:32777: connect: connection refused: Connecting to Ryuk on localhost:32777 failed: connecting to reaper failed: failed to create container

I think that could indeed be related. Two packages run with two different SessionIDs (during the test you should see multiple Ryuk / Reaper instances created). The 'reused' container will be killed when one of the Reapers shuts down, once its tests have finished and it has timed out. And because the other test package connects to a different Reaper, nothing prevents the 'other' Reaper from shutting down and taking the reused container with it.

So either the Reaper should be fully reused as well, with the same semantics (e.g. the SessionID not being part of the container name), or the 'reused' container needs to be scoped to the session / package as well.

Now I could see both making sense from a user perspective, so it has to be a config option somehow.

@cfstras
Contributor

cfstras commented Apr 8, 2024

I think I'm being hit by the same problem: I have two separate packages using the same ContainerRequest definition (from a util package) to start a database.
In about 50% of cases, the tests fail like this:

🔥 Reaper obtained from Docker for this test session 2a922fde398d6fe9e8f12ca45ef893b6ef3c42c41eacbd4abb2c2ce2c0590ea5
    db.go:57:
                Error Trace:    /.../internal/testutil/db.go:57
                                                        /opt/homebrew/Cellar/go/1.22.2/libexec/src/sync/once.go:74
                                                        /opt/homebrew/Cellar/go/1.22.2/libexec/src/sync/once.go:65
                                                        /.../internal/testutil/db.go:41
                                                        /.../internal/db/migrations_test.go:22
                Error:          Received unexpected error:
                                dial tcp [::1]:32829: connect: connection refused: Connecting to Ryuk on localhost:32829 failed: connecting to reaper failed: failed to create container
                Test:           TestMigrate
                Messages:       Could not start postgres container

As a workaround, I'm setting TESTCONTAINERS_RYUK_DISABLED=true, which means I have to be careful about cleanup, but at least the tests work.
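
For reference, a sketch of how that workaround can be applied from the test binary itself; the package name is just for illustration, and setting the variable in the shell or CI config works just as well:

package testutil_test

import (
	"os"
	"testing"
)

func TestMain(m *testing.M) {
	// Disable Ryuk for this test binary. Without the reaper, leftover
	// containers have to be cleaned up manually (e.g. docker rm -f).
	os.Setenv("TESTCONTAINERS_RYUK_DISABLED", "true")
	os.Exit(m.Run())
}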

@stevenh
Contributor

stevenh commented Jul 13, 2024

This should be fixed by testcontainers/moby-ryuk#121, which needs a release and an update to testcontainers-go to use the new image.

Try cloning the moby-ryuk repo and running the following in it to replace the image that testcontainers-go uses, and see if that fixes it:

docker build -f linux/Dockerfile -t testcontainers/ryuk:0.7.0 .

@mdonkers
Contributor Author

Hi @stevenh,
I'm afraid that fix won't address this specific issue. See my original post, specifically:

For Ryuk however, a new Container is created based on the SessionID

The problem is that the 'old' Ryuk instance doesn't get any connection from testcontainers on the second run, and brings down any reused containers with it.
To get this issue resolved, a fix is needed on the testcontainers side, not in Ryuk itself.

(For completeness, I did try your instructions above, but running the integration tests in quick succession still makes them fail the second time.)

@mdelapenya
Collaborator

Hi all, sorry for not getting back to this issue sooner; for some reason I was looking at other ones. My fault.

Two packages run with two different SessionIDs

This is not correct. The SessionID is derived from the parent process ID of the test process, which for each package and subpackage is the original go test invocation. So packages foo, bar and foo/baaz will all three share the same parent process ID and therefore the same SessionID.

I'd also like to confirm that this issue is hit when reusing containers, is that correct?

When starting a new test run, because containers are reused based on their name, testcontainers finds the 'old', still-running ClickHouse

I think we should mark the Reuse mode as experimental, as we are working on a more comprehensive reuse mode that will work across all testcontainers libs. Therefore, the most honest thing we can do with the current Reuse mode is to mark it as Deprecated. But I'd like to hear your feedback on this.

@mdonkers
Contributor Author

@mdelapenya no rush, and thanks for any input you can give.

Re:

I'd also like to confirm that this issue is hit when reusing containers, is that correct?

Yes, as far as I'm able to reproduce and understand the issue it's only hit when reusing containers.

I think we should mark the Reuse mode as experimental, as we are working on a more comprehensive reuse mode that will work across all testcontainers libs. Therefore, the most honest thing we can do with the current Reuse mode is to mark it as Deprecated. But I'd like to hear your feedback on this.

I would very much dislike that, because it makes a major difference in the run time of our integration tests. And not knowing when a more comprehensive reuse mode would land (or whether this will even remain a 'free' feature?), I would very much like to keep using this feature.

And as I stated in my original post I think the solution could be fairly simple:

To fix this, my proposal would be to also add the SessionID to the 'reusable' container name, either implicitly, via some flag, or by exposing the SessionID so the user can add it themselves (currently it is part of the internal package and so not reachable).

I'm even willing to work on a fix for that, given some guidance on a preferred solution.

@mdelapenya
Collaborator

mdelapenya commented Jul 16, 2024

@mdonkers thanks for your feedback. I'm sympathetic to your use case and appreciate your input. If the feature is being used, we maintainers must take care of that use case. As a last resort, and I'm not proposing anything yet, just listing alternatives, we could work with you to refine the test suite so that it no longer relies on reusing the container, but in any case we must support your current setup, as promised by a public API.

For the reuse use case, because the container name is fixed by each user, adding the session ID to the container name and offering a way to retrieve it from the container (there are public methods in the container interface, or Inspect could be used) would be enough, right? That would be a very simple solution, narrowed to the reuse use case. If you want to work on that, I'll be more than happy to review it asap.
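
To make the retrieval side concrete, here is a rough sketch under that proposal, assuming the library appends the session ID to the requested name (the suffixing itself is the part that still needs to be implemented):

package integration

import (
	"context"
	"fmt"

	"github.com/testcontainers/testcontainers-go"
)

// actualName returns the name the container actually received, which under
// the proposal would be the requested name plus a session-scoped suffix.
// Note that the Docker API reports the name with a leading "/".
func actualName(ctx context.Context, ctr testcontainers.Container) (string, error) {
	info, err := ctr.Inspect(ctx)
	if err != nil {
		return "", fmt.Errorf("inspect container: %w", err)
	}
	return info.Name, nil
}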

@mdonkers
Contributor Author

..., we could work with you to refine the test suite so that it no longer relies on reusing the container, ...

Unfortunately, we have created a rather custom testing setup resembling a BDD structure, which would not allow us to restructure the tests in such a way that the container could otherwise be reused.
And dropping the reuse would make an integration test run jump from ~10 seconds to 2-3 minutes.

For the reuse use case, because the container name is fixed by each user, adding the session ID to the container name and offering a way to retrieve it from the container (there are public methods in the container interface, or Inspect could be used) would be enough, right? That would be a very simple solution, narrowed to the reuse use case. If you want to work on that, I'll be more than happy to review it asap.

Yes, that would work. I'll be happy to look into this. Hopefully I can manage it this week, but in any case I'll ping you when I have a working PR.

@mdelapenya
Collaborator

Just curious: what technology is reused in that container? A database? Postgres?

@mdelapenya
Collaborator

@mdonkers I've thought about this more in depth: if we append the sessionID to the container name, then two different test sessions (two terminals) will see different container names. Is that what you expect?

@stevenh
Contributor

stevenh commented Aug 8, 2024

This should be fixed by a combination of testcontainers/moby-ryuk#141 and the reaper changes in #2664
