
Gracefully stop blueprint containers before committing them #363

Merged · 3 commits · Apr 27, 2022

Conversation

reivilibre (Contributor)

This PR arises from a conversation with @richvdh, who was presumably working on getting Synapse tested in Complement with Postgres and workers.

The blueprint container gets paused and then committed, which leads to the Postgres database being corrupted in the image that gets created. This then leads to slow startup for that container, since Postgres has to do some kind of recovery process.

It seems like it would be better to gracefully stop the container before committing it, so that Postgres doesn't get corrupted in the process.
Docker stops a container by SIGTERMing it (so the process inside the container can theoretically react to that and shut down gracefully), then SIGKILLing if it didn't stop after a timeout.

This PR will use the equivalent of docker stop to shut down the container before committing it. A well-written container could then react to the SIGTERM and shut down the database gracefully.

I'm not sure the container image in question is set up to handle SIGTERM properly, though — the container seems to be taking 30 s (i.e. the full timeout) to stop, so I presume it's still getting killed forcefully.
At least this gives us the option to implement that properly...

I used `COMPLEMENT_DEBUG=1 WORKERS=1 COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=$(pwd)/../complement ./scripts-dev/complement.sh -run TestInboundFederationKeys 2>&1 | tee log` in a Synapse checkout, as Rich suggested.

@richvdh (Member) commented Apr 14, 2022

> I'm not sure the container image in question is set up to handle SIGTERM properly, though — the container seems to be taking 30 s (= timeout) to stop so I presume it's still getting killed forcefully.

Yeah, it's absolutely not - we need to rearrange things so that postgres is run by supervisord, so that the SIGTERM gets propagated.
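For context on what that rearrangement might look like: when supervisord runs as PID 1, it receives the SIGTERM from `docker stop` and forwards a configurable stop signal to each managed program. A hypothetical stanza (the binary path, data directory, and timings below are illustrative, not Synapse's actual config) could be:

```ini
; Hypothetical supervisord stanza: supervisord runs as PID 1, catches the
; SIGTERM from `docker stop`, and forwards a stop signal to postgres.
[program:postgres]
command=/usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/data
user=postgres
; SIGINT triggers postgres's "fast shutdown" mode, which checkpoints cleanly.
stopsignal=INT
; Give postgres time to shut down before supervisord escalates to SIGKILL;
; keep this under docker's own stop timeout.
stopwaitsecs=25
autorestart=false
```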

@richvdh (Member) left a comment

this lgtm, but we should probably ask @kegsay what he thinks.

@richvdh richvdh requested a review from kegsay April 14, 2022 14:02
@kegsay (Member) left a comment

This seems to be causing a large delay when deploying blueprints:

```
federation_room_ban_test.go:16: Deploy times: 1m21.771543808s blueprints, 2.870558152s containers
```

vs

```
federation_room_ban_test.go:16: Deploy times: 32.279497517s blueprints, 8.70062934s containers
```

Likely because the docker container isn't quitting via SIGTERM. Please can we reduce the timeout to something like 10s instead of 30s, to reduce the impact of this whilst people update their Complement images?

```go
	}

	if !containerInfo.State.Running {
		// The container isn't running anyway, so no need to kill it.
```
Member:

We are now stopping the containers so I don't see how containerInfo.State.Running can ever be true?

Contributor (Author):

I don't have a great deal of Go experience, but I view this defer as a last-resort clean-up mechanism, like a finally block in Java etc. Anecdotally, I've seen docker stop leave a container running before, though frankly that may well have been a bug (I'm not sure). I'm happy to remove it if you think that's best, but it really was just intended as a last resort: stopping gracefully is nice, but littering the host with containers is to be avoided if at all possible.

Member:

If you've seen this in the wild then let's leave it in. Docker is async command-wise (you'll note several places where we repeat the same instruction until timeout/success), so I can believe it not stopping the container when you ask it to stop.
