
Gracefully stop blueprint containers before committing them #363

Merged · 3 commits · Apr 27, 2022

Conversation

reivilibre (Contributor)

This PR arises from a conversation with @richvdh, who was presumably working on getting Synapse tested in Complement with Postgres and workers.

The blueprint container gets paused and then committed, which leads to the Postgres database being corrupted in the image that gets created. This then leads to slow startup for that container, since Postgres has to do some kind of recovery process.

It seems like it would be better to gracefully stop the container before committing it, so that Postgres doesn't get corrupted in the process.
Docker stops a container by SIGTERMing it (so the process inside the container can theoretically react to that and shut down gracefully), then SIGKILLing if it didn't stop after a timeout.

This PR will use the equivalent of docker stop to shut down the container before committing it. A well-written container could then react to the SIGTERM and shut down the database gracefully.

I'm not sure the container image in question is set up to handle SIGTERM properly, though — the container seems to be taking 30 s (i.e. the full timeout) to stop, so I presume it's still getting killed forcefully.
At least this gives us the option to implement that properly...

I used `COMPLEMENT_DEBUG=1 WORKERS=1 COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=$(pwd)/../complement ./scripts-dev/complement.sh -run TestInboundFederationKeys 2>&1 | tee log` in a Synapse checkout, as Rich suggested.

@richvdh (Member) commented Apr 14, 2022

> I'm not sure the container image in question is set up to handle SIGTERM properly, though — the container seems to be taking 30 s (= timeout) to stop so I presume it's still getting killed forcefully.

Yeah, it's absolutely not - we need to rearrange things so that postgres is run by supervisord, so that the SIGTERM gets propagated.
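For context on what that rearrangement might look like: when supervisord runs as PID 1, it receives the SIGTERM from `docker stop` and forwards a configurable stop signal to each managed program. A hypothetical stanza (the binary path, data directory, and timings below are illustrative, not Synapse's actual config) could be:

```ini
; Hypothetical supervisord stanza: supervisord runs as PID 1, catches the
; SIGTERM from `docker stop`, and forwards a stop signal to postgres.
[program:postgres]
command=/usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/data
user=postgres
; SIGINT triggers postgres's "fast shutdown" mode, which checkpoints cleanly.
stopsignal=INT
; Give postgres time to shut down before supervisord escalates to SIGKILL;
; keep this under docker's own stop timeout.
stopwaitsecs=25
autorestart=false
```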

@richvdh (Member) left a comment

this lgtm, but we should probably ask @kegsay what he thinks.

@richvdh richvdh requested a review from kegsay April 14, 2022 14:02
@kegsay (Member) left a comment

This seems to be causing a large delay when deploying blueprints:

```
federation_room_ban_test.go:16: Deploy times: 1m21.771543808s blueprints, 2.870558152s containers
```

vs

```
federation_room_ban_test.go:16: Deploy times: 32.279497517s blueprints, 8.70062934s containers
```

Likely because the docker container isn't quitting via SIGTERM. Please can we reduce the timeout to something like 10s instead of 30s, to reduce the impact of this whilst people update their Complement images?

```go
	}

	if !containerInfo.State.Running {
		// The container isn't running anyway, so no need to kill it.
```
Member:

We are now stopping the containers so I don't see how containerInfo.State.Running can ever be true?

Contributor (Author):

I don't have a great deal of Go experience, but I view this defer as a last-resort clean-up mechanism, like a finally block in Java etc. Anecdotally, I've seen docker stop leave a container running before, though frankly that may well have been a bug (I'm not sure). I'm happy to remove it if you think that's best, but it really was just intended as a last resort: stopping gracefully is nice, but littering the host with containers is to be avoided if at all possible.

Member:

If you've seen this in the wild then let's leave it in. Docker is async command-wise (you'll note several places where we repeat the same instruction until timeout/success), so I can believe it not stopping the container when you ask it to stop.
