Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alter container startup to wait for container healthcheck to OK. #236

Conversation

michaelkaye
Copy link
Contributor

Wait for healthcheck (if it exists) to return OK before checking version endpoint; In all cases, ensure the version endpoint is available before starting.

Users with an image containing a healthcheck they do not want to use can make a tweak to their container by adding

HEALTHCHECK NONE

to their docker file.

In all cases, ensure the version endpoint is available before starting.
This seems to improve stability of complement when waiting for
a long time and multiple containers to start.

Healthchecks normally poll at a rate of more than 1s so this doesn't
greatly affect the time to start.
Copy link
Member

@kegsay kegsay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is for more complex containers that want to say, wait for a postgres to be up and listening first, which may race with 200 OKing the /versions endpoint which can just return static data.

It's questionable whether this should be Complement's job to wait for both healthchecks and /versions but given it's an optional extra and isn't too much more complexity, this looks like a good addition to me and would definitely aid HS devs when writing valid Complement images. LGTM

@kegsay
Copy link
Member

kegsay commented Nov 26, 2021

Please update https://github.com/matrix-org/complement#image-requirements mentioning the HEALTHCHECK.

@michaelkaye
Copy link
Contributor Author

It's questionable whether this should be Complement's job to wait for both healthchecks and /versions but given it's an optional extra and isn't too much more complexity, this looks like a good addition to me and would definitely aid HS devs when writing valid Complement images. LGTM

In my head, the healthcheck is waiting for (whatever the container is) to be ready; the /versions check is for "is this actually now being a matrix homeserver"; I was wondering if we needed any overrides or options for this.

Please update https://github.com/matrix-org/complement#image-requirements mentioning the HEALTHCHECK.

Will do.

Is it a problem that the Tests / complement isn't passing; i've been trying to kick it various ways to work?

@kegsay
Copy link
Member

kegsay commented Nov 27, 2021

Yes it's a problem. We don't normally see synapse tests timing out after 10min. It should run on 3mins.

@kegsay
Copy link
Member

kegsay commented Dec 1, 2021

There's definitely something wrong here, CI seems unhappy even on repeat runs. My guess is that the health check never passes or something?

@kegsay
Copy link
Member

kegsay commented Dec 9, 2021

This cannot merge until matrix-org/synapse#8780 is fixed unfortunately.

@kegsay kegsay added the blocked label Jan 6, 2022
@michaelkaye
Copy link
Contributor Author

I've just had some further investigation here. The containers we launch are synapse without workers - so the recent changes to enable healthchecks for the synapse workers aren't at fault.

I don't believe that complement runs synapse on a non-default port, and the cURL runs inside the container, so I don't believe that could be the issue.

Running this on my personal desktop seems to pass the test just fine, so I'm wondering if there's some limitation in github CI that is causing this; I'm reducing the delays associated with the healthcheck to see if that helps.

@michaelkaye
Copy link
Contributor Author

Apologies for the commit noise - am reverting my tweak to see if that breaks the build and the fix is a real fix.

@michaelkaye
Copy link
Contributor Author

I should mention that after edeb928 was committed, the synapse tests ran in about 3min

Having reverted in 0dfa0bf , the tests have failed again.

I think the tweak is probably valid; I don't want to apply it in the upstream synapse healthcheck configuration, as for production-type purposes, checking 15s after a restart is a good policy; but for the rapid-fire complement usecase, it feels like a reduced delay is the right thing. I wonder if this is causing some issue like a resource exhaustion of some kind - maybe too many socket opens or something; which will happen because each test takes 15s to start, so there are 30 attempts to check healthcheck (500ms between checks)

Re-applying my fix now, if that's green I suggest we should merge.

@michaelkaye
Copy link
Contributor Author

The conflict is quite big; remade this PR on another branch, see #293

@michaelkaye michaelkaye deleted the michaelk/wait_for_container_healthcheck branch March 31, 2022 13:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants