Decide deployment services check #714

Closed · wants to merge 1 commit

Conversation

@EDsCODE (Member) commented Mar 24, 2023

Initial draft changes so that discussion can happen here.

Description

Based on https://posthog.slack.com/archives/C05034MPYA0/p1679596107221869, the decide pods still rely on postgres because of the startup checks. This means that if postgres is down, the decide pods will be down and error out regardless of caching.

Discussion points

  • Possibly split up the wait checks (postgres vs. the rest of the services) so the check scripts aren't repeated for each deployment.
  • If possible, I think the decide deployment should wait briefly for postgres and migrations and time out after X amount of time (see the sketch after this list). I'm not sure of the exact mechanics here, but we don't want the decide deployment to never connect to postgres/pgbouncer.
  • The main point is that postgres isn't mission-critical for the decide deployment to operate, since it can use a populated redis cache, so the deployment should not be blocked on postgres being up.
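A minimal sketch of the bounded-wait idea from the second point above, written as an init container. The container name, image, env var names, and the 60-second cutoff are placeholders for discussion, not existing chart values:

```yaml
# Hypothetical sketch only: wait for postgres, but give up after a bounded
# time so the decide pods can still come up and serve from the redis cache.
- name: wait-for-postgres-with-timeout   # placeholder name
  image: postgres:15-alpine              # placeholder image that ships pg_isready
  command:
    - /bin/sh
    - -c
    - |
      elapsed=0
      until pg_isready -h "$POSTGRES_HOST" -p "$POSTGRES_PORT" -q; do
        if [ "$elapsed" -ge 60 ]; then
          echo "postgres not reachable after 60s, starting without it"
          break
        fi
        sleep 2
        elapsed=$((elapsed + 2))
      done
```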

- /bin/sh
- -c
- >
{{ if .Values.clickhouse.enabled }}
Contributor
Don't need Clickhouse either?

@neilkakkar (Contributor)

> If possible, I think the decide deployment should wait briefly for postgres and migrations and time out after X amount of time. Not sure the exact mechanics here but we don't want decide deployment to never connect to postgres/pgbouncer.

This just checks the current status of services to determine if the pod can come up, so it doesn't mean we'll never connect to postgres/pgbouncer.

I think instead we'd want something like an "OR" check between redis and postgres? If either is up, we're good to go.
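For illustration, a rough sketch of what that OR check might look like as an init container; the busybox image, the nc-based reachability test, and the HOST/PORT env vars are assumptions for the sake of the example, not the chart's existing wait snippets:

```yaml
# Hypothetical sketch: succeed as soon as either backend is reachable, so a
# warm redis cache is enough to let the decide pod start while postgres is down.
- name: wait-for-postgres-or-redis   # placeholder name
  image: busybox                     # placeholder image providing nc
  command:
    - /bin/sh
    - -c
    - |
      until nc -z -w 2 "$POSTGRES_HOST" "$POSTGRES_PORT" || nc -z -w 2 "$REDIS_HOST" "$REDIS_PORT"; do
        echo "neither postgres nor redis is reachable yet, retrying"
        sleep 2
      done
      echo "at least one of postgres/redis is reachable, starting"
```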

@neilkakkar (Contributor)

> split up wait checks (postgres // rest of the services) so there's not repeated check scripts for each deployment

Good idea. I'm not clear enough on the details here; I'd want @PostHog/team-infra to give their input 😅

@tiina303 (Contributor)

Related: I looked at the readiness, startup, and liveness probes (a rough sketch of their shape follows):

  1. readiness & startup are both _readyz?role=decide (here), which doesn't check PG but does require the cache (here); not sure we'd want that in a Redis-down-but-PG-available situation?
  2. the liveness check is here, just verifying the service is fine 👍
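For reference, the probes described above have roughly this shape in the deployment spec; only the _readyz?role=decide path comes from the comment, while the port, timings, and liveness path below are illustrative placeholders:

```yaml
# Illustrative shape only; port, timings, and the liveness path are placeholders.
readinessProbe:
  httpGet:
    path: /_readyz?role=decide
    port: 8000
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /_readyz?role=decide
    port: 8000
  failureThreshold: 30
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /_health   # placeholder path; the real check just verifies the service is up
    port: 8000
  periodSeconds: 10
```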

@tiina303 requested a review from a team on March 24, 2023 at 13:50
@@ -170,6 +170,5 @@ spec:
securityContext: {{- omit .Values.decide.securityContext "enabled" | toYaml | nindent 12 }}
{{- end }}
initContainers:
{{- include "_snippet-initContainers-wait-for-service-dependencies" . | indent 8 }}
{{- include "_snippet-initContainers-wait-for-migrations" . | indent 8 }}
Contributor
Decide still makes queries to PostgreSQL, right? So we still need to wait for the migrations to have been applied?

@tiina303 (Contributor) commented Mar 24, 2023
I think I might have misunderstood earlier & misled the FF team ... this would only apply on new container setup, not on restart? If that's true, and considering we do rolling deploys, this is fine to keep as is. Sorry.

Contributor
Yeah, if you don't include the migration check, you'll end up with a SELECT all, the, fields, including, ones, that, don't, exist, yet FROM posthog_persons, which will error and, if we're lucky, fall back to redis, or just straight error out. Either way, I don't think that's desired.

Contributor
just confirmed that init containers only run when the pod is created

pods never restart, only the containers within them

Contributor
OK, then we should investigate why FF went down when PG was unavailable (that was the initial hypothesis iirc, and based on what Frank said it's not the case).
