How do we implement resilient feature flags #13601
Re: caching, I'm thinking of introducing post-save/update hooks which ensure that when a flag is updated, the cache is updated as well. These can sometimes fail, which leads to the cache being out of date, which isn't great. We could introduce a TTL for this, but then the new problem becomes: if the TTL is too low (say, 5 minutes), the DB going down means we're going to go down anyway, as the cached information will be lost; if it's too high, chances are things will be stale for a while 🤔. It's probably better to have this in the …

Making some constraints explicit: do we need TTLs at all then? Not really, since there's no big risk of things going out of sync.
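As a rough sketch of what I mean by the hooks (assuming a Django `FeatureFlag` model with `team_id`/`deleted` fields, a shared Redis client, and a made-up cache key; not necessarily what #13708 ends up doing):

```python
import json

from django.db.models.signals import post_save
from django.dispatch import receiver

from posthog.models import FeatureFlag  # assumed model path
from posthog.redis import redis_client  # assumed shared Redis client


@receiver(post_save, sender=FeatureFlag)
def update_flag_cache(sender, instance, **kwargs):
    """On every flag save/update, re-cache all flag definitions for that team."""
    team_id = instance.team_id
    flags = list(
        FeatureFlag.objects.filter(team_id=team_id, deleted=False).values(
            "key", "filters", "active"
        )
    )
    # No TTL: the entry is only ever overwritten by a newer write, so it can't
    # expire into an empty cache while Postgres is down.
    redis_client.set(f"team_feature_flags:{team_id}", json.dumps(flags))
```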
Regarding size limits, the current feature flag table (which will effectively be cached) is less than 3 MB in size, so we're good here size-wise for a long time. The project token to team ID mapping is even smaller.
Have we considered caching outside of our main app deployment? I.e., in the case that our app is totally down (LB failure/misconfig, dodgy deploy, uncaught logic problem, or otherwise), customers can still resolve flags for the last state they were in. There's definitely a bunch we could do within AWS that could make this incredibly resilient.
Great idea! Haven't yet, but I expect changing Redis servers to be a lot more plug-and-play once the basic code is in place (correct me if I'm wrong!). At that stage, would love some support from infra in making this more robust.
Ah, wait, no: if the entire app deployment is down, isn't this effectively having a second app deployment? Since we can't/don't want to cache responses, only the flag definitions.
Do you think updating the Flutter SDK to support normal flags and the new decide endpoint can be part of these tasks?
A PR is already out for that; it should be going in soon 👀
The what & why: PostHog/meta#74
The how: this issue. Going deep into exactly how we'll implement this and reduce the risk of everything blowing up. This touches a very sensitive code path, so making sure there's zero downtime is important.
This issue seeks to clarify for everyone how we'll get there (and for me to think through how to do it).
Broadly, the things we need to do are:
Introduce caching on decide for two things: (1) project token to team ID, and (2) team ID to feature flag definitions: feat(flags): Enable caching for resilient responses #13708
Figure out how to update caches and when to invalidate. Open question: how do we ensure caches are always populated? It's going to be annoying if the cache isn't populated and Postgres goes down, leaving us to die.
Figure out the code paths: do we always default to the cache first, or keep the cache just as a backup? This depends partially on the above and the guarantees we have on the cache (see the sketch right after this list).
Figure out the semantics of 'best-effort flag calculation': given that Postgres is down, which flags do we want to calculate and how will this work?
Update client libraries to use the new decide response and update only the flags sent by decide, keeping the old ones as-is, unless there were no errors during computation, in which case replace all flags (sketched at the end of this issue): feat(feature-flags): Allow upserting decide v3 responses posthog-js-lite#53
… And don't obliterate flags on decide 500 responses: feat(feature-flags): Allow upserting decide v3 responses posthog-js-lite#53
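To make the "cache as a backup" option concrete, here's a minimal sketch of what that code path could look like. The helpers (`load_team_from_db`, `load_flags_from_db`, `compute_flags`), the Redis key names, and the exact response shape are hypothetical placeholders, not the real decide implementation:

```python
import json
from typing import Callable


def get_flags_for_decide(
    project_token: str,
    distinct_id: str,
    redis_client,
    load_team_from_db: Callable[[str], int],    # hypothetical Postgres lookup
    load_flags_from_db: Callable[[int], list],  # hypothetical Postgres lookup
    compute_flags: Callable[[list, str], dict], # hypothetical flag matcher
) -> dict:
    """Cache-as-backup: Postgres is the source of truth; fall back to Redis on failure."""
    errors = False

    # (1) project token -> team ID
    try:
        team_id = load_team_from_db(project_token)
        redis_client.set(f"token_to_team:{project_token}", team_id)
    except Exception:
        cached = redis_client.get(f"token_to_team:{project_token}")
        if cached is None:
            raise  # no DB and no cache: nothing we can do
        team_id, errors = int(cached), True

    # (2) team ID -> feature flag definitions
    try:
        definitions = load_flags_from_db(team_id)
        redis_client.set(f"team_feature_flags:{team_id}", json.dumps(definitions))
    except Exception:
        cached = redis_client.get(f"team_feature_flags:{team_id}")
        definitions, errors = (json.loads(cached) if cached else []), True

    # Best effort: flags that only need the distinct_id (e.g. rollout
    # percentage hashing) still work; flags needing DB-backed person
    # properties or cohorts get skipped while Postgres is down.
    return {
        "featureFlags": compute_flags(definitions, distinct_id),
        "errorsWhileComputingFlags": errors,
    }
```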
To reduce risk, it makes sense to break down the server changes into discrete, independent parts.
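On the client side, the upsert behaviour from the list above could look roughly like this, sketched in Python for brevity (the linked PR is for posthog-js-lite, so the real change is in TypeScript; the field names are assumptions based on the decide v3 shape discussed above):

```python
def merge_decide_response(existing_flags: dict, decide_response: dict | None) -> dict:
    """Upsert flags from decide instead of blindly replacing them."""
    if decide_response is None:
        # decide returned a 500 / the request failed: keep what we have
        # instead of obliterating all known flags.
        return existing_flags

    new_flags = decide_response.get("featureFlags", {})

    if decide_response.get("errorsWhileComputingFlags"):
        # Partial computation: update only the flags the server managed to
        # compute, keep previously known values for the rest.
        return {**existing_flags, **new_flags}

    # Clean computation: replace everything so deleted flags actually go away.
    return new_flags
```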