How do we implement resilient feature flags #13601
Re: caching, I'm thinking of introducing post-save/update hooks which ensure that when a flag is updated, the cache is updated as well. These can sometimes fail, which leads to the cache being out of date, which isn't great. We could introduce a TTL for this, but then the new problem becomes: if the TTL is too low (say, 5 minutes), the DB going down means we're going to go down anyway, as the cached information will be lost; if it's too high, chances are things will be stale for a while 🤔. It's probably better to have this in the …

Making some constraints explicit: do we need TTLs at all then? Not really, since there's no big risk of things going out of sync.
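As a rough sketch of what I mean by the hooks (assuming a Django `FeatureFlag` model with `team_id`/`deleted` fields, a shared Redis client, and a made-up cache key; not necessarily what #13708 ends up doing):

```python
import json

from django.db.models.signals import post_save
from django.dispatch import receiver

from posthog.models import FeatureFlag  # assumed model path
from posthog.redis import redis_client  # assumed shared Redis client


@receiver(post_save, sender=FeatureFlag)
def update_flag_cache(sender, instance, **kwargs):
    """On every flag save/update, re-cache all flag definitions for that team."""
    team_id = instance.team_id
    flags = list(
        FeatureFlag.objects.filter(team_id=team_id, deleted=False).values(
            "key", "filters", "active"
        )
    )
    # No TTL: the entry is only ever overwritten by a newer write, so it can't
    # expire into an empty cache while Postgres is down.
    redis_client.set(f"team_feature_flags:{team_id}", json.dumps(flags))
```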
Regarding size limits, the current feature flag table (which will effectively be cached) is less than 3 MB in size, so we're good here size-wise for a long time. The project token to team ID mapping is even smaller.
Have we considered caching outside of our main app deployment? I.e., in the case that our app is totally down (LB failure/misconfig, dodgy deploy, uncaught logic problem, or otherwise), customers can still resolve flags for the last state they were in. There's definitely a bunch we could do within AWS that could make this incredibly resilient.
Great idea! Haven't yet, but I expect changing Redis servers to be a lot more plug-and-play once the basic code is in place (correct me if I'm wrong!). At that stage, would love some support from infra in making this more robust.
Ah, wait, no: if the entire app deployment is down, isn't this effectively having a second app deployment? Since we can't/don't want to cache responses, only the flag definitions.
Do you think updating the Flutter SDK to support normal flags and the new decide endpoint can be part of these tasks?
A PR is already out for that; it should be going in soon 👀
The what & why: PostHog/meta#74
The how: this issue. Going deep into exactly how we'll implement this and reduce the risk of everything blowing up. This touches a very sensitive code path, so making sure there's zero downtime is important.
This issue seeks to clarify for everyone how we'll get there (and for me to think through how to do it).
Broadly, the things we need to do are:
Introduce caching on decide for two things: (1) project token to team ID, and (2) team ID to feature flag definitions: feat(flags): Enable caching for resilient responses #13708
Figure out how to update caches and when to invalidate. Open question: how do we ensure caches are always populated? It's going to be annoying if the cache isn't populated and Postgres goes down, leaving us to die.
Figure out the code paths: do we always default to the cache first, or keep the cache just as a backup? This depends partially on the above and the guarantees we have on the cache (see the sketch right after this list).
Figure out the semantics of 'best-effort flag calculation': given that Postgres is down, which flags do we want to calculate and how will this work?
Update client libraries to use the new decide response and update only the flags sent by decide, keeping the old ones as-is, unless there were no errors during computation, in which case replace all flags (sketched at the end of this issue): feat(feature-flags): Allow upserting decide v3 responses posthog-js-lite#53
… And don't obliterate flags on decide 500 responses: feat(feature-flags): Allow upserting decide v3 responses posthog-js-lite#53
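To make the "cache as a backup" option concrete, here's a minimal sketch of what that code path could look like. The helpers (`load_team_from_db`, `load_flags_from_db`, `compute_flags`), the Redis key names, and the exact response shape are hypothetical placeholders, not the real decide implementation:

```python
import json
from typing import Callable


def get_flags_for_decide(
    project_token: str,
    distinct_id: str,
    redis_client,
    load_team_from_db: Callable[[str], int],    # hypothetical Postgres lookup
    load_flags_from_db: Callable[[int], list],  # hypothetical Postgres lookup
    compute_flags: Callable[[list, str], dict], # hypothetical flag matcher
) -> dict:
    """Cache-as-backup: Postgres is the source of truth; fall back to Redis on failure."""
    errors = False

    # (1) project token -> team ID
    try:
        team_id = load_team_from_db(project_token)
        redis_client.set(f"token_to_team:{project_token}", team_id)
    except Exception:
        cached = redis_client.get(f"token_to_team:{project_token}")
        if cached is None:
            raise  # no DB and no cache: nothing we can do
        team_id, errors = int(cached), True

    # (2) team ID -> feature flag definitions
    try:
        definitions = load_flags_from_db(team_id)
        redis_client.set(f"team_feature_flags:{team_id}", json.dumps(definitions))
    except Exception:
        cached = redis_client.get(f"team_feature_flags:{team_id}")
        definitions, errors = (json.loads(cached) if cached else []), True

    # Best effort: flags that only need the distinct_id (e.g. rollout
    # percentage hashing) still work; flags needing DB-backed person
    # properties or cohorts get skipped while Postgres is down.
    return {
        "featureFlags": compute_flags(definitions, distinct_id),
        "errorsWhileComputingFlags": errors,
    }
```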
To reduce risk, it makes sense to break down the server changes into discrete, independent parts.
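On the client side, the upsert behaviour from the list above could look roughly like this, sketched in Python for brevity (the linked PR is for posthog-js-lite, so the real change is in TypeScript; the field names are assumptions based on the decide v3 shape discussed above):

```python
def merge_decide_response(existing_flags: dict, decide_response: dict | None) -> dict:
    """Upsert flags from decide instead of blindly replacing them."""
    if decide_response is None:
        # decide returned a 500 / the request failed: keep what we have
        # instead of obliterating all known flags.
        return existing_flags

    new_flags = decide_response.get("featureFlags", {})

    if decide_response.get("errorsWhileComputingFlags"):
        # Partial computation: update only the flags the server managed to
        # compute, keep previously known values for the rest.
        return {**existing_flags, **new_flags}

    # Clean computation: replace everything so deleted flags actually go away.
    return new_flags
```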