-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Database actions should be tolerant of cache (redis or memcache) connection failures #557
Comments
Some things to think about before implementing: Are there cache invalidation implications for fault tolerance here? Is there any way to be totally safe about cache invalidation in the event the database call succeeds but the cache call fails? (For that matter, are we 100% robust to cache invalidation issues even with the current behavior, given the error in some cases will be raised after the database call is complete?) |
@andrewsg Yeah, invalidation can be a problem with writes. If a write succeeds in Datastore but then fails in the cache, you can have other clients pulling stale data from the cache. We might need different policies for read/write. Here's some out loud thinking. Reads are easy. A failure during a read can be translated to a cache miss without any real problems. Easy case. A default to warn and move on is reasonable For writes, we could swap the order around so we attempt cache invalidation before writing to Datastore. If the write to Datastore fails, this isn't a problem, clients will just get the stored value from Datastore and repopluate the cache. If the write to the cache fails, we have an opportunity to decide 1) to abort the write, so that the cache state will reflect the database state, or 2) write to Datastore anyway, if clients retrieving stale data is tolerable (depends on the application). There really is an unavoidable trade off here: data integrity or application stability? Let's say we go with 2). Adding retry functionality for the cache invalidation can help us avoid the the problem altogether if the problem is transient, so we should probably do that. If we exhaust retries, then there is probably a more sustained outage occurring that is also affecting reads, which is good news, because it means clients aren't getting stale data from the cache. If we take the extra step to set a flag on the Cache when there is a connection error that tells it to the clear the cache next time it tries anything, we should be able to flush any stale data fairly quickly after the cache comes back online. This can't guarantee non-stale data for every request, but it can make instances of stale data rare and short lived. |
* feat: fault tolerance for global caches Closes #557 * Fix spelling.
To make database calls more robust, we should be tolerant of cache connection issues and issue a warning instead of a fatal error; a flag on the cache configuration object could toggle this behavior for users who desire the strict behavior.
The text was updated successfully, but these errors were encountered: