Database actions should be tolerant of cache (redis or memcache) connection failures #557

andrewsg · 2020-10-08T21:14:35Z

To make database calls more robust, we should be tolerant of cache connection issues and issue a warning instead of a fatal error; a flag on the cache configuration object could toggle this behavior for users who desire the strict behavior.

andrewsg · 2020-10-08T21:17:08Z

Some things to think about before implementing: Are there cache invalidation implications for fault tolerance here? Is there any way to be totally safe about cache invalidation in the event the database call succeeds but the cache call fails? (For that matter, are we 100% robust to cache invalidation issues even with the current behavior, given the error in some cases will be raised after the database call is complete?)

chrisrossi · 2020-10-09T17:46:04Z

@andrewsg Yeah, invalidation can be a problem with writes. If a write succeeds in Datastore but then fails in the cache, you can have other clients pulling stale data from the cache. We might need different policies for read/write.

Here's some thinking out loud.

Reads are easy. A failure during a read can be translated to a cache miss without any real problems, so defaulting to warn and move on is reasonable.
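A minimal sketch of the read case, not the actual python-ndb code: `CacheConnectionError`, `DownCache`, `lookup`, and the `strict` flag are all hypothetical names made up for illustration, and the datastore is stood in for by a plain dict.

```python
import warnings


class CacheConnectionError(Exception):
    """Hypothetical stand-in for a redis or memcache connection error."""


class DownCache:
    """Hypothetical cache whose backend is unreachable."""

    def get(self, key):
        raise CacheConnectionError("cache unreachable")


def lookup(cache, datastore, key, strict=False):
    """Read through the cache, falling back to the datastore on a miss."""
    try:
        value = cache.get(key)
    except CacheConnectionError as error:
        if strict:
            # Strict behavior: the connection failure stays fatal.
            raise
        # Tolerant behavior: warn and treat the failure as a cache miss.
        warnings.warn(
            f"cache read failed, treating as a miss: {error}", RuntimeWarning
        )
        value = None
    if value is None:
        # The datastore here is just a dict for the purposes of the sketch.
        value = datastore.get(key)
    return value
```

With `strict=False` a dead cache only costs a round trip to the datastore; with `strict=True` the caller sees the connection error as before.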

For writes, we could swap the order around so we attempt cache invalidation before writing to Datastore. If the write to Datastore then fails, this isn't a problem: clients will just get the stored value from Datastore and repopulate the cache. If the call to the cache fails, we have an opportunity to decide either 1) to abort the write, so that the cache state will reflect the database state, or 2) to write to Datastore anyway, if clients retrieving stale data is tolerable (depends on the application).
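The ordering could look roughly like this. It is only a sketch under assumptions: `InMemoryCache`, `put`, and the `strict` flag are hypothetical, and a dict plays the role of Datastore.

```python
class CacheConnectionError(Exception):
    """Hypothetical stand-in for a redis or memcache connection error."""


class InMemoryCache:
    """Hypothetical cache used only to demonstrate the ordering."""

    def __init__(self):
        self.data = {}
        self.down = False  # Flip to True to simulate an outage.

    def delete(self, key):
        if self.down:
            raise CacheConnectionError("cache unreachable")
        self.data.pop(key, None)


def put(datastore, cache, key, value, strict=True):
    """Invalidate the cache entry first, then write to the datastore."""
    try:
        cache.delete(key)
    except CacheConnectionError:
        if strict:
            # Option 1: abort, so the cache never disagrees with the database.
            raise
        # Option 2: carry on and accept that readers may see stale data.
    datastore[key] = value
```

Because the invalidation runs first, a failed Datastore write leaves nothing worse than an empty cache entry that the next read refills.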

There really is an unavoidable trade off here: data integrity or application stability?

Let's say we go with 2). Adding retry functionality for the cache invalidation can help us avoid the problem altogether if the problem is transient, so we should probably do that. If we exhaust retries, then there is probably a more sustained outage occurring that is also affecting reads, which is good news, because it means clients aren't getting stale data from the cache. If we take the extra step of setting a flag on the Cache when there is a connection error, telling it to clear the cache the next time it tries anything, we should be able to flush any stale data fairly quickly after the cache comes back online. This can't guarantee non-stale data for every request, but it can make instances of stale data rare and short-lived.
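Here is one way the retry plus clear-on-reconnect idea might be sketched. Everything in it is an assumption for illustration: `FaultTolerantCache`, `FakeBackend`, the `retries` parameter, and the `_needs_clear` flag are hypothetical names, and a real version would likely add a delay between attempts.

```python
class CacheConnectionError(Exception):
    """Hypothetical stand-in for a redis or memcache connection error."""


class FakeBackend:
    """Hypothetical in-memory backend that can simulate an outage."""

    def __init__(self):
        self.data = {}
        self.down = False

    def _check(self):
        if self.down:
            raise CacheConnectionError("cache unreachable")

    def get(self, key):
        self._check()
        return self.data.get(key)

    def delete(self, key):
        self._check()
        self.data.pop(key, None)

    def clear(self):
        self._check()
        self.data.clear()


class FaultTolerantCache:
    """Retries cache calls and flushes the cache after an outage."""

    def __init__(self, backend, retries=3):
        self._backend = backend
        self._retries = retries
        self._needs_clear = False

    def _call(self, operation, *args):
        for _ in range(self._retries):
            try:
                if self._needs_clear:
                    # An earlier call gave up, so entries may be stale.
                    self._backend.clear()
                    self._needs_clear = False
                return True, operation(*args)
            except CacheConnectionError:
                continue  # No backoff here, to keep the sketch short.
        # Retries exhausted: remember to flush once the cache is back.
        self._needs_clear = True
        return False, None

    def get(self, key):
        _, value = self._call(self._backend.get, key)
        return value  # None doubles as a cache miss.

    def delete(self, key):
        ok, _ = self._call(self._backend.delete, key)
        return ok
```

If an invalidation is lost during an outage, the first call that reaches the cache afterwards clears it before doing anything else, so the stale entry is gone before it can be served through this client.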


product-auto-label bot added the api: datastore Issues related to the googleapis/python-ndb API. label Oct 8, 2020

yoshi-automation added the triage me I really want to be triaged. label Oct 9, 2020

chrisrossi self-assigned this Oct 9, 2020

yoshi-automation added the 🚨 This issue needs some love. label Oct 13, 2020

chrisrossi pushed a commit to chrisrossi/python-ndb that referenced this issue Oct 15, 2020

feat: fault tolerance for global caches

06d7a44

Closes googleapis#557

chrisrossi mentioned this issue Oct 15, 2020

feat: fault tolerance for global caches #560

Merged

chrisrossi pushed a commit to chrisrossi/python-ndb that referenced this issue Oct 18, 2020

feat: fault tolerance for global caches

c2059da

Closes googleapis#557

andrewsg closed this as completed in #560 Oct 22, 2020

andrewsg pushed a commit that referenced this issue Oct 22, 2020

feat: fault tolerance for global caches (#560)

8ab8ee0

* feat: fault tolerance for global caches Closes #557 * Fix spelling.
