Active Deletion of Small Objects

Original pull request: https://github.com/basho/riak_cs/pull/1174

When most objects in a cluster are small, they can be deleted on the fly instead of going through garbage collection.

The tradeoff here is between the efficiency of garbage collection (and disk reclaim speed) and DELETE request latency. Instead of moving manifests to the GC bucket, active deletion skips garbage collection and deletes blocks directly, which makes disk space reclamation more aggressive. This is especially useful when objects are small, for example less than one megabyte, and the garbage collector cannot keep up with the pace of object deletions.

How to use

To use this feature, set the threshold in riak-cs.conf like this:

active_delete_threshold = 1mb

A restart is required for this to take effect. With this setting, objects whose content size is smaller than 1048576 bytes are deleted immediately when the DELETE Object or DELETE Multiple Objects API is requested.
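
The size check described above can be pictured with a minimal sketch. The module and function names here are hypothetical and are not taken from the riak_cs source; the only assumption carried over from the text is that the comparison is strictly "smaller than", with both values in bytes.

```erlang
%% Hypothetical sketch, not the riak_cs implementation.
-module(active_delete_sketch).
-export([deleted_actively/2]).

%% True when an object of ContentLength bytes falls under the
%% configured threshold. The comparison is strict, matching
%% "smaller than" in the description of active_delete_threshold.
-spec deleted_actively(non_neg_integer(), non_neg_integer()) -> boolean().
deleted_actively(ContentLength, Threshold) ->
    ContentLength < Threshold.
```

With a threshold of 1024*1024 (1048576), deleted_actively(1048575, 1048576) returns true and deleted_actively(1048576, 1048576) returns false, so an object of exactly one mebibyte would still go through the garbage collector.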

Alternatively, if you do not want to stop the CS node, attach to the node with riak-cs attach and run the following commands:

> Threshold = 1024*1024.
> application:set_env(riak_cs, active_delete_threshold, Threshold).
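
To confirm the value after setting it, the standard application:get_env/2 call can be used from the same attached shell. The sketch below also sets the threshold back to 0; that this turns active deletion off is an assumption that follows from the "smaller than" comparison, since no object has a size below zero.

```erlang
%% Read back the current threshold (in bytes).
> application:get_env(riak_cs, active_delete_threshold).
{ok,1048576}
%% Assumption: a threshold of 0 matches no object, so every delete
%% goes through the garbage collector again.
> application:set_env(riak_cs, active_delete_threshold, 0).
ok
```

Values set with application:set_env/3 live only in the running node and are lost on restart, so riak-cs.conf should be updated as well if the change is meant to be permanent.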

Things to take care of in production

Concurrent reads may be disrupted by block deletion, especially if the threshold is large enough to cover objects made of multiple blocks, resulting in stalled reads or unexpected connection closes. This is because the leeway period is not considered and blocks are deleted immediately.

If active block deletion is enabled in a replication-enabled cluster:

Make sure block tombstones are replicated in realtime: advanced.config of the Riak configuration must not contain the line {replicate_cs_block_tombstone, false}.
If block tombstones are dropped at the RTQ, blocks can leak on the sink side. This is because the corresponding object manifest is erased and that deletion is replicated to the sink cluster, so the sink loses the manifest while its blocks remain.
To handle old manifests that reside on fallback nodes and are returned by handoff, manifests are kept in history so that they are not resurrected.

Operations to compensate for leaked blocks

Riak CS has an (un)official toolkit to find inconsistent blocks and manifests. Refer to its documentation for usage and further information.

Misc

From riak_cs_gc.erl,

                %% We do synchronous delete after it is marked
                %% pending_delete, to reduce the possibility where
                %% concurrent requests find active manifest (UUID) and
                %% go find deleted blocks resulting notfound stuff.
                %% However, there are still corner cases where
                %% concurrent requests interleaves between marking
                %% pending_delete here and deleting blocks, like:
                %%
                %% 1. Request A refers to a manifest finding active UUID x
                %% 2. Request B deletes an object marking active UUID x as pending_delete
                %% 3. Request B deletes blocks of UUID x according to this synchronous delete -> ok
                %% 4. Request A refers to blocks pointed by UUID x -> notfound
                %%
                %% Manifests with blocks deleted here, have
                %% `scheduled_delete' state here. They won't be
                %% collected by garbage collector, as they are not
                %% stored in GC bucket. Instead they will be collected
                %% in `riak_cs_manifest_utils:prune/1' invoked via GET
                %% object, after leeway period has passed.
                maybe_delete_small_objects(PDManifests0, RcPid, Threshold);
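
The interleaving listed in the comment above can be replayed in a toy model. This is only an illustration under simplifying assumptions: the module name is hypothetical, manifests are a plain map from key to {State, UUID}, blocks are a set of UUIDs, and nothing here touches Riak or the real riak_cs data structures.

```erlang
%% Toy model of the corner case; not riak_cs code.
-module(active_delete_race).
-export([run/0]).

run() ->
    Key = <<"key">>,
    Manifests0 = #{Key => {active, x}},
    Blocks0 = sets:from_list([x]),
    %% 1. Request A reads the manifest and finds active UUID x.
    {active, UUID} = maps:get(Key, Manifests0),
    %% 2. Request B deletes the object, marking UUID x pending_delete.
    Manifests1 = Manifests0#{Key := {pending_delete, x}},
    %% 3. Request B deletes the blocks of UUID x synchronously.
    Blocks1 = sets:del_element(x, Blocks0),
    %% 4. Request A fetches the blocks of the UUID it saw in step 1.
    ResultA = case sets:is_element(UUID, Blocks1) of
                  true -> ok;
                  false -> notfound
              end,
    {maps:get(Key, Manifests1), ResultA}.
```

Calling active_delete_race:run() returns {{pending_delete,x},notfound}: request A still holds the UUID it read while the manifest was active, but the blocks are already gone by the time it asks for them.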
