
This page is Deprecated

This page has been subsumed by Object Chunking and Garbage Collection

Large Files

This page documents our ideas and open questions for large-file support.

Implementation Pseudocode

Writes

  1. Inspect the request's Content-Length header to determine whether the request is worth chunking at all. If so, move to step 2; otherwise perform a "normal" PUT.

  2. Create a UUID. This UUID will be used to namespace blocks so that concurrent updates don't collide: for example, we don't want block 0 of an in-progress PUT to trample over the block 0 of data already stored at the same key.

  3. Create a metadata/manifest object. This object has several fields and is updated as the file is chunked and written to Riak. The fields are (tentatively; see the sketch following this list):

    schema-version # useful for things like changing block size
    uuid
    bucket
    key
    content-length
    content-md5 # for vtag, maybe otherwise useful too
    time created
    time finished # if complete
    blocks remaining # the set of blocks still to be written to Riak
    deleted # bool
    

    The manifest object will be written to the same {Bucket, Key} that the object data itself would be. It is expected that siblings will occasionally be created at this key. Sometimes the siblings will share a UUID and be merged (via a deterministic algorithm), and sometimes they will have different UUIDs. Depending on the circumstance, we may purposely keep siblings with different UUIDs around (e.g. while doing GC on an old version of an object). Reads will be done with the most recent "completed" manifest object.

  4. Webmachine will begin reading the PUT request in chunks, which a coordinator will write to Riak, updating the manifest object as each chunk (or perhaps each batch of chunks) has been written. If the coordinator process dies, the request fails.
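
Below is a minimal Erlang sketch of the write-side bookkeeping described in steps 2-4. The record mirrors the tentative field list above, but the module name, function names, and block-key layout are illustrative assumptions, not a settled format.

    -module(largefile_write_sketch).
    -export([new_manifest/5, block_count/2, block_key/4]).

    %% Tentative manifest fields from the list above; record and field names
    %% are assumptions for this sketch, not a settled schema.
    -record(manifest, {schema_version = 1,
                       uuid,               %% binary(): namespaces this upload's blocks
                       bucket,
                       key,
                       content_length,
                       content_md5,
                       created,
                       finished,           %% set once the last block is written
                       blocks_remaining,   %% block indexes still to be written
                       deleted = false}).

    %% How many fixed-size blocks a PUT of this Content-Length needs.
    block_count(ContentLength, BlockSize) when BlockSize > 0 ->
        (ContentLength + BlockSize - 1) div BlockSize.

    %% Manifest created in step 3, before any blocks are written.
    new_manifest(Bucket, Key, UUID, ContentLength, BlockSize) ->
        N = block_count(ContentLength, BlockSize),
        #manifest{uuid = UUID,
                  bucket = Bucket,
                  key = Key,
                  content_length = ContentLength,
                  created = os:timestamp(),
                  blocks_remaining = ordsets:from_list(lists:seq(0, N - 1))}.

    %% Hypothetical block-key layout: including the UUID keeps block 0 of an
    %% in-flight PUT from colliding with block 0 already stored for the same
    %% {Bucket, Key}.
    block_key(Bucket, Key, UUID, Index) ->
        <<Bucket/binary, $:, Key/binary, $:, UUID/binary, $:,
          (integer_to_binary(Index))/binary>>.

new_manifest/5 seeds blocks_remaining with every block index, so the coordinator in step 4 only has to remove indexes as chunks land.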

Reads

Question: How do byte-range requests work?

  1. Retrieve the object at {Bucket, Key}. Inspect the object to determine whether it is a "normal" object or a metadata/manifest doc. If the object is "normal", return it; otherwise proceed to step 2.

  2. There might be metadata/manifest siblings. If so, resolve each group of siblings that shares a UUID. If any metadata objects are not marked as deleted, order them by date-created (or maybe date-finished) and choose the most recent one; if all of the manifest objects are marked as deleted, return 404. Calculate the chunks to be requested from the Content-Length, block size, and UUID (see the sketch after this list). Begin requesting chunks and stream them back to the user in order as they are returned. Perhaps we limit ourselves to no more than N outstanding chunk requests: for example, when the request comes in we parallelize and start fetching the first five chunks, but if we don't hear back from chunk 42 in a timely manner we don't want to have already sent requests for chunks 43-288.

  3. Perform GC on any metadata objects that "lost" in the most-recently-updated ordering. QUESTION: How do we make sure there is only a singleton(ish) GC for any given UUID?
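
Below is a minimal Erlang sketch of the read-side selection and the block calculation, which also touches the byte-range question above. Representing manifests as maps, and the field names used, are assumptions for illustration only.

    -module(largefile_read_sketch).
    -export([choose_manifest/1, blocks_for_range/3]).

    %% Pick the manifest to serve a GET from: drop manifests that are marked
    %% deleted or never finished, then take the most recently finished one.
    choose_manifest(Siblings) ->
        Live = [M || M <- Siblings,
                     maps:get(deleted, M, false) =:= false,
                     maps:get(finished, M, undefined) =/= undefined],
        case lists:sort(fun(A, B) -> maps:get(finished, A) >= maps:get(finished, B) end,
                        Live) of
            [Newest | _] -> {ok, Newest};
            []           -> not_found   %% every sibling deleted or unfinished: 404
        end.

    %% Map an inclusive byte range onto the block indexes that cover it; this
    %% is the calculation a byte-range GET needs before fetching any blocks.
    blocks_for_range(FirstByte, LastByte, BlockSize)
      when FirstByte >= 0, FirstByte =< LastByte, BlockSize > 0 ->
        lists:seq(FirstByte div BlockSize, LastByte div BlockSize).

With a 1 MB block size, for example, blocks_for_range(1048576, 3145727, 1048576) returns [1, 2], so only those two blocks need to be fetched and trimmed to the requested range.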

Deletes

  1. Retrieve the object at {Bucket, Key}. If it is a normal object, delete it as usual; if it is a metadata object, proceed to step 2.

  2. Perform sibling resolution. Mark all remaining metadata docs (which will have different UUIDs) as deleted; we can do this because we are last-write-wins. If another process is in the middle of uploading a large file when someone else issues the delete, tough luck. Launch processes to begin GC of each version (one per UUID) of the object (see the sketch below).
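
A small Erlang sketch of step 2, assuming sibling resolution has already happened and manifests are maps; delete_all/1 and gc_version/1 are hypothetical names, and the real GC worker would do the block deletion described in the next section.

    -module(largefile_delete_sketch).
    -export([delete_all/1]).

    %% Mark every surviving manifest (each with its own UUID) as deleted,
    %% relying on last-write-wins, and launch one GC worker per UUID.
    delete_all(Siblings) ->
        lists:map(fun(Manifest) ->
                      Marked = Manifest#{deleted => true},
                      spawn(fun() -> gc_version(maps:get(uuid, Marked)) end),
                      Marked
                  end, Siblings).

    %% Placeholder for the real GC worker, which would delete each block
    %% stored under this UUID and update the manifest as it goes.
    gc_version(_UUID) ->
        ok.

The last-write-wins note above is what makes blindly setting deleted => true safe here.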

GC

  1. If it is not already, mark the metadata document as deleted. Launch processes to delete the individual blocks; as each block is deleted, add it to a set called blocks_deleted (or remove it from blocks_written?). Conflict resolution of metadata docs should be straightforward (see the sketch below).

  2. When the metadata doc has been marked as deleted and there are no remaining blocks to be deleted, the metadata doc itself can be deleted.
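
A minimal Erlang sketch of why the conflict resolution should be straightforward, using the blocks_deleted variant. Map-based manifests and the total_blocks field are assumptions for this sketch.

    -module(largefile_gc_sketch).
    -export([merge/2, can_reap/1]).

    %% Deterministic merge for two manifest siblings of the same UUID:
    %% 'deleted' only ever flips to true, and blocks_deleted grows as a set
    %% union, so concurrent GC workers converge.
    merge(A, B) ->
        A#{deleted => maps:get(deleted, A, false) orelse maps:get(deleted, B, false),
           blocks_deleted => ordsets:union(maps:get(blocks_deleted, A, []),
                                           maps:get(blocks_deleted, B, []))}.

    %% The metadata doc itself can be deleted once it is marked deleted and
    %% every block it named has been deleted.
    can_reap(#{deleted := true, blocks_deleted := Deleted, total_blocks := Total}) ->
        ordsets:size(Deleted) =:= Total;
    can_reap(_Manifest) ->
        false.

Because both fields only grow (the flag toward true, the set toward complete), merging siblings in any order gives the same result.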

Questions

  • What should the block size be?
  • What are the pros/cons of CAS (content-addressable storage)?
    • pros
      • intelligent caching (w/ Andy's filename stripping idea)
    • cons
      • the manifest file needs to store block hashes, instead of block names simply being generated as the range 0..(content-length / block-size)
  • Should/can we be doing any compression?
    • can we speed things up by opting out of compression for certain MIME types?
  • We currently only have plans to launch GC on DELETE/GET/PUT. Do we need to occasionally scan the cluster for cases where there was a crash during GC and the object hasn't been touched since (no GET or PUT)?
  • What kind of concurrency can we use to store and retrieve the chunks?
  • What are the N/R/W etc. params for metadata and chunks?
  • What sort of checksumming (if any) do we need?
  • What's limiting us from raising the block size?
    • a Riak object must be deserialized "at once"
    • we pass full Riak objects through Erlang mailboxes rather than streaming them
    • ?
  • Should we be recommending any particular file system?

Longer-Term Ideas

  • Rack awareness
    • in terms of storing data on multiple racks
    • in terms of retrieving chunks from the same rack when possible
  • larger block size
  • multiple-disk support
  • dedup!
  • How do we avoid storing 9 copies with N=3 and 3 DCs?
    • could this be done by "overriding" bucket defaults gossip for a particular DC?
  • Consider Brewer's idea: distinguish between storage nodes and "data-serving" cache nodes so they can be scaled separately