
Architectural info? #1

Open
cbluth opened this issue Sep 22, 2023 · 1 comment

Comments


cbluth commented Sep 22, 2023

I'm interested in reviewing this project, cool idea.
Can you provide info on the different components (the cmd dir) and what they aim to do?


anqurvanillapy commented Sep 23, 2023

Hi @cbluth, thanks for reaching out!

We created this project to verify that our understanding is correct; we're from the team responsible for an in-house object storage service that serves a massive number of users.

What's Object Storage

About data

An object (or blob, or file, whatever you'd like to call it) consists of the following content (a rough sketch in Go follows this list):

  • Object meta:
    • Size: the length of the content
    • ETag: entity tag identifying the object's content; e.g., the same ETag under different file names can indicate that the underlying physical data is the same
    • Content type, content disposition, expires, etc., i.e. the usual HTTP headers
    • User meta: user-provided metadata whose keys are prefixed with X-Amz-Meta- (an AWS S3 convention)
    • ..., etc.
  • Data: the actual content, the object body
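As a rough sketch in Go (the field names here are illustrative, not taken from any particular implementation):

// Illustrative model of an object; field names are mine, not any real S3 implementation's.
import "time"

// Meta holds everything about an object except its body.
type Meta struct {
	Size               int64             // length of the body in bytes
	ETag               string            // entity tag identifying the content
	ContentType        string            // the usual HTTP headers...
	ContentDisposition string
	Expires            time.Time
	UserMeta           map[string]string // user-provided, keys prefixed with "X-Amz-Meta-"
}

// Object is the meta plus the actual content (the body).
type Object struct {
	Meta Meta
	Data []byte // in practice this would be a stream, not a byte slice
}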

About operations

When we talk about filesystems, there are 2 features that are impossible to support in object storage:

  • Access: reads and writes can be sequential or random
  • Namespacing: manipulating directories is super fast

So in object storage you only get these (see the sketch after this list):

  • Sequential and random reads
  • Whole-object writes: if you want to update even a tiny portion of a file, you have to upload the whole file and overwrite it
  • Pseudo-namespacing: e.g. in /image/foo.jpg, the /image/ part is just a string prefix; you can manipulate objects by filtering on prefixes, but that is super slow
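Concretely, a hypothetical minimal interface for those operations might look like this (the names are illustrative, not from any real SDK):

// Hypothetical minimal interface for the operations above.
type ObjectStore interface {
	// Get reads the object at key; off/length allow sequential or random reads.
	Get(key string, off, length int64) ([]byte, error)
	// Put overwrites the whole object; there is no partial update.
	Put(key string, data []byte) error
	// List filters keys by a string prefix such as "/image/"; this is the
	// slow pseudo-namespacing, not a real directory listing.
	List(prefix string) ([]string, error)
	Delete(key string) error
}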

Why Distributed?

  • To achieve high availability
  • Easy to scale in and out

Remember that our service is stateful, just like a database, so the goals above are generally hard to achieve.

So How?

Availability: Replication and consensus

You can replicate the data (or the operation logs) to some backup servers, and use a protocol to make sure the backup servers hold the same data as the main servers.

In 2023, Raft is the usual starting point for replication and consensus. You have 2 ways to implement your service (a sketch follows the list):

  • In-memory states and periodic snapshotting: if your states won't blow up the memory, you can keep them in memory and let the Raft framework tell you when to persist a snapshot. Snapshotting typically doesn't block log replication or the running state machine; they can run in parallel
  • Durable states and no-op snapshotting: you can keep your states in a fast persistent store like RocksDB inside your state machine, and upon snapshotting simply record some important Raft states wherever you like. The Raft states include the term, the current configuration (namely the info of all peers in the cluster), the configuration index (the configuration is just an array, so the index indicates which entry your server is), etc. Durable states are obviously the choice when you want your machines to use less memory
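Here's a rough sketch of the first option; the method set below is hypothetical and only resembles (rather than matches) what real Raft frameworks expect from a state machine:

// Hypothetical shape of option 1 (in-memory states + periodic snapshotting).
import (
	"encoding/json"
	"sync"
)

type KVStateMachine struct {
	mu   sync.RWMutex
	data map[string]string // all states live in memory
}

// Apply is called by the Raft framework for every committed log entry.
func (m *KVStateMachine) Apply(entry []byte) error {
	var cmd struct{ Key, Value string }
	if err := json.Unmarshal(entry, &cmd); err != nil {
		return err
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[cmd.Key] = cmd.Value
	return nil
}

// Snapshot serializes the in-memory states; the framework persists the result
// so the log can be truncated. It can run in parallel with log replication.
func (m *KVStateMachine) Snapshot() ([]byte, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return json.Marshal(m.data)
}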

Scalability: Partitioning (sharding) and rebalancing

If you're sure that reading from followers is consistent enough for your business, you can route reads to a random peer in your cluster. Writes, however, can only go to the leader, which drains that one server under very high load when all your customers flood your service. In this case you can only scale up, not scale out, which means buying more expensive servers instead of adding more cheap ones.

To solve this problem, we can partition our data across multiple leaders. Upon high load, we add a new follower to the configuration and rebalance the whole cluster to take load off the running servers.

Generally you have 3 ways to partition your data:

  • Range-based: your data are indexed by keys that are strings or monotonically increasing integers, so it's natural to set up ranges to partition by
  • Hash-based: hash the keys and place the data into the corresponding buckets; very intuitive
  • Range-hash-based: yup, the hybrid way, if you want to shard your data at multiple levels

These are only the rules to follow; the actual partitioning algorithm is business-defined, so just test and adjust.
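For illustration only (none of this is from the project), the two basic rules boil down to something like:

// Illustrative partitioning helpers; the real algorithm is business-defined.
import (
	"hash/fnv"
	"sort"
)

// hashPartition hashes the key and places it into one of n buckets.
func hashPartition(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

// rangePartition: bounds are the exclusive upper bounds of each range, sorted
// ascending; the partition is the first range whose bound is greater than the
// key (index len(bounds) means the final, unbounded range).
func rangePartition(key string, bounds []string) int {
	return sort.Search(len(bounds), func(i int) bool { return bounds[i] > key })
}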

With respect to partitioning a Raft cluster, that's where multiraft kicks in: a single server can host thousands of Raft peers, each one a follower or a leader, and a Raft group is just a data partition, nothing fancy. The locations of all Raft leaders (the address of the server that hosts the group, not a particular Raft node) are stored in an independently running service called PD (placement driver), or controller, or orchestrator, or locator; pick your favorite name. That's because the client has to locate the correct partition and write the data to its Raft leader.
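A stripped-down, hypothetical view of what such a PD has to store and answer:

// Hypothetical sketch of PD state: which Raft group owns a key range, and
// which server currently hosts that group's leader.
type GroupID uint64

type Region struct {
	StartKey, EndKey string  // the partition's key range [StartKey, EndKey)
	Group            GroupID // the Raft group serving this range
	LeaderAddr       string  // address of the server hosting the group's leader
}

// Locate finds the region owning the key, so the client knows where to write.
func Locate(regions []Region, key string) (Region, bool) {
	for _, r := range regions {
		if key >= r.StartKey && (r.EndKey == "" || key < r.EndKey) {
			return r, true
		}
	}
	return Region{}, false
}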

But when would partitioning happen? And how to rebalance them?

In the very beginning you have only one Raft group, #0, spread over, say, 4 servers (A, B, C, D): servers A, B, and C own this group, and the leader is on server A. You set a threshold for splitting the data, e.g. a maximum number of keys or a maximum size per partition. When it's triggered, you split the partition in two and create a new Raft group #1, forcing server A to take its leadership, so there's no need to transfer the second half of the data to other servers, which would be costly; the data is split and the new Raft group created in place. If that succeeds, we tell PD that a new Raft group exists, along with its key range, so clients can now route writes to 2 partitions.
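In code, the split trigger is nothing fancier than a threshold check (the numbers below are made up):

// Made-up thresholds; real values are tuned per workload.
const (
	maxKeysPerPartition = 1_000_000
	maxPartitionBytes   = 256 << 20 // 256 MiB
)

// shouldSplit decides whether a partition has grown past the threshold and
// should be split in place into a new Raft group on the same server.
func shouldSplit(numKeys int, sizeBytes int64) bool {
	return numKeys > maxKeysPerPartition || sizeBytes > maxPartitionBytes
}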

And then our long-running rebalancing worker (maybe inside PD? not sure yet) notices that we have 2 partitions while server D sits idle, which isn't good. We create a learner of Raft group #1 on server D; it receives all the logs and snapshots from the other peers, and once it has caught up we transfer leadership from server A to server D (or another server) and let server A leave the configuration, finishing the rebalance.

Distributed Object Storage

Now that we know how to build a distributed system with high availability and scalability, how do we use these toys to create a distributed object storage?

We create the following components:

  • Gateway: a stateless service that implements the S3 protocol and locates the right servers to read and write the data; it scales out by simply adding new servers
  • Placer: typically a 3-node Raft cluster; it stores the mapping from key ranges to the locations of all partitions, plus the states of all dataservers for future rebalancing and other administrative work
  • Dataserver: the multiraft cluster that holds the object meta and object data

But to our knowledge, it's not feasible to keep object meta and data together inside the dataserver, because objects can be both huge and tiny and deletes are massive every day. So we split it into a dataserver and a metaserver: the latter is responsible for object meta only, and the former acts like the following Go function:

// ObjectID identifies a piece of immutable object data inside the dataserver.
type ObjectID string

// alloc stores the data and returns the ID used to reference it later.
func alloc(data []byte) ObjectID

The dataserver only stores data and its ID, while the metaserver stores the mapping between object keys (user-defined, just like filenames) and object IDs (see the sketch below).
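Putting the two together, a write roughly looks like this (everything besides alloc and ObjectID is a hypothetical name of mine):

// Hypothetical sketch of how metaserver and dataserver split the work.
type ObjectKey string // user-defined, e.g. "/image/foo.jpg"

type MetaServer struct {
	index map[ObjectKey]ObjectID // object key -> object ID (object meta omitted)
}

// putObject: the gateway writes the body to the dataserver first, then records
// the key -> ID mapping in the metaserver.
func putObject(m *MetaServer, key ObjectKey, data []byte) {
	id := alloc(data) // dataserver: store the data, get back its ID
	m.index[key] = id // metaserver: remember where the key's data lives
}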

The design of the dataserver from our team is complicated; I want to simplify it and build an intuitive, minimal one without reading their code. Some key notes of the design (a naive sketch follows the list):

  • Data are stored sequentially in a very long circular buffer of fixed-size blocks
  • Blocks are indexed by an in-memory table for object ID lookup
  • Delete is just an operation that marks tombstones on blocks
  • GC acts like Haskell's: since our object data is immutable, we only need to move the live values to the head of the block queue; the GC workers run every midnight
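Here's a deliberately naive sketch of that layout; all the names and numbers are mine, not from the in-house design:

// Naive sketch of the dataserver layout described above.
const blockSize = 4 << 20 // fixed block size, e.g. 4 MiB; made-up number

type blockRef struct {
	block  int // index into the circular buffer of blocks
	offset int // offset of the object's data inside the block
	length int
	dead   bool // tombstone set by Delete; GC reclaims the space later
}

type BlockStore struct {
	blocks [][]byte               // the long circular buffer of fixed-size blocks
	head   int                    // where new data (and GC-copied live values) go
	index  map[ObjectID]*blockRef // in-memory table for object ID lookup
}

// Delete only marks a tombstone; the data stays until the midnight GC copies
// the still-live values toward the head and frees whole blocks behind it.
func (s *BlockStore) Delete(id ObjectID) {
	if ref, ok := s.index[id]; ok {
		ref.dead = true
	}
}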

What Else?

  • DR (disaster recovery): we can add learners for the metaserver and dataserver instances in different DCs, so data are replicated per partition and we can switch traffic over upon a DC failure
  • EC (erasure coding): to keep dataservers highly available at a lower cost, we need EC to lower the budget (with 3-replica Raft only 33% of the raw capacity holds unique data, while 4+2 EC achieves 66%; that's a huge boost)

Further Reading

  • The design of TiKV is really inspiring; you can see this post for more insights
