# Architectural info? #1
Hi @cbluth, thanks for reaching out! We created this project to verify that what we understand is correct, since we're from the team responsible for an in-house object storage system serving a massive number of users.

## What's Object Storage

### About data

An object (or blob, or file, whatever you'd like to call it) consists of the following content:
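The original list isn't reproduced here, but as a rough, hedged sketch of what such an object typically carries (field names are my own illustration, not taken from this repo):

```go
// Object is a hypothetical sketch of what an object-storage entry
// usually carries: an opaque key, the raw bytes, and some metadata.
type Object struct {
	Key      string            // user-defined name, e.g. "photos/2023/cat.jpg"
	Data     []byte            // the immutable payload
	Size     int64             // length of Data in bytes
	ETag     string            // content hash, useful for integrity checks
	Metadata map[string]string // user-defined key/value pairs
}
```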
### About operations

When we talk about a filesystem, there are two features which are impossible to support in object storage:
So you can only have these operations in object storage:
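Again, the original list isn't shown above, but the usual flat key/value surface of an object store looks roughly like this (the interface and method names are my own sketch, not this project's API):

```go
// ObjectStore is a hypothetical sketch of the flat operation set an
// object store typically exposes: whole-object puts and gets, deletes,
// and prefix listing -- no in-place updates, no real directories.
type ObjectStore interface {
	Put(key string, data []byte) error
	Get(key string) ([]byte, error)
	Delete(key string) error
	List(prefix string) ([]string, error)
}
```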
## Why Distributed?
Remember that our service is stateful, just like a database, so it's generally hard to achieve the goals above.

## So How?

### Availability: Replication and consensus

You can replicate the data (or the operation log) to some backup servers, and use a protocol to make sure the backup servers hold the same data as the main servers. In 2023, Raft is always the starting point for replication and consensus. You have two ways to implement your service:
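The two implementation options aren't listed above, but either way the core idea is a replicated state machine: writes become log entries, a consensus protocol such as Raft decides their order, and every replica applies committed entries in that same order so all copies converge. A minimal, hedged sketch of the apply side (not this project's code, and not tied to any particular Raft library):

```go
// Entry is one operation in the replicated log.
type Entry struct {
	Index uint64 // position in the log; replicas apply in Index order
	Key   string
	Value []byte
}

// Replica applies committed log entries to its local state machine.
// Deciding which entries are committed, and electing a leader, is the
// consensus protocol's job and is elided here.
type Replica struct {
	applied uint64
	state   map[string][]byte
}

func NewReplica() *Replica {
	return &Replica{state: make(map[string][]byte)}
}

func (r *Replica) Apply(e Entry) {
	if e.Index != r.applied+1 {
		return // duplicate or out-of-order delivery; a real implementation would buffer or reject
	}
	r.state[e.Key] = e.Value
	r.applied = e.Index
}
```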
### Scalability: Partitioning (sharding) and rebalancing

If you're sure that reading from followers is consistent enough for your business, you can route all of your reads to a random peer in the cluster. However, you can only route writes to the leader, which drains that single server under very high load once all your customers flood into your service. In this case you can only scale up, not scale out, which means buying more expensive servers instead of adding more cheap ones. To solve this problem, we partition our data across multiple leaders. Under high load, we add a new follower to the configuration and rebalance the whole cluster to take load off the running servers. Generally you have three rules to partition your data:
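The rules themselves aren't reproduced above; the common choices are partitioning by key range, by key hash, or a hybrid of the two. As a hedged illustration (the function name and the simple modulo scheme are mine, not this project's), hash partitioning can be as small as this:

```go
import "hash/fnv"

// partitionOf maps an object key to one of n partitions (Raft groups)
// by hashing the key. Real systems usually prefer consistent hashing
// or range splits so adding a partition doesn't reshuffle every key.
func partitionOf(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % n
}
```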
Whichever rules you follow, the actual partitioning algorithm is business-defined; just test and adjust.

With respect to partitioning a Raft cluster, that's where multiraft kicks in: a single server can host thousands of Raft peers, each of which may be a follower or a leader. A Raft group is just a data partition, nothing fancy. The locations of all Raft leaders (the address of the server that hosts multiple Raft groups, not a particular Raft node) should be stored in an independently running service called PD (placement driver), or controller, or orchestrator, or locator; pick your favorite name here. That's because the client has to locate the correct partition and write the data to that partition's Raft leader.

But when does partitioning happen, and how do we rebalance? In the very beginning you have only one Raft group. Later, our long-running rebalancing worker (maybe inside PD? not sure yet) somehow finds that we have two partitions while server D sits idle, which is not good. We create a learner of one of the Raft groups on server D, and once it has caught up, that partition can be served from server D, taking load off the busier servers.
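To make the client-side flow concrete, here is a hedged sketch (types and names are my own, not this repo's) of how a write could locate its partition via PD and then hit that partition's leader:

```go
// PD is a hypothetical placement-driver client: it answers
// "which server currently leads the Raft group that owns partition p?".
type PD interface {
	LeaderOf(partition int) (addr string, err error)
}

// DataClient is a hypothetical connection pool to the storage servers.
type DataClient interface {
	Put(addr, key string, data []byte) error
}

// put routes a write: hash the key to a partition, ask PD for that
// partition's current leader, then send the data to that server.
func put(pd PD, dc DataClient, key string, data []byte, partitions int) error {
	p := partitionOf(key, partitions) // see the hash sketch above
	addr, err := pd.LeaderOf(p)
	if err != nil {
		return err
	}
	return dc.Put(addr, key, data)
}
```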
## Distributed Object Storage

Now that we know how to build a distributed system with high availability and scalability, how do we use these toys to create a distributed object storage? We create the following components:

But to our knowledge, it's not feasible to combine object meta and data together inside the dataserver, because objects can be both huge and tiny, and deletes happen in massive volumes every day. So we separate it into a dataserver and a metaserver: the latter is responsible for object meta only, and the former acts like the following Go function:

```go
type ObjectID string

func alloc(data []byte) ObjectID
```

The dataserver only stores data and its ID, while the metaserver stores the mapping between object keys (user-defined, just like filenames) and object IDs. The design of the dataserver from our team is complicated; I want to simplify it and build an intuitive, minimal one without reading their code. Some key notes of the design would be:
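The key notes themselves aren't listed above. As a hedged, minimal illustration of the dataserver/metaserver split described in that paragraph (all names and the content-hash ID scheme are my own assumptions, not this repo's design):

```go
import (
	"crypto/sha256"
	"encoding/hex"
)

// ObjectID as introduced above; in this sketch it is derived from the content hash.
type ObjectID string

// DataServer stores immutable blobs addressed by ID; it knows nothing
// about user-visible keys.
type DataServer struct {
	blobs map[ObjectID][]byte
}

func NewDataServer() *DataServer {
	return &DataServer{blobs: make(map[ObjectID][]byte)}
}

// Alloc matches the shape of the alloc function above: store the bytes,
// hand back an ID that the metaserver can later bind a key to.
func (d *DataServer) Alloc(data []byte) ObjectID {
	sum := sha256.Sum256(data)
	id := ObjectID(hex.EncodeToString(sum[:]))
	d.blobs[id] = data
	return id
}

// MetaServer keeps only the small, mutable mapping from user-defined
// keys (like filenames) to object IDs.
type MetaServer struct {
	keys map[string]ObjectID
}

func NewMetaServer() *MetaServer {
	return &MetaServer{keys: make(map[string]ObjectID)}
}

func (m *MetaServer) Bind(key string, id ObjectID) { m.keys[key] = id }

func (m *MetaServer) Resolve(key string) (ObjectID, bool) {
	id, ok := m.keys[key]
	return id, ok
}
```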
## What Else?
## Further Reading
I'm interested in reviewing this project, cool idea.
Can you provide info on the different components (cmd dir)? And what they aim to do?