v2 Storage and Distribution System Specification #2224
Additional requirements:
A requirement we should consider adding is some way to require that a claimed liaison confirms the upload, to prove that the uploader at least contacted them for this purpose. But this is probably not needed right away.
An obvious point which is missing, but was pointed out by @mnaamani, is that distributors only store a subset of all videos in their corresponding bucket in their cache, which is managed using a time-aware least-recently-used (TLRU) policy. This will radically reduce the storage requirements of distributors, and will in practice work well. See the section on cache policies here. There should probably be a distinct API for the very fast fetching that is required on a cache miss, perhaps even over a persistent connection.
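To make the policy concrete, here is a minimal sketch of a time-aware least-recently-used (TLRU) cache of the kind a distributor node might maintain; all names and the byte-budget eviction strategy are illustrative assumptions, not part of this specification.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// One cached object; hypothetical fields for this sketch only.
struct Entry {
    size: u64,           // object size in bytes
    last_access: Instant,
    expires_at: Instant, // time-to-use (TTU) deadline
}

struct TlruCache {
    entries: HashMap<u64, Entry>, // keyed by data object id
    used_bytes: u64,
    capacity_bytes: u64,
    ttu: Duration,
}

impl TlruCache {
    fn new(capacity_bytes: u64, ttu: Duration) -> Self {
        Self { entries: HashMap::new(), used_bytes: 0, capacity_bytes, ttu }
    }

    /// Record a hit: refresh both recency and the TTU deadline.
    /// Returns false on a miss (absent or expired), in which case the
    /// node would do the fast fetch from a storage node.
    fn touch(&mut self, object_id: u64) -> bool {
        let now = Instant::now();
        match self.entries.get_mut(&object_id) {
            Some(e) if e.expires_at > now => {
                e.last_access = now;
                e.expires_at = now + self.ttu;
                true
            }
            _ => false,
        }
    }

    /// Insert after a cache miss, evicting until the object fits.
    fn insert(&mut self, object_id: u64, size: u64) {
        let now = Instant::now();
        // First drop everything whose TTU has lapsed.
        let expired: Vec<u64> = self
            .entries
            .iter()
            .filter(|(_, e)| e.expires_at <= now)
            .map(|(id, _)| *id)
            .collect();
        for id in expired {
            self.evict(id);
        }
        // Then evict least-recently-used entries until there is room.
        while self.used_bytes + size > self.capacity_bytes {
            let lru = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_access)
                .map(|(id, _)| *id);
            match lru {
                Some(id) => self.evict(id),
                None => return, // object larger than the whole cache
            }
        }
        self.entries.insert(
            object_id,
            Entry { size, last_access: now, expires_at: now + self.ttu },
        );
        self.used_bytes += size;
    }

    fn evict(&mut self, object_id: u64) {
        if let Some(e) = self.entries.remove(&object_id) {
            self.used_bytes -= e.size;
        }
    }
}
```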
Question: A question came up on a recent call about why uploading, at least when initiated through an off-chain signer (like a member or channel owner), should not be possible through a direct extrinsic on the storage module. Answer:
Background
The current Joystream network, as of the Sumer release, has an extremely limited system for storing and distributing data, both in terms of functionality and in its ability to accommodate even a limited scale of utilisation. This specification is intended to substantially improve upon these limitations by settling the organisation and function of the system at a level suitable for mainnet purposes. Importantly, this specification should be read in the context of the Gateway specification #2089, which is complementary in that it outlines an incentive model for the vast majority of expected load on distributors.
Major Changes
The overall design philosophy of the system remains the same as before in the following respects:
However, the following major changes are introduced:
Architecture
Working Groups and Roles
There are two working groups, the storage working group and the distribution working group, each with its own set of workers and its own lead, called the storage lead and the distribution lead respectively. The workers in the storage working group are called storage providers, and operate dedicated nodes for this purpose, called storage nodes. Likewise, the distribution working group has distribution providers and distribution nodes.
Activity
There are two major service provider activities: storing data and distributing data. By storing we mean long-term archiving, and by implication accepting uploads of new data over time. Storing data also implies distributing data to others in order for them to persist it, for example as new data or providers enter the system. The primary operational requirement of this activity is the ability to securely persist and recover very large datasets. By distributing data we mean streaming transmission of a small subset of all available data, with low latency and high throughput, to a large number of simultaneous users. Initial data copies are obtained from storage providers. All communication uses HTTP, specifically
When new data is added to the system, the chain initially determines how it should be stored and distributed, but this can later be updated through manual intervention.
Importantly, the storage infrastructure is not public facing; it only serves itself and the distribution infrastructure. The distribution infrastructure is public facing; however, the specific policy around what kind of data is accessible to different audiences (everyone, members only, etc.) is a policy variable which is outside the scope of this specification. It may also be subject to contextual information around the specific use case to which the data corresponds; for example, data from the content system may have explicit owner-specifiable access policies, like paywalls or subscriptions.
Fees
The only fees facing users are associated with publishing new data into the system, namely
Concepts
Data Object
The fundamental concept in the system is a data object, which represents a single static binary object in the system. The main goal of the system is to retain an index of all such objects, including who owns them, and information about what actors are currently tasked with storing and distributing them to end users. The system is unaware of the underlying content represented by such an object, as it is used by different parts of the Joystream system. It can represent assets as diverse as
Bags
A data object bag, or bag for short, is a dynamic collection of data objects which can be treated as one subject in the system. Each bag has an owner, which is established when the bag is created. A data object lives in exactly one bag, but may be moved across bags by the owner of the bag. Only the owner can create new data objects in a bag, or opt into absorbing objects from another bag.
The purpose of the concept of bags is to limit the on-chain footprint of administering multiple objects which should be treated the same way. This is achieved by establishing a small immutable identifier for these objects. The canonical example would be assets that will be consumed together, such as the cover photo and the different video media encodings of a single piece of video content. Storage and distribution nodes have commitments to bags, not to individual data objects.
There are two different kinds of bags, static bags and dynamic bags. The former are all created when the system goes live and cannot be deleted, while the latter come in a few different types, and new instances of each type can be created over time. Specifically, there is one static bag for the council and one for each working group, and there are member, channel and DAO dynamic bag types. When a new member, channel or DAO is created, a new instance of the corresponding bag type is created.
A dynamic bag creation policy holds parameter values impacting how exactly the creation of a new dynamic bag occurs, and there is one such policy for each type of dynamic bag. It describes how many storage buckets should store the bag, and from what subset of distribution bucket families (described below) to select a given number of distribution buckets (described below).
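As a rough illustration of the concepts above, here is a minimal sketch of data objects, bags and the per-type dynamic bag creation policy; all identifiers and field layouts are assumptions for illustration, not final spec names.

```rust
use std::collections::BTreeSet;

/// A single static binary object; the chain stores only its metadata.
struct DataObject {
    ipfs_content_id: Vec<u8>, // content hash of the underlying bytes
    size: u64,                // counted against storage bucket vouchers
}

/// Static bags exist from genesis; dynamic bags are created over time.
enum BagId {
    Static(StaticBagId),
    Dynamic(DynamicBagId),
}

enum StaticBagId {
    Council,
    WorkingGroup(u8), // one per working group
}

enum DynamicBagId {
    Member(u64),
    Channel(u64),
    Dao(u64),
}

/// A bag: the unit to which storage and distribution buckets commit.
struct Bag {
    objects: BTreeSet<u64>,        // ids of contained data objects
    stored_by: BTreeSet<u64>,      // storage bucket ids
    distributed_by: BTreeSet<u64>, // distribution bucket ids
}

/// One policy per dynamic bag type, consulted when a bag is created.
struct DynamicBagCreationPolicy {
    /// How many storage buckets should store the new bag.
    number_of_storage_buckets: u32,
    /// For each chosen distribution bucket family, how many of its
    /// buckets should be assigned to distribute the new bag.
    families: Vec<(u64 /* family id */, u32 /* bucket count */)>,
}
```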
Storage Buckets and Vouchers
A storage bucket is a commitment to hold some set of bags for long-term storage. A bucket may have a bucket operator, which is a single worker in the storage working group. There is distinct bucket operator metadata associated with each, which describes things such as how to resolve the host. The operator of a bucket may change over time. As previously described, when new dynamic bags are created, they are allocated to one or more such buckets, unless a bucket has been temporarily disabled from accepting new bags. A bucket also has a voucher, which describes limits on how much data and how many objects can be assigned to the bucket, across all bags, as well as the current usage. This limitation is enforced when data objects are uploaded or moved. The voucher limits can be updated by the
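Here is a minimal sketch of the voucher bookkeeping described above, under assumed field names; the real module would also need to handle the disabled flag and operator assignment.

```rust
// Hypothetical voucher shape for this sketch only.
struct Voucher {
    size_limit: u64,    // max total bytes assignable to the bucket
    objects_limit: u64, // max number of objects assignable to the bucket
    size_used: u64,
    objects_used: u64,
}

impl Voucher {
    /// Checked whenever objects are uploaded into, or moved into,
    /// a bag stored by this bucket.
    fn can_accept(&self, added_size: u64, added_objects: u64) -> bool {
        self.size_used.saturating_add(added_size) <= self.size_limit
            && self.objects_used.saturating_add(added_objects) <= self.objects_limit
    }

    /// Record accepted objects against the voucher's usage counters.
    fn record(&mut self, added_size: u64, added_objects: u64) {
        self.size_used += added_size;
        self.objects_used += added_objects;
    }
}
```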
Distribution Bucket Families and Distribution Buckets
A distribution bucket is a commitment to distribute a set of bags to end users. A bucket may have multiple bucket operators, each being a worker in the distribution working group. The same metadata concept applies here as well, and additionally covers whether the operator is live or not. Bags are assigned to buckets when being uploaded, or later by the lead through manual intervention. Buckets are partitioned into so-called distribution bucket families. These families group buckets with interchangeable semantics from a distribution point of view, and the purpose of the grouping is to allow sharding over the bag space for a given service level when creating new bags. An example makes this clearer: a subset of families could represent the countries of East Asia, with each family corresponding to a specific country. The buckets in a family, say the family for Mongolia, would be operated by infrastructure which can provide sufficiently low latency guarantees w.r.t. the corresponding country. The bag for a channel known to be particularly popular in this area could be set up to use these buckets disproportionately.
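The following sketch illustrates how buckets might be picked from a family when a new bag is created; the deterministic rotation used here merely stands in for whatever (pseudo-random) selection rule the runtime actually adopts, and all names are hypothetical.

```rust
use std::collections::BTreeSet;

/// Buckets with interchangeable semantics, e.g. all serving Mongolia.
struct DistributionBucketFamily {
    bucket_ids: Vec<u64>,
}

/// Pick `count` buckets from a family for a newly created bag.
/// A simple rotation keyed on the bag id spreads bags over the
/// family's buckets without any per-bucket database iteration.
fn pick_buckets(
    family: &DistributionBucketFamily,
    count: u32,
    bag_id: u64,
) -> BTreeSet<u64> {
    let n = family.bucket_ids.len();
    (0..count as usize)
        .filter_map(|i| {
            family
                .bucket_ids
                .get((bag_id as usize + i) % n.max(1))
                .copied()
        })
        .collect()
}
```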
Utilisation Model
See here #2359
Runtime: storage_and_distribution
The subsystem-specific functionality will be introduced in a single new module, called storage_and_distribution, sketched out below. All user-level actions must come from within the runtime, and each client module must be able to enforce its own authorisation of user transactions using local state about ownership and possible extraneous utilisation limits.
Unlike many other modules, there is quite frequent use of inlined storage and state representations, in order to allow iteration without database accesses; this is unavoidable to achieve the computational efficiency required for key operations, in particular deletion and assigning new bags to buckets.
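The following sketch illustrates the trade-off: with the set of storing buckets inlined in the bag value itself, deleting a bag touches each affected bucket exactly once, with no iteration over a separate (bag, bucket) mapping. Types and names are illustrative assumptions, with plain maps standing in for runtime storage.

```rust
use std::collections::{BTreeMap, BTreeSet};

type BagId = u64;
type BucketId = u64;

struct Bag {
    stored_by: BTreeSet<BucketId>, // inlined: read together with the bag
    objects_total_size: u64,
}

/// Deleting a bag needs only the bag record itself: the inlined set
/// already names every bucket whose voucher usage must be credited
/// back, so no per-object or per-pair iteration is required.
fn delete_bag(
    bags: &mut BTreeMap<BagId, Bag>,
    voucher_size_used: &mut BTreeMap<BucketId, u64>,
    bag_id: BagId,
) {
    if let Some(bag) = bags.remove(&bag_id) {
        for bucket_id in bag.stored_by {
            if let Some(used) = voucher_size_used.get_mut(&bucket_id) {
                *used = used.saturating_sub(bag.objects_total_size);
            }
        }
    }
}
```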
Concepts
Module Account
A dedicated module account should be introduced for holding the funds that incentivise cleaning up stale state objects.
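A minimal sketch of how such an account could be derived, assuming a FRAME-style runtime recent enough to provide PalletId and into_account_truncating; the 8-byte pallet id value is an illustrative placeholder, not the spec's choice.

```rust
use frame_support::PalletId;
use sp_runtime::traits::AccountIdConversion;

/// Placeholder identifier; the real module would fix its own 8-byte id.
const STORAGE_PALLET_ID: PalletId = PalletId(*b"storwork");

/// The account holding the funds that incentivise cleaning up stale
/// state objects: deposits are paid into it on upload and refunded
/// from it on deletion.
fn module_account_id<T: frame_system::Config>() -> T::AccountId {
    STORAGE_PALLET_ID.into_account_truncating()
}
```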
Constants
State
Extrinsics
Storage Node: Colossus
Here is a list of requirements for the storage node reference implementation.
Distributor Node: Argus
Here is a list of requirements for the distributor node reference implementation.
Joystream CLI
New commands for the Colossus node operator.
New commands for the Argus node operator.