update en 09 doc #895

Merged: 4 commits, Aug 15, 2023
232 changes: 227 additions & 5 deletions docs/en/09.poster-node.md
@@ -2,10 +2,235 @@

In earlier versions, `damocles-manager` could enable the corresponding modules through the `--poster` and `--miner` flags of the `daemon run` command, but the `PoSt` proving process remained tightly coupled to sector location information, which made it restrictive and hard to scale.

From v0.2.0 onwards, we provide a set of features that make easy-to-use, scalable standalone PoSter nodes an option for `SP`s with **large-scale operations** and **multiple miner IDs**.

Below, we introduce these features and walk through deploying standalone PoSter nodes with them. The following sections use a node with `--poster` enabled as the example; a standalone `--miner` node works in a similar way and is not described separately.

---

From v0.8.0 onwards, damocles supports three ways to run standalone PoSter nodes: worker-prover mode, proxy node mode, and ext-prover mode (external executor mode).

## worker-prover mode
Worker-prover mode, new in v0.8.0, is simple to set up and makes it easy to run WindowPoSt (wdpost) across multiple machines, with coordination and redundancy.

### How it works
In worker-prover mode, damocles-worker computes the WindowPoSt proofs: it pulls WindowPoSt tasks from damocles-manager over RPC and returns the results the same way.

To support this, damocles-worker adds a `wdpost` plan (the wdpost planner) for executing WindowPoSt tasks.
#### Architecture
```
                 +-----------------------------------+
                 |      damocles-manager daemon      |
                 |     with --worker-prover flag     |
                 |                                   |
                 |        +-----------------+        |
                 |        |damocles-manager |        |
                 |        |  poster module  |        |
                 |        +-------+-^-------+        |
                 |           send | |recv            |
                 |                | |                |
                 |        +-------v-+-------+        |
                 |        |  worker-prover  |        |
        +--------+-------->     module      <--------+--------+
        |        |        +--------^--------+        |        |
        |        |                 |                 |        |
        |        +-----------------+-----------------+        |
        |                          |                          |
 -------+--------------------------+--------------------------+------------
        |                          |                          |
        | pull job                 | pull job                 | pull job
        | push res                 | push res                 | push res
        | by rpc                   | by rpc                   | by rpc
        |                          |                          |
 +------+--------+         +-------+-------+           +------+--------+
 |damocles-worker|         |damocles-worker|           |damocles-worker|
 |wdpost planner |         |wdpost planner |   ...     |wdpost planner |
 +---------------+         +---------------+           +---------------+
```

### damocles-manager configuration and startup

New configuration:
```toml
# ~/.damocles-manager/sector-manager.cfg

#...

[Common.Proving.WorkerProver]
# Maximum number of attempts for a WindowPoSt task, optional, number type
# Default value is 2
# A WindowPoSt task that has reached JobMaxTry attempts can only be re-run after a manual reset
JobMaxTry = 2
# WindowPoSt task heartbeat timeout, optional, time string type
# Default value is 15s
# Tasks that have not sent a heartbeat for longer than this are marked as failed and retried
HeartbeatTimeout = "15s"
# WindowPoSt task lifetime (expiration time), optional, time string type
# Default value is 25h
# WindowPoSt tasks created more than this long ago are deleted
JobLifetime = "25h0m0s"

#...
```
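To make the interaction between these three settings concrete, here is a minimal shell sketch of the rules stated in the comments above. It is not damocles-manager's code: the function name `wdpost_task_action`, its arguments, and the order in which the checks are applied are assumptions for illustration only.

```sh
# Hypothetical model of the WorkerProver rules described in the config comments.
# NOT damocles-manager code; the values mirror the documented defaults.
JOB_MAX_TRY=2          # JobMaxTry
HEARTBEAT_TIMEOUT=15   # HeartbeatTimeout, in seconds ("15s")
JOB_LIFETIME=90000     # JobLifetime, in seconds ("25h0m0s" = 25 * 3600)

# wdpost_task_action TRIES SECS_SINCE_HEARTBEAT SECS_SINCE_CREATED
# Prints what would happen to a task under these rules (check order is assumed).
wdpost_task_action() {
    tries=$1
    since_heartbeat=$2
    since_created=$3

    if [ "$since_created" -gt "$JOB_LIFETIME" ]; then
        # Older than JobLifetime: the task is deleted.
        echo "delete"
    elif [ "$since_heartbeat" -gt "$HEARTBEAT_TIMEOUT" ]; then
        # Missed heartbeat: the task is marked as failed.
        if [ "$tries" -lt "$JOB_MAX_TRY" ]; then
            echo "fail-and-retry"
        else
            # Out of automatic attempts: only a manual reset re-runs it.
            echo "fail-needs-manual-reset"
        fi
    else
        echo "keep-running"
    fi
}
```

For example, `wdpost_task_action 1 20 600` prints `fail-and-retry`, while `wdpost_task_action 2 20 600` prints `fail-needs-manual-reset`, which is the situation the manual `reset` command below is meant for.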

Start the damocles-manager process:
```sh
# --miner is optional; it starts the miner module, which runs WinningPoSt and produces blocks
# --poster is required; it starts the WindowPoSt module
# --worker-prover is required; it makes WindowPoSt run through the WorkerProver module
./damocles-manager daemon run --miner --poster --worker-prover
```

### damocles-worker configuration
Configuration walkthrough:
```toml
[[sealing_thread]]
# Use the wdpost plan
plan = "wdpost"
# Only run tasks for the listed miner IDs (optional); leave unset for no restriction
# sealing.allowed_miners = [6666, 7777]
# Only run tasks for sectors of the listed sizes (optional)
# sealing.allowed_sizes = ["32GiB", "64GiB"]

[[attached]]
# Persistent storage that this worker uses while executing WindowPoSt tasks
name = "miner-6666-store"
location = "/mnt/miner-6666-store"

# Limit window_post task concurrency (optional); no limit if not set
[processors.limitation.concurrent]
window_post = 2

[[processors.window_post]]
# Use a custom WindowPoSt prover binary (optional); if `bin` is not set, the built-in prover is used
bin="~/my_algorithm"
args = ["window_post"]
# Environment variables for the custom prover (optional)
envs = { BELLMAN_GPU_INDEXS="0", CUDA_VISIBLE_DEVICES="0", ... }
# Maximum number of concurrent tasks for this processor (optional); no limit if not set
concurrent = 1
```

##### A minimal example that starts a single wdpost sealing_thread:

```toml
# /path/to/your-damocles-worker-config.toml

[worker]
name = "damocles-worker-USA-01"

[sector_manager]
rpc_client.addr = "/ip4/your-damocles-manager-address-here/tcp/1789"

[[sealing_thread]]
plan = "wdpost"
# Interval between attempts to claim a task; defaults to 60s.
# For the wdpost plan, a shorter interval lets the thread pick up new wdpost tasks sooner
sealing.recover_interval = "15s"
# sealing.allowed_miners = [6666]
# sealing.allowed_sizes = ["32GiB"]
#...

[[attached]]
name = "miner-6666-store"
location = "/mnt/miner-6666-store"
```
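Before starting the worker, it can help to confirm that the address in `rpc_client.addr` is reachable from the worker host. The sketch below converts a `/ip4/<host>/tcp/<port>` multiaddr into a host and a port; the helper name `multiaddr_to_host_port` is hypothetical, and it only handles that plain `/ip4/.../tcp/...` form (no DNS names, IPv6, or other protocols).

```sh
# Hypothetical helper: turn "/ip4/<host>/tcp/<port>" into "<host> <port>".
# Only the plain /ip4/.../tcp/... form used in this document is handled.
multiaddr_to_host_port() {
    addr=$1
    case "$addr" in
        /ip4/*/tcp/*) ;;
        *) echo "unsupported multiaddr: $addr" >&2; return 1 ;;
    esac
    # Splitting on "/" gives the fields "", "ip4", host, "tcp", port.
    echo "$addr" | awk -F/ '{ print $3, $5 }'
}
```

For example, `multiaddr_to_host_port /ip4/192.168.1.10/tcp/1789` prints `192.168.1.10 1789`, and that host and port can then be handed to a port check such as `nc -z`.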

##### An example of a wdpost machine equipped with two graphics cards
```toml
# /path/to/your-damocles-worker-config.toml

[worker]
name = "damocles-worker-USA-01"

[sector_manager]
rpc_client.addr = "/ip4/your-damocles-manager-address-here/tcp/1789"

[[sealing_thread]]
plan = "wdpost"
sealing.recover_interval = "15s"
sealing.allowed_miners = [6666]
#...

[[sealing_thread]]
plan = "wdpost"
sealing.recover_interval = "15s"
sealing.allowed_miners = [7777]
#...

[[attached]]
name = "miner-6666-store"
location = "/mnt/miner-6666-store"

[[attached]]
name = "miner-7777-store"
location = "/mnt/miner-7777-store"

# -------------------------

# Option 1: one processor entry that runs up to two tasks at once
[[processors.window_post]]
# bin="~/my_algorithm"
# args = ["window_post", ...]
envs = { ... }
concurrent = 2

# ----------- or ---------

# Option 2: one processor entry per GPU, each pinned with CUDA_VISIBLE_DEVICES
# [[processors.window_post]]
# bin="~/my_algorithm"
# args = ["window_post", ...]
# envs = { CUDA_VISIBLE_DEVICES="0", ... }
# concurrent = 1

# [[processors.window_post]]
# bin="~/my_algorithm"
# args = ["window_post"]
# envs = { CUDA_VISIBLE_DEVICES="1", ... }
# concurrent = 1
```
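The per-GPU variant extends to machines with more cards: one `[[processors.window_post]]` entry per GPU, each pinned with `CUDA_VISIBLE_DEVICES` and limited to one concurrent task. The hypothetical helper below prints those entries for a given GPU count so they can be appended to the worker config. It only emits the `envs` and `concurrent` keys shown in the example; `bin`/`args` for a custom prover, and any other environment variables, would still have to be added by hand.

```sh
# Hypothetical helper: print one [[processors.window_post]] entry per GPU,
# pinning each entry to a single card via CUDA_VISIBLE_DEVICES.
gen_wdpost_processors() {
    gpu_count=$1
    i=0
    while [ "$i" -lt "$gpu_count" ]; do
        printf '[[processors.window_post]]\n'
        printf 'envs = { CUDA_VISIBLE_DEVICES = "%s" }\n' "$i"
        printf 'concurrent = 1\n\n'
        i=$((i + 1))
    done
}
```

For example, `gen_wdpost_processors 2 >> /path/to/your-damocles-worker-config.toml` appends two entries, for GPU `0` and GPU `1`.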

A damocles-worker that runs the wdpost plan does not need its local data directories initialized with the `damocles-worker store sealing-init -l` command.


### Managing WindowPoSt tasks
- #### Show the WindowPoSt task list
```sh
# By default, unfinished and failed tasks are shown. DDL is the task's deadline index, and Try is the number of attempts made so far
./damocles-manager util worker wdpost list

JobID        MinerID  DDL  Partitions  Worker        State       Try  CreateAt        Elapsed      Heartbeat  Error
3FgfEnvrub1  1037     3    1,2         10.122.63.30  ReadyToRun  1    07-27 16:37:31  -            -
gbCVH4TUgEf  1037     2    1,2                       ReadyToRun  0    07-27 16:35:56  -            -
CrotWCLaXLa  1037     1    1,2         10.122.63.30  Succeed     1    07-27 17:19:04  6m38s(done)  -

# Show all tasks
./damocles-manager util worker wdpost list --all
#...

# Show WindowPoSt task details
./damocles-manager util worker wdpost list --detail
#...
```

- #### Reset a task
When a WindowPoSt task fails and its automatic retries have reached the limit (`JobMaxTry`), you can reset the task's state manually so that damocles-worker can pick it up and run it again.
```sh
./damocles-manager util worker wdpost reset gbCVH4TUgEf 3FgfEnvrub1
```

- #### Delete a task
Deleting a task has an effect similar to resetting it. After a task is deleted, the retry mechanism of damocles-manager checks whether a WindowPoSt task for the current deadline exists in the database; if it does not, damocles-manager creates the task again and records it in the database.

In addition, worker-prover automatically deletes tasks that were created longer ago than the configured lifetime (`JobLifetime`, 25 hours by default).
```sh
# Delete specific tasks
./damocles-manager util worker wdpost remove gbCVH4TUgEf 3FgfEnvrub1

# Delete all tasks
./damocles-manager util worker wdpost remove-all --really-do-it
```
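Both `reset` and `remove` take JobIDs, so when several tasks are involved it can be convenient to collect the IDs from the `wdpost list` output rather than copy them by hand. A minimal sketch, assuming the column layout of the sample output above: JobID is the first column and State is a single word. Because the Worker column can be empty, State is matched by value instead of by position. The state string used for failed tasks does not appear in the sample, so check your own `list` output before filtering on it; the function name `job_ids_in_state` is hypothetical.

```sh
# Hypothetical helper: read `wdpost list` output on stdin and print the JobID
# (first column) of every row that contains the given state word.
# The header row is skipped; State is matched by value because a blank
# Worker column shifts the positions of the later fields.
job_ids_in_state() {
    awk -v state="$1" 'NR > 1 {
        for (i = 2; i <= NF; i++) {
            if ($i == state) { print $1; break }
        }
    }'
}
```

For the sample output above, `./damocles-manager util worker wdpost list | job_ids_in_state ReadyToRun` would print `3FgfEnvrub1` and `gbCVH4TUgEf`. The result can be passed straight to `./damocles-manager util worker wdpost reset` or `remove`.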

## Proxy node mode
For PoSter nodes, the most important capability is obtaining real-time, accurate sector location information. The current `damocles-manager` version only provides metadata management backed by a local embedded KV database (support for more backends is planned).

@@ -166,7 +391,4 @@ There is no conflict between `winningPost` and `windowPost` due to device usage
So far, we have described the features and principles that standalone `PoSter` nodes rely on, along with simple usage examples.

However, this mode still has limitations for very large `SP` clusters:
- Unless the configuration is split, a `PoSter` node can only serve `PoSt` for a subset of miners, and it is hard to scale horizontally across machines;
- Scheduling `PoSt` jobs and resolving severe conflicts between `PoSt` windows still depend, to some extent, on manual operations.

In general, removing these limitations requires a fully state-decoupled, distributed `damocles-manager` implementation, which is one of our future focuses.
2 changes: 1 addition & 1 deletion docs/zh/09.独立运行的poster节点.md
@@ -64,7 +64,7 @@ JobMaxTry = 2
# Default value is 15s
# Tasks that have not sent a heartbeat for longer than this are marked as failed and retried
HeartbeatTimeout = "15s"
# WindowPoSt task expiration time, optional, time string type
# Default value is 25h
# WindowPoSt tasks created more than this long ago are deleted
JobLifetime = "25h0m0s"