Skip to content

Commit

Permalink
add breakpoint_recovery
Browse files Browse the repository at this point in the history
  • Loading branch information
better629 committed Dec 1, 2023
1 parent f01e0e9 commit 276171e
Show file tree
Hide file tree
Showing 2 changed files with 173 additions and 0 deletions.
Empty file.
173 changes: 173 additions & 0 deletions src/zhcn/guide/tutorials/breakpoint_recovery.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,173 @@
# 断点恢复

## 定义
断点恢复指在程序运行过程中,记录程序不同模块的产出并落盘,当程序碰到外部如`Ctrl-C`或内部执行异常如LLM Api网络异常导致退出等情况时。再次执行程序,能够从中断前的结果中恢复继续执行,而无需从0到1开始执行,降低开发者的时间和费用成本。

## 序列化与反序列化
为了能支持断点恢复操作,需要对程序中的不同模块产出进行结构化存储即序列化的过程,保存后续用于恢复操作的现场。序列化的操作根据不同模块的功能有所区分,比如角色信息,初始化后即可以进行序列化,过程中不会发生改变。记忆信息,需要执行过程中,实时进行序列化保证完整性(序列化耗时在整个程序执行中的占比很低)。

## 实现逻辑

### 可能产生中断的情况

- 网络等问题,LLM-Api调用重试多次后仍失败
- Action执行过程中,输出内容解析失败导致退出
- 人为的`Ctrl-C`对程序进行中断

### 序列化存储结构
为了清晰化**整体项目的结构信息**,采用层级化的方式进行内容的序列化存储。

当程序发生中断后,对应不同模块在存储目录下的文件结构如下:
结构概要
```bash
./workspace
storage
team
team_info.json # 团队需求、预算等信息
environment # 环境
memory.json # 环境记忆
roles.json # 团队内的角色基础信息
history.json # 历史信息
roles # 团队内角色
RoleA_Alice # 具体一个角色
memory.json # 角色记忆
role_info.json # 包括角色身份、执行动作、监听动作等信息
```

每个`xxx.json`下的为对应内容的数据概要示例。
```bash
./workspace
storage
team
team_info.json # investment and so on
{
"investment": 10.0,
"idea": "write a snake game"
}
environment
roles.json # roles' meta info
[
{
"role_class": "RoleA",
"module_name": "tests.metagpt.serialize_deserialize.test_serdeser_base",
"role_name": "RoleA"
},
{
"role_class": "RoleB",
"module_name": "tests.metagpt.serialize_deserialize.test_serdeser_base",
"role_name": "RoleB"
}
]
memory.json
{
"storage": [
],
"index": {
}
}
history.json
{
"content": "\nHuman: write a snake game\nRole A: ActionPass run passed\nHuman: write a snake game"
}
roles
ProductManager_Alice
memory.json
{
"storage": [
],
"index": {
}
}
role_info.json # RoleSetting _actions _states
{
"name": "RoleA",
"profile": "Role A",
"goal": "RoleA's goal",
"constraints": "RoleA's constraints",
"desc": "",
"is_human": false,
"recovered": true,
"builtin_class_name": "RoleA",
"_role_id": "RoleA(Role A)",
"_states": [
"0. <class 'tests.metagpt.serialize_deserialize.test_serdeser_base.ActionPass'>"
],
"_actions": [
],
"_rc": {
},
"role_class": "RoleA",
"module_name": "tests.metagpt.serialize_deserialize.test_serdeser_base"
}
```

### 恢复时的执行顺序
由于MetaGPT是异步执行框架,对于下述几种典型的中断截点和恢复顺序。

1. 角色A(一个action)-> 角色B(2个action),角色A进行action选择时出现异常退出。
2. 角色A(一个action)-> 角色B(2个action),角色B第1个action执行正常,第2个action执行时出现异常退出。
#### 情况1
执行入口重新执行后,各模块进行反序列化。角色A未观察到属于自己处理的Message,不处理。角色B恢复后,观察到一条之前未处理完毕的Message,则在`_observe`后重新执行对应的`react`操作,按react策略执行对应2个动作。

#### 情况2
执行入口重新执行后,各模块进行反序列化。角色A未观察到属于自己处理的Message,不处理。角色B恢复后,`_observe`到一条之前未完整处理完毕的Message,在`react`中,知道自己在第2个action执行失败,则直接从第2个action开始执行。


### 从中断前的Message开始重新执行
一般来说,Message是不同角色间沟通协作的桥梁,当在Message的执行过程中发生中断后,由于该Message已经被该角色存入环境(Environment)记忆(Memory)中。在进行恢复中,如果直接加载环境内的全部Memory,该角色的`_observe`将不会观察到中断时引发当时执行`Message`,从而不能恢复该Message的继续执行。
因此,为了保证该Message在恢复时能够继续执行,需要在发生中断后,从角色记忆中删除对应的该条Message。

### 从中断前的Action开始重新执行
一般来说,Action是一个相对较小的执行模块粒度,当在Action的执行过程中发生中断后,需要知道多个Actions的执行顺序以及当前执行到哪个Action。当进行恢复时,定位到中断时的Action位置并重新执行该Action。

## 结果

### 断点恢复入口
`python3 starup.py "xxx" --recover_path "./workspace/storage/team"` # 默认序列化到`./workspace/storage/team`中。

### 恢复后继续执行结果
`python3 -s tests/metagpt/serialize_deserialize/test_team.py``test_team_recover_multi_roles_save`的执行case

`RoleB``ActionRaise`模拟Action异常,执行到该Action时发生异常,序列化项目后退出。 重新启动后,`RoleA`已经执行过,不继续执行。`RoleB``ActionOK`已经执行过,不继续执行。继续从`ActionRaise`执行,仍异常。

```bash
2023-11-30 20:41:22.313 | DEBUG | metagpt.team:run:92 - n_round=3
2023-11-30 20:41:22.314 | DEBUG | metagpt.roles.role:_observe:389 - RoleA(Role A) observed: ['Human: write a snake game...']
2023-11-30 20:41:22.314 | DEBUG | metagpt.roles.role:_set_state:316 - [ActionPass]
2023-11-30 20:41:22.314 | DEBUG | metagpt.roles.role:_react:412 - RoleA(Role A): self._rc.state=0, will do ActionPass
2023-11-30 20:41:22.314 | INFO | metagpt.roles.role:_act:361 - RoleA(Role A): ready to ActionPass
2023-11-30 20:41:22.315 | DEBUG | metagpt.roles.role:run:472 - RoleB(Role B): no news. waiting.
2023-11-30 20:41:27.322 | DEBUG | metagpt.roles.role:_set_state:316 - [ActionPass]
2023-11-30 20:41:27.322 | DEBUG | metagpt.team:run:92 - n_round=2
2023-11-30 20:41:27.323 | DEBUG | metagpt.roles.role:run:472 - RoleA(Role A): no news. waiting.
2023-11-30 20:41:27.324 | DEBUG | metagpt.roles.role:_observe:389 - RoleB(Role B) observed: ['Role A: ActionPass run passe...']
2023-11-30 20:41:27.325 | DEBUG | metagpt.roles.role:_set_state:316 - [ActionOK, ActionRaise]
2023-11-30 20:41:27.325 | INFO | metagpt.roles.role:_act:361 - RoleB(Role B): ready to ActionOK
2023-11-30 20:41:32.327 | DEBUG | metagpt.roles.role:_set_state:316 - [ActionOK, ActionRaise]
2023-11-30 20:41:32.328 | INFO | metagpt.roles.role:_act:361 - RoleB(Role B): ready to ActionRaise
2023-11-30 20:41:32.329 | WARNING | metagpt.utils.utils:wrapper:82 - There is a exception in role's execution, in order to resume, we delete the newest role communication message in the role's memory.
2023-11-30 20:41:32.331 | ERROR | metagpt.utils.utils:wrapper:61 - Exception occurs, start to serialize the project, exp:
Traceback (most recent call last):
...
File "/Users/xxxx/work/code/MetaGPT/metagpt/roles/role.py", line 362, in _act
response = await self._rc.todo.run(self._rc.important_memory)
File "/Users/xxxx/work/code/MetaGPT/tests/metagpt/serialize_deserialize/test_serdeser_base.py", line 50, in run
raise RuntimeError("parse error in ActionRaise")
RuntimeError: parse error in ActionRaise

############################# --------- 此处开始重新执行 ----------- ############################
2023-11-30 20:41:32.351 | DEBUG | metagpt.team:run:92 - n_round=3
2023-11-30 20:41:32.351 | DEBUG | metagpt.roles.role:run:472 - RoleA(Role A): no news. waiting.
2023-11-30 20:41:32.352 | DEBUG | metagpt.roles.role:_observe:389 - RoleB(Role B) observed: ['Role A: ActionPass run passe...']
2023-11-30 20:41:32.352 | DEBUG | metagpt.roles.role:_set_state:316 - [ActionOK, ActionRaise]
2023-11-30 20:41:32.352 | INFO | metagpt.roles.role:_act:361 - RoleB(Role B): ready to ActionRaise
2023-11-30 20:41:32.353 | WARNING | metagpt.utils.utils:wrapper:82 - There is a exception in role's execution, in order to resume, we delete the newest role communication message in the role's memory.
2023-11-30 20:41:32.353 | ERROR | metagpt.utils.utils:wrapper:61 - Exception occurs, start to serialize the project, exp:
Traceback (most recent call last):
...
File "/Users/xxxx/work/code/MetaGPT/metagpt/roles/role.py", line 362, in _act
response = await self._rc.todo.run(self._rc.important_memory)
File "/Users/xxxx/work/code/MetaGPT/tests/metagpt/serialize_deserialize/test_serdeser_base.py", line 50, in run
raise RuntimeError("parse error in ActionRaise")
RuntimeError: parse error in ActionRaise
```

0 comments on commit 276171e

Please sign in to comment.