-
Notifications
You must be signed in to change notification settings - Fork 685
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clear empty thread when graph destroy #7633
Changes from 5 commits
5feb16b
7e3f93a
458e678
2116953
de27358
08b4104
87cc7f0
9767719
1b7224b
1e9e7a1
f1b6d1a
545827c
57ea016
7f2afb2
26f5417
415ec63
803d55c
190ab62
ac22e06
2a3742e
207afe3
0f895c9
8216eac
63cb89f
3a169da
ec52eba
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -135,13 +135,22 @@ Maybe<void> MultiClientSessionContext::TryClose() { | |
if (is_inited_) { | ||
VLOG(2) << "Try to delete multi client session context." << std::endl; | ||
|
||
for (auto wk_graph_ptr : graphs_) { | ||
if (auto sh_graph_ptr = wk_graph_ptr.lock()) { | ||
VLOG(2) << "grap name " << sh_graph_ptr->job_name() << " not closed, try to close it."; | ||
JUST(sh_graph_ptr->Close()); | ||
} | ||
} | ||
|
||
/* | ||
// sync before NNGraph release to ensure LaunchLazyJob instruction was completed and released | ||
JUST(vm::ClusterSync()); | ||
for (const auto& graph : graphs_) { | ||
VLOG(2) << "Try to close graph: " << graph->job_name() << std::endl; | ||
JUST(graph->Close()); | ||
} | ||
graphs_.clear(); | ||
*/ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 这段删除掉? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 好的,当时调试用的,忘了清了 |
||
{ | ||
// NOTE(chengcheng): delete runtime global objects | ||
Global<boxing::collective::Scheduler>::Delete(); | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -51,7 +51,7 @@ class MultiClientSessionContext { | |
bool is_inited_; | ||
HashMap<std::string, std::vector<std::pair<std::string, std::shared_ptr<one::Tensor>>>> | ||
graph_name2free_eager_tensors_; | ||
std::vector<std::shared_ptr<NNGraph>> graphs_; | ||
std::vector<std::weak_ptr<NNGraph>> graphs_; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 这行代码的修改还需要在增加一些保护措施。因为现在 Graph 有可能不是在 main 线程析构 ,而是在指令释放的 Callback 线程中,因此 可能会发生 Graph 析构 顺序 与 Session 析构顺序不严格有序的情况(Session 早于 Graph 析构,就会报错) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A→B表示A持有B的shared_ptr,以体现资源依赖关系;A ⇢ B表示A持有B的weak_ptr,以体现释放时的同步等待关系。
py env 、py session、py graph在python层通过python的引用计数,保证了python对象的正确析构顺序。 c env 对 c session、c session对 c graph、c graph对vm那边则资源寻求bc,在析构时会做一遍检查,发现依赖自己的对象还没析构,就用weak ptr去同步等待依赖它的对象完成释放。 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 有明确的办法,可以看我最新的 commit 2a3742e |
||
}; | ||
|
||
} // namespace oneflow | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,16 +17,13 @@ limitations under the License. | |
#include "oneflow/core/job/resource_desc.h" | ||
#include "oneflow/core/job/global_for.h" | ||
#include "oneflow/core/control/global_process_ctx.h" | ||
#include "oneflow/core/job/global_for.h" | ||
|
||
namespace oneflow { | ||
|
||
ThreadMgr::~ThreadMgr() { | ||
for (auto& thread_pair : threads_) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. check(thread_pair.size() == 0) ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 是,直接 check 就行。原本想的是报错时把 thread 的信息打出来。但 FATAL 也没法遍历所有的 thread 了 |
||
ActorMsg msg = ActorMsg::BuildCommandMsg(-1, ActorCmd::kStopThread); | ||
thread_pair.second->GetMsgChannelPtr()->Send(msg); | ||
thread_pair.second.reset(); | ||
LOG(INFO) << "actor thread " << thread_pair.first << " finish"; | ||
LOG(FATAL) << " Runtime Error! thread id " << thread_pair.first | ||
<< " not delete with graph, and it's empty is " << thread_pair.second->Empty(); | ||
} | ||
} | ||
|
||
|
@@ -36,20 +33,42 @@ Thread* ThreadMgr::GetThrd(int64_t thrd_id) { | |
return iter->second.get(); | ||
} | ||
|
||
void ThreadMgr::AddPlan(const Plan& plan) { | ||
void ThreadMgr::AddThreads(const HashSet<int64_t>& thread_ids) { | ||
const int64_t this_rank = GlobalProcessCtx::Rank(); | ||
for (const TaskProto& task : plan.task()) { | ||
TaskId task_id = DecodeTaskIdFromInt64(task.task_id()); | ||
StreamId stream_id = task_id.stream_id(); | ||
for (int64_t thrd_id : thread_ids) { | ||
const auto& it = threads_.find(thrd_id); | ||
if (it != threads_.end()) { | ||
// NOTE(chengcheng): check thread is not null. | ||
CHECK(it->second); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 可以增加一点注释,现在遇到check后面没有日志的信息,比较难定位,需要再加上日志; There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 好的 |
||
continue; | ||
} | ||
StreamId stream_id = DecodeStreamIdFromInt64(thrd_id); | ||
if (stream_id.rank() != this_rank) { continue; } | ||
int64_t thrd_id = EncodeStreamIdToInt64(stream_id); | ||
if (threads_.find(thrd_id) != threads_.end()) { continue; } | ||
Thread* thread = new Thread(stream_id); | ||
CHECK_NOTNULL(thread); | ||
threads_[thrd_id].reset(thread); | ||
} | ||
} | ||
|
||
void ThreadMgr::TryDeleteThreads(const HashSet<int64_t>& thread_ids) { | ||
for (int64_t thrd_id : thread_ids) { | ||
const auto& it = threads_.find(thrd_id); | ||
if (it == threads_.end()) { continue; } | ||
auto& thread = it->second; | ||
CHECK(thread) << " actor thread " << thrd_id << " non-existent but want to delete"; | ||
if (thread->Empty()) { | ||
// NOTE(chengcheng): Only delete thread when it is empty. | ||
ActorMsg msg = ActorMsg::BuildCommandMsg(-1, ActorCmd::kStopThread); | ||
thread->GetMsgChannelPtr()->Send(msg); | ||
thread.reset(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 不需要等线程结束吗? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 这里的逻辑跟之前 thread 下线的逻辑一致。 我的理解是这样的: 57行,发送 msg 会让 thread 的 actor poll msg channel 线程跳出循环(异步的)。 58 行,执行 thread 的析构函数,thread 析构时调用 actor_thread_.join(), 这里会同步等待线程结束吧。 所以这里还是会等到线程结束才会销毁 thread There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 好的。还是应该在这一行之前注释一下析构函数里的语义,减少读者的惊讶。 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 已加注释 |
||
LOG(INFO) << " actor thread " << thrd_id << " finish."; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 感觉VLOG(2)比较好 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 可以 |
||
threads_.erase(it); | ||
} else { | ||
LOG(INFO) << " actor thread " << thrd_id << " not delete because it is not empty."; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 感觉LOG(Warning)比较好? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 可以 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 想了一下,warning 可能不太好。因为这个可能是正常现象。比如 train eval,其中一个 graph |
||
} | ||
} | ||
} | ||
|
||
void SingleThreadLoop(size_t num, std::function<void(size_t i)> Callback) { | ||
FOR_RANGE(size_t, i, 0, num) { Callback(i); } | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const auto&
不要养成用auto的习惯啊。基本上有三种方式替代:const auto&, auto*, auto&&,for循环的迭代变量还可以用auto&。所有的智能指针应该用const auto&
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
已改,之前是回滚的啸宇的代码,以后都注意使用 const &