Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand #7702

Merged
merged 96 commits into from
Apr 8, 2022
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
96 commits
Select commit Hold shift + click to select a range
b8f7f89
run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand
lixinqi Mar 6, 2022
af55be9
Merge branch 'master' into fix_barrier_instruction
lixinqi Mar 6, 2022
1b3c3df
Merge branch 'master' into fix_barrier_instruction
lixinqi Mar 7, 2022
516c691
Merge branch 'master' into fix_barrier_instruction
lixinqi Mar 7, 2022
9095397
lock gil in vm Callback thread
lixinqi Mar 7, 2022
c3846cd
Merge branch 'fix_barrier_instruction' of github.com:Oneflow-Inc/onef…
lixinqi Mar 7, 2022
969ba8c
more comments for VirtualMachineEngine::Callback()
lixinqi Mar 8, 2022
d406a30
Merge branch 'master' into fix_barrier_instruction
Flowingsun007 Mar 8, 2022
c7c6c67
Merge branch 'master' into fix_barrier_instruction
Flowingsun007 Mar 8, 2022
65db196
Merge branch 'master' into fix_barrier_instruction
Flowingsun007 Mar 8, 2022
fca7b8d
Merge branch 'master' into fix_barrier_instruction
Flowingsun007 Mar 9, 2022
b049dfc
Merge branch 'master' into fix_barrier_instruction
Flowingsun007 Mar 9, 2022
7ed5d66
Merge branch 'master' into fix_barrier_instruction
Flowingsun007 Mar 10, 2022
29ccd30
the Env is never destroyed.
lixinqi Mar 13, 2022
8509484
export Env into python
lixinqi Mar 14, 2022
434b7af
more unittests
lixinqi Mar 14, 2022
3925eb4
wait shared_ptr.use_count() == 0
lixinqi Mar 16, 2022
86296cb
export unittest.TestCase in framework/unittest.py
lixinqi Mar 16, 2022
454f5e7
SwitchToShuttingDownPhase
lixinqi Mar 16, 2022
d1d9ad7
optional is_normal_exit
lixinqi Mar 16, 2022
a58348d
VirtualMachine::CloseVMThreads
lixinqi Mar 17, 2022
8bb83a1
merge master
lixinqi Mar 18, 2022
fe64379
Delete env_api.h
lixinqi Mar 18, 2022
9ca83c6
reshape_only_one_dim_infered
lixinqi Mar 21, 2022
9b5cecc
Merge branch 'master' into export_env_to_python
lixinqi Mar 22, 2022
f7f1fdf
merge branch export_env_to_python
lixinqi Mar 22, 2022
72426fc
address pr comments
lixinqi Mar 22, 2022
3be6f81
fix a ref-cnt bug in TryRunBarrierInstruction.
chengtbf Mar 23, 2022
9f10675
Merge branch 'export_env_to_python' into fix_barrier_instruction
lixinqi Mar 23, 2022
da8b44d
Merge branch 'master' of github.com:Oneflow-Inc/oneflow into export_e…
chengtbf Mar 23, 2022
c52bc90
Merge branch 'export_env_to_python' of github.com:Oneflow-Inc/oneflow…
chengtbf Mar 23, 2022
cf14a1e
rollback flow.env.all_device_placement
chengtbf Mar 23, 2022
e3297f4
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 23, 2022
6f5d7c6
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 23, 2022
5d0c648
Merge branch 'master' into export_env_to_python
lixinqi Mar 24, 2022
97cf982
no distributed running test_shutting_down.py
lixinqi Mar 24, 2022
a119bf0
auto format by CI
oneflow-ci-bot Mar 24, 2022
d2b36a4
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 24, 2022
d8862b9
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 24, 2022
6e22eb4
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 25, 2022
e58bee9
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Mar 25, 2022
df91894
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 25, 2022
93191d7
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 25, 2022
da35647
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 25, 2022
a3d45e3
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 26, 2022
1ec6dc3
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 26, 2022
3d6c7cd
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 26, 2022
f7e2241
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 26, 2022
c1afee6
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 26, 2022
7c4cb89
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 26, 2022
7daab5a
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 27, 2022
07bb2a2
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 27, 2022
47738f7
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 27, 2022
87c748f
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 27, 2022
9d5ab85
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Mar 28, 2022
80a4542
Merge branch 'master' into export_env_to_python
strint Mar 28, 2022
57fff70
Merge branch 'master' into export_env_to_python
lixinqi Mar 28, 2022
ce55514
Merge branch 'export_env_to_python' of github.com:Oneflow-Inc/oneflow…
lixinqi Mar 28, 2022
c679884
expand lifetime of module oneflow in test_shutting_down.py
lixinqi Mar 28, 2022
53d088a
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 28, 2022
c6ecc12
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 28, 2022
204aa54
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 28, 2022
3124ecf
merge master
lixinqi Mar 29, 2022
8c6c03e
Merge branch 'export_env_to_python' of github.com:Oneflow-Inc/oneflow…
lixinqi Mar 29, 2022
72c30fd
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 29, 2022
5d6212f
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 29, 2022
19c18d4
refine del depend on of
strint Mar 29, 2022
002bacf
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 29, 2022
5cfcadd
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 29, 2022
aef2edc
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 29, 2022
b2d87f2
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 30, 2022
f00f991
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 30, 2022
ae9d601
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 30, 2022
89f3ce2
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 30, 2022
ad18575
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 30, 2022
b67a60e
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 30, 2022
fb8b9fa
Merge branch 'master' into export_env_to_python
lixinqi Mar 31, 2022
ec2c402
Merge branch 'export_env_to_python' of github.com:Oneflow-Inc/oneflow…
lixinqi Mar 31, 2022
45fd613
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 31, 2022
b730ef0
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 31, 2022
fbd921c
Merge branch 'master' into export_env_to_python
mergify[bot] Mar 31, 2022
41c90aa
Merge branch 'master' into export_env_to_python
lixinqi Apr 1, 2022
31538de
Merge branch 'export_env_to_python' into fix_barrier_instruction
lixinqi Apr 3, 2022
5fdba64
merge master
lixinqi Apr 3, 2022
1f91eaa
Merge branch 'master' into fix_barrier_instruction
strint Apr 5, 2022
6996258
Merge branch 'master' into fix_barrier_instruction
lixinqi Apr 6, 2022
f6c0b99
capture oneflow._oneflow_internal.eager when calling sync in __del__
lixinqi Apr 6, 2022
0fbfc80
Merge branch 'fix_barrier_instruction' of github.com:Oneflow-Inc/onef…
lixinqi Apr 6, 2022
22f220b
Merge branch 'master' into fix_barrier_instruction
mergify[bot] Apr 6, 2022
bcc1d8d
Merge branch 'master' into fix_barrier_instruction
mergify[bot] Apr 6, 2022
aec7634
Merge branch 'master' into fix_barrier_instruction
mergify[bot] Apr 6, 2022
c1010a1
Merge branch 'master' into fix_barrier_instruction
mergify[bot] Apr 7, 2022
44196a8
Merge branch 'master' into fix_barrier_instruction
mergify[bot] Apr 7, 2022
255c874
Merge branch 'master' into fix_barrier_instruction
chengtbf Apr 8, 2022
cd80f0c
Merge branch 'master' into fix_barrier_instruction
mergify[bot] Apr 8, 2022
34af282
add try in flaky test
strint Apr 8, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions oneflow/core/framework/instructions_builder.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ limitations under the License.
#include "oneflow/core/common/decorator.h"
#include "oneflow/core/common/blocking_counter.h"
#include "oneflow/core/rpc/include/global_process_ctx.h"
#include "oneflow/core/vm/no_arg_cb_phy_instr_operand.h"
#include "oneflow/core/vm/barrier_phy_instr_operand.h"
#include "oneflow/core/vm/access_blob_arg_cb_phy_instr_operand.h"
#include "oneflow/core/vm/consume_local_dep_object_phy_instr_operand.h"
#include "oneflow/core/eager/release_tensor_arg_phy_instr_operand.h"
Expand Down Expand Up @@ -576,7 +576,7 @@ template Maybe<void> InstructionsBuilder::AccessBlobByCallback(

Maybe<void> InstructionsBuilder::ComputeRankFrontSeqCallback(
const std::function<void()>& callback) {
const auto& phy_instr_operand = std::make_shared<vm::NoArgCbPhyInstrOperand>(callback);
const auto& phy_instr_operand = std::make_shared<vm::BarrierPhyInstrOperand>(callback);
auto instruction = intrusive::make_shared<vm::InstructionMsg>(
Global<VirtualMachine>::Get()->mut_vm(), "ComputeRankFrontSeqCallback",
std::shared_ptr<const ParallelDesc>(), phy_instr_operand);
Expand All @@ -585,7 +585,7 @@ Maybe<void> InstructionsBuilder::ComputeRankFrontSeqCallback(
}

Maybe<void> InstructionsBuilder::ComputeGlobalFrontSeqBarrier() {
const auto& phy_instr_operand = std::make_shared<vm::NoArgCbPhyInstrOperand>([] {});
const auto& phy_instr_operand = std::make_shared<vm::BarrierPhyInstrOperand>([] {});
auto instruction = intrusive::make_shared<vm::InstructionMsg>(
Global<VirtualMachine>::Get()->mut_vm(), "ComputeGlobalFrontSeqBarrier",
std::shared_ptr<const ParallelDesc>(), phy_instr_operand);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -13,8 +13,8 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
#ifndef ONEFLOW_CORE_VM_NO_ARG_CB_PHY_INSTR_OPERAND_H_
#define ONEFLOW_CORE_VM_NO_ARG_CB_PHY_INSTR_OPERAND_H_
#ifndef ONEFLOW_CORE_VM_BARRIER_PHY_INSTR_OPERAND_H_
#define ONEFLOW_CORE_VM_BARRIER_PHY_INSTR_OPERAND_H_

#include <functional>
#include "oneflow/core/vm/phy_instr_operand.h"
Expand All @@ -23,14 +23,16 @@ namespace oneflow {
namespace vm {

// no arg callback physical instruction operand
class NoArgCbPhyInstrOperand : public PhyInstrOperand {
class BarrierPhyInstrOperand : public PhyInstrOperand {
public:
NoArgCbPhyInstrOperand(const std::function<void()>& callback) : callback_(callback) {
BarrierPhyInstrOperand(const std::function<void()>& callback) : callback_(callback) {
stream_sequential_dependence_ = nullptr;
}
~NoArgCbPhyInstrOperand() = default;

const std::function<void()>& callback() const { return callback_; }
~BarrierPhyInstrOperand() {
// Make sure barrier callbacks run after all objects of previous instructions are destructed in
// Callback thread.
callback_();
}

const DependenceVector& input_dependences() const override {
static DependenceVector dependences{};
Expand All @@ -48,4 +50,4 @@ class NoArgCbPhyInstrOperand : public PhyInstrOperand {
} // namespace vm
} // namespace oneflow

#endif // ONEFLOW_CORE_VM_NO_ARG_CB_PHY_INSTR_OPERAND_H_
#endif // ONEFLOW_CORE_VM_BARRIER_PHY_INSTR_OPERAND_H_
4 changes: 3 additions & 1 deletion oneflow/core/vm/instruction.h
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,8 @@ class InstructionMsg final : public intrusive::Base {

intrusive::shared_ptr<InstructionMsg> Clone() const;

intrusive::Ref::RefCntType ref_cnt() const { return intrusive_ref_.ref_cnt(); }

private:
friend class intrusive::Ref;
intrusive::Ref* mut_intrusive_ref() { return &intrusive_ref_; }
Expand Down Expand Up @@ -184,7 +186,7 @@ class Instruction final : public intrusive::Base {
Stream* mut_stream() { return stream_; }
InstructionMsg* mut_instr_msg() {
if (unlikely(!instr_msg_)) { instr_msg_ = intrusive::make_shared<InstructionMsg>(); }
return instr_msg_.Mutable();
return CHECK_NOTNULL(instr_msg_.Mutable());
}
void reset_instr_msg(InstructionMsg* instr_msg) { instr_msg_.Reset(instr_msg); }
void clear_instr_msg() { instr_msg_.Reset(); }
Expand Down
15 changes: 4 additions & 11 deletions oneflow/core/vm/sequential_instruction_type.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ limitations under the License.
#include "oneflow/core/vm/instruction_type.h"
#include "oneflow/core/vm/instruction.h"
#include "oneflow/core/vm/virtual_machine_engine.h"
#include "oneflow/core/vm/no_arg_cb_phy_instr_operand.h"
#include "oneflow/core/vm/barrier_phy_instr_operand.h"
#include "oneflow/core/control/global_process_ctx.h"

namespace oneflow {
Expand All @@ -34,13 +34,6 @@ class RankFrontSeqCallbackInstructionType : public InstructionType {
bool IsFrontSequential() const override { return true; }

protected:
void Run(const InstructionMsg& instr_msg) const {
const auto& phy_instr_operand = instr_msg.phy_instr_operand();
CHECK(static_cast<bool>(phy_instr_operand));
const auto* ptr = dynamic_cast<const NoArgCbPhyInstrOperand*>(phy_instr_operand.get());
CHECK_NOTNULL(ptr);
ptr->callback()();
}
};

class ComputeRankFrontSeqCallbackInstructionType final
Expand All @@ -51,8 +44,8 @@ class ComputeRankFrontSeqCallbackInstructionType final

using stream_type = ControlStreamType;

void Compute(Instruction* instruction) const override { Run(instruction->instr_msg()); }
void ComputeInFuseMode(InstructionMsg* instr_msg) const override { Run(*instr_msg); }
void Compute(Instruction* instruction) const override {}
void ComputeInFuseMode(InstructionMsg* instr_msg) const override {}
};
COMMAND(RegisterInstructionType<ComputeRankFrontSeqCallbackInstructionType>(
"ComputeRankFrontSeqCallback"));
Expand All @@ -65,7 +58,7 @@ class CtrlComputeRankFrontSeqCallbackInstructionType final

using stream_type = ControlStreamType;

void Compute(Instruction* instruction) const override { Run(instruction->instr_msg()); }
void Compute(Instruction* instruction) const override {}
};
COMMAND(RegisterInstructionType<CtrlComputeRankFrontSeqCallbackInstructionType>(
"CtrlComputeRankFrontSeqCallback"));
Expand Down
2 changes: 1 addition & 1 deletion oneflow/core/vm/stream.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,6 @@ void Stream::MoveToFreeList(intrusive::shared_ptr<Instruction>&& instruction) {
CHECK_EQ(instruction->ref_cnt(), 1);
auto* instruction_ptr = instruction.Mutable();
mut_free_instruction_list()->EmplaceBack(std::move(instruction));
instruction_ptr->Delete();
}

void Stream::MoveFromZombieListToFreeList() {
Expand All @@ -85,6 +84,7 @@ void Stream::DeleteInstruction(intrusive::shared_ptr<Instruction>&& instruction)
CHECK(instruction->instruction_hook().empty());
CHECK(instruction->pending_instruction_hook().empty());
CHECK(instruction->dispatched_instruction_hook().empty());
instruction->Delete();
// the value of instruction->ref_cnt() may be updated by a worker thread
size_t ref_cnt = instruction->ref_cnt();
if (ref_cnt == 1) {
Expand Down
4 changes: 2 additions & 2 deletions oneflow/core/vm/virtual_machine.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ limitations under the License.
#include "oneflow/core/vm/virtual_machine.h"
#include "oneflow/core/vm/instruction.h"
#include "oneflow/core/vm/instruction_type.h"
#include "oneflow/core/vm/no_arg_cb_phy_instr_operand.h"
#include "oneflow/core/vm/barrier_phy_instr_operand.h"
#include "oneflow/core/vm/vm_util.h"
#include "oneflow/core/common/blocking_counter.h"
#include "oneflow/core/common/multi_client.h"
Expand Down Expand Up @@ -142,7 +142,7 @@ namespace {

void MakeCtrlSeqInstructions(vm::VirtualMachineEngine* vm, vm::InstructionMsgList* list,
const std::function<void()>& ComputeCallback) {
const auto& phy_instr_operand = std::make_shared<vm::NoArgCbPhyInstrOperand>(ComputeCallback);
const auto& phy_instr_operand = std::make_shared<vm::BarrierPhyInstrOperand>(ComputeCallback);
auto instruction = intrusive::make_shared<vm::InstructionMsg>(
vm, "CtrlComputeRankFrontSeqCallback", std::shared_ptr<const ParallelDesc>(),
phy_instr_operand);
Expand Down
29 changes: 24 additions & 5 deletions oneflow/core/vm/virtual_machine_engine.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,8 @@ limitations under the License.
#include "oneflow/core/profiler/profiler.h"
#include "oneflow/core/common/cpp_attribute.h"
#include "oneflow/core/common/global.h"
#include "oneflow/core/common/foreign_lock_helper.h"
#include <typeinfo>

namespace oneflow {
namespace vm {
Expand Down Expand Up @@ -189,19 +191,21 @@ void VirtualMachineEngine::ReleaseFinishedInstructions() {
// in stream->DeleteInstruction(...)
intrusive::shared_ptr<InstructionMsg> instr_msg(instruction_ptr->mut_instr_msg());
stream->DeleteInstruction(LivelyInstructionListErase(instruction_ptr));
MoveInstructionMsgToGarbageMsgList(std::move(instr_msg));
static constexpr int kFlushWindowSize = 32;
MoveInstructionMsgToGarbageMsgList(kFlushWindowSize, std::move(instr_msg));
}
if (stream->running_instruction_list().empty()) { mut_active_stream_list()->Erase(stream); }
}
}

void VirtualMachineEngine::MoveInstructionMsgToGarbageMsgList(
intrusive::shared_ptr<InstructionMsg>&& instr_msg) {
int flush_window_size, intrusive::shared_ptr<InstructionMsg>&& instr_msg) {
local_garbage_msg_list_.EmplaceBack(std::move(instr_msg));
static constexpr int kWindowSize = 32;
// local_garbage_msg_list_ is the cache of garbage_msg_list_.
// `kWindowSize` controls the frequency of the usage of mutexed list.
if (unlikely(local_garbage_msg_list_.size() > kWindowSize)) { MoveToGarbageMsgListAndNotifyGC(); }
if (unlikely(local_garbage_msg_list_.size() > flush_window_size)) {
MoveToGarbageMsgListAndNotifyGC();
}
}

void VirtualMachineEngine::MoveToGarbageMsgListAndNotifyGC() {
Expand Down Expand Up @@ -495,7 +499,11 @@ void VirtualMachineEngine::TryRunBarrierInstruction() {
CHECK(OnSchedulerThread(stream_type));
stream_type.Run(sequnential_instruction);
mut_barrier_instruction_list()->Erase(sequnential_instruction);
intrusive::shared_ptr<InstructionMsg> instr_msg(sequnential_instruction->mut_instr_msg());
LivelyInstructionListErase(sequnential_instruction);
sequnential_instruction->clear_instr_msg();
constexpr int kZeroWindowSize = 0; // flush immediately.
MoveInstructionMsgToGarbageMsgList(kZeroWindowSize, std::move(instr_msg));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里支持了在sync vm时,立即发送指令的GC

OF_PROFILER_RANGE_POP();
}

Expand Down Expand Up @@ -534,7 +542,18 @@ void VirtualMachineEngine::Schedule() {
void VirtualMachineEngine::Callback() {
InstructionMsgList garbage_msg_list;
mut_garbage_msg_list()->MoveTo(&garbage_msg_list);
// destruct garbage_msg_list.
INTRUSIVE_FOR_EACH(garbage, &garbage_msg_list) {
CHECK_JUST(Global<ForeignLockHelper>::Get()->WithScopedAcquire([&, this]() -> Maybe<void> {
garbage_msg_list.Erase(garbage.Mutable());
while (garbage->ref_cnt() > 1) {
// Do nothing. wait until all other threads ref_cnts released.
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call back线程busy wait其它线程释放指令,以确认该指令在其它所有线程使用完成了

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main线程的sync会等待这个CallBack执行完成?

Copy link
Contributor Author

@lixinqi lixinqi Mar 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个部分是需要main线程代码自行用bc来处理。
https://github.com/Oneflow-Inc/oneflow/blob/master/oneflow/core/vm/vm_util.cpp#L44

Copy link
Contributor

@strint strint Mar 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里之前不太熟悉,尝试整理一下。

1、当调用Sync时,被视为是BarrierInstruction

Maybe<void> ClusterSync() {
  auto bc = std::make_shared<BlockingCounter>(1);
  JUST(PhysicalRun([bc](InstructionsBuilder* builder) -> Maybe<void> {
    JUST(builder->ComputeGlobalFrontSeqBarrier());  // 是FrontSeq的指令,会被TryRunBarrierInstruction执行
    JUST(builder->ComputeRankFrontSeqCallback([bc]() { bc->Decrease(); }));  // 指令callback执行时,就bc减1
    return Maybe<void>::Ok();
  }));
  JUST(bc->WaitUntilCntEqualZero(VirtualMachine::GetPredicatorNoMoreInstructionsFinished())); // 主线程等待bc为0,效果是阻塞主线程等待BarrierInstruction的callback执行
  return Maybe<void>::Ok();
}

2、TryRunBarrierInstruction时,立即做指令的gc,触发BarrierInstruction及之前的所有指令的callback的执行

  constexpr int kZeroWindowSize = 0;  // flush immediately.
  MoveInstructionMsgToGarbageMsgList(kZeroWindowSize, std::move(instr_msg));

3、callback线程执行gc得到的所有指令的callback

void VirtualMachine::CallbackLoop(const std::function<void()>& Initializer) {
  Initializer();
  auto* vm = mut_vm();
  while (callback_notifier_.WaitAndClearNotifiedCnt() == kNotifierStatusSuccess) { vm->Callback(); }
}

执行有两个特点:

  • a、顺序执行,保证BarrierInstruction的callback在最后被执行;
  • b、每个被gc的指令都busy wait该指令在其它线程都被释放了,证明指令在其它线程已经执行完了;

这样gc中的callback执行完,代表BarrierInstruction及之前的指令都执行完了,bc置0,main线程继续。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里的 busy wait 会有性能问题吗? cpu 占用率过高等问题

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里的 busy wait 会有性能问题吗? cpu 占用率过高等问题

我更新了第548行的注释。这里只是处理一个非常罕见的情况,那个罕见情况会导致对象不是在callback线程里释放,这违背我们的本意。
既然上层决定交给garbage list了,那一定说明就是主动要求在callback线程里释放东西。那个罕见的间隙时间会非常短。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

好的

CHECK_NOTNULL(garbage->phy_instr_operand());
CHECK_EQ(garbage->phy_instr_operand().use_count(), 1) << garbage->DebugName();
// Destruct garbage.
return Maybe<void>::Ok();
}));
}
}

void VirtualMachineEngine::NotifyCallback() { MoveToGarbageMsgListAndNotifyGC(); }
Expand Down
3 changes: 2 additions & 1 deletion oneflow/core/vm/virtual_machine_engine.h
Original file line number Diff line number Diff line change
Expand Up @@ -129,7 +129,8 @@ class VirtualMachineEngine final : public intrusive::Base {
ReadyInstructionList* mut_ready_instruction_list() { return &ready_instruction_list_; }

void ReleaseFinishedInstructions();
void MoveInstructionMsgToGarbageMsgList(intrusive::shared_ptr<InstructionMsg>&& instr_msg);
void MoveInstructionMsgToGarbageMsgList(int flush_window_size,
intrusive::shared_ptr<InstructionMsg>&& instr_msg);
void MoveToGarbageMsgListAndNotifyGC();
void HandleLocalPending();
void GetRewritedPendingInstructionsByWindowSize(size_t window_size,
Expand Down