Happens on the trainer, after the training has run for a while.
In my setup I changed the dist fit_a_line test to run for 1000 passes; the error happens frequently (2 out of 3 tries).
Commands:
GLOG_logtostderr=1 GLOG_v=3 PSERVERS=172.17.0.5:6174 SERVER_ENDPOINT=172.17.0.5:6174 TRAINING_ROLE=PSERVER python notest_dist_fit_a_line.py
GLOG_logtostderr=1 GLOG_v=3 PSERVERS=172.17.0.5:6174 SERVER_ENDPOINT=172.17.0.5:6174 TRAINING_ROLE=TRAINER python notest_dist_fit_a_line.py
GLOG_logtostderr=1 GLOG_v=0 PSERVERS=172.17.0.5:6174 SERVER_ENDPOINT=172.17.0.5:6174 TRAINING_ROLE=TRAINER python notest_dist_fit_a_line.py
notest_dist_fit_a_line.py is taken from here
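For reference, all three commands run the same script; only the environment variables differ. The script branches on TRAINING_ROLE, with the pserver blocking on SERVER_ENDPOINT and the trainer sending gradients to PSERVERS after each mini-batch. Below is a minimal sketch of that role-selection pattern, with placeholder functions standing in for the real fluid program (this is not the actual test script):

```python
import os

# Environment supplied on the command line (see the commands above).
pserver_endpoints = os.getenv("PSERVERS", "127.0.0.1:6174")
current_endpoint = os.getenv("SERVER_ENDPOINT", "127.0.0.1:6174")
training_role = os.getenv("TRAINING_ROLE", "TRAINER")

NUM_PASSES = 1000  # the report above bumps the pass count to 1000


def run_pserver(endpoint):
    # Placeholder: the real script builds the parameter-server program and
    # blocks here, listening on `endpoint` for gradients from trainers.
    print("pserver listening on %s" % endpoint)


def train_one_pass(endpoints):
    # Placeholder: the real script runs forward/backward over the data and
    # sends gradients to `endpoints` after each mini-batch.
    print("trainer sending gradients to %s" % endpoints)


if training_role == "PSERVER":
    run_pserver(current_endpoint)
else:
    for pass_id in range(NUM_PASSES):
        train_one_pass(pserver_endpoints)
```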
I0119 21:26:14.525514 16639 send_op.cc:44] sending fc_0.w_0@GRAD
I0119 21:26:14.525590 16639 send_op.cc:44] sending fc_0.b_0@GRAD
E0119 21:26:14.529606 16639 grpc_client.cc:119] proc param error:name:[fc_0.w_0@GRAD] ep:[172.17.0.5:6174] grpc error:Connect Failed
Traceback (most recent call last):
  File "notest_dist_fit_a_line.py", line 70, in <module>
    fetch_list=[avg_cost])
  File "/root/.local/lib/python2.7/site-packages/paddle/v2/fluid/executor.py", line 177, in run
    self.executor.run(program.desc, scope, 0, True, True)
paddle.v2.fluid.core.EnforceNotMet: at [/home/helin/repo/Paddle/paddle/operators/send_op.cc:47]
PaddlePaddle Call Stacks:
0   0x7faab725cf17p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 727
1   0x7faab7aefaacp paddle::operators::SendOp::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 2988
2   0x7faab7310107p paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool) + 1463
3   0x7faab7275893p void pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<void, paddle::framework::Executor, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, pybind11::name, pybind11::is_method, pybind11::sibling>(void (paddle::framework::Executor::*)(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(paddle::framework::Executor*, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool)#1}, void, paddle::framework::Executor*, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::cpp_function::initialize<void, paddle::framework::Executor, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, pybind11::name, pybind11::is_method, pybind11::sibling>(void (paddle::framework::Executor::*)(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(paddle::framework::Executor*, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool)#1}&&, void (*)(paddle::framework::Executor*, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) + 579
4   0x7faab72734e4p pybind11::cpp_function::dispatcher(_object*, _object*, _object*) + 1236
5   0x4cad00p PyEval_EvalFrameEx + 28048
6   0x4c2705p PyEval_EvalCodeEx + 597
7   0x4ca088p PyEval_EvalFrameEx + 24856
8   0x4c2705p PyEval_EvalCodeEx + 597
9   0x4c24a9p PyEval_EvalCode + 25
10  0x4f19efp
11  0x4ec372p PyRun_FileExFlags + 130
12  0x4eaaf1p PyRun_SimpleFileExFlags + 401
13  0x49e208p Py_Main + 1736
14  0x7fab4c825830p __libc_start_main + 240
15  0x49da59p _start + 41
I think we need retries: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/detail/grpc_client.cc#L116
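Right now the client surfaces the first failed send as a fatal error, so a single transient "Connect Failed" aborts the whole run. Retrying with backoff before giving up would tolerate brief connection hiccups. Below is a minimal sketch of the retry-with-backoff idea in Python; the real change would live in the C++ grpc_client.cc, so the names here are illustrative only, not PaddlePaddle API:

```python
import time

def send_with_retry(send_once, max_retries=3, initial_backoff=0.5):
    """Call send_once() until it succeeds or retries are exhausted.

    send_once is any callable that raises on a failed RPC; it stands in
    for the gRPC send that fails with "Connect Failed" in the log above.
    """
    backoff = initial_backoff
    for attempt in range(max_retries):
        try:
            return send_once()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            time.sleep(backoff)
            backoff *= 2  # exponential backoff between attempts
```

With something like this in place, the trainer would only abort after several consecutive failures, which matches the intermittent nature of the error reported above.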
proc param error:name:[fc_0.w_0@GRAD] ep:[172.17.0.5:6174] grpc error:Connect Failed
This crash may happen because the pserver has exited or does not exist. And what is the reason for the pserver's errors?
I kept the training program running for a few minutes and got this error. The pserver is still up.
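A quick way to double-check that the pserver endpoint is still accepting TCP connections while the trainer runs (a hypothetical diagnostic, not part of the original report):

```python
import socket

def endpoint_alive(endpoint, timeout=2.0):
    """Return True if a TCP connection to "host:port" can be opened."""
    host, port = endpoint.rsplit(":", 1)
    try:
        sock = socket.create_connection((host, int(port)), timeout=timeout)
        sock.close()
        return True
    except socket.error:
        return False

print(endpoint_alive("172.17.0.5:6174"))
```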
Intuitively, sending and receiving data over the network on a single local machine should not fail this often. Is there a logic error we haven't noticed?
I can no longer reproduce this on the latest develop branch.