Adam learning algorithm error #1040

Closed
PseudoProgrammer opened this issue Dec 30, 2016 · 5 comments

@PseudoProgrammer commented Dec 30, 2016

The adam learning algorithm hits an error, but training works fine when I set adagrad.
adam settings:

Settings(
    algorithm='sgd',
    learning_rate=0.01,
    learning_method = 'adam',
    adam_beta1 = 0.9,
    adam_beta2 = 0.999,
    ada_epsilon = 1e-6,
    ada_rou = 0.95,
    batch_size = 789,
    learning_rate_decay_a=0,
    learning_rate_decay_b=0,
    num_batches_per_send_parameter=1,
    num_batches_per_get_parameter=1,
  )

adagrad settings:

Settings(
    algorithm='sgd',
    learning_rate=0.01,
    learning_method = 'adagrad',
    ada_epsilon = 1e-6,
    ada_rou = 0.95,
    batch_size = 789,
    learning_rate_decay_a=0,
    learning_rate_decay_b=0,
    num_batches_per_send_parameter=1,
    num_batches_per_get_parameter=1,
  )

train.log:

Fri Dec 30 01:25:25 2016[1,17]<stderr>:+ ./paddle_trainer --num_gradient_servers=40 --trainer_id=17 --pservers=10.90.165.41,10.90.165.39,10.90.165.37,10.90.165.38,10.90.165.35,10.90.165.36,10.90.165.33,10.90.165.34,10.90.165.31,10.90.165.32,10.90.165.30,10.90.168.20,10.90.168.21,10.90.168.22,10.90.168.23,10.90.168.24,10.90.168.25,10.90.168.26,10.90.168.27,10.90.168.28,10.90.168.29,10.90.168.32,10.90.168.33,10.90.168.30,10.90.168.31,10.90.168.36,10.90.168.37,10.90.168.34,10.90.168.35,10.90.168.38,10.90.168.39,10.90.168.40,10.90.168.41,10.90.168.42,10.90.168.43,10.90.168.44,10.90.102.42,10.90.102.41,10.90.102.44,10.90.102.43 --rdma_tcp=tcp --nics=xgbe0 --saving_period=5 --port=7164 --ports_num=1 --local=0 --comment=_job.16597.instances --dot_period=1000 --log_period=1000 --num_passes=5000 --trainer_count=10 --load_missing_parameter_strategy=rand --config=conf/trainer_config.conf --save_dir=./output --python_path=./python-gcc345 --python_bin=python2.7 --use_gpu=0
Fri Dec 30 01:25:41 2016[1,18]<stderr>:F1230 01:25:41.543866  7156 BaseClient.cpp:25] Check failed: numPorts > 0 (0 vs. 0) 
Fri Dec 30 01:25:41 2016[1,18]<stderr>:*** Check failure stack trace: ***
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x8d7788  google::LogMessage::Fail()
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x8d76e0  google::LogMessage::SendToLog()
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x8d7175  google::LogMessage::Flush()
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x8d9f36  google::LogMessageFatal::~LogMessageFatal()
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x766304  paddle::BaseClient::BaseClient()
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x76c909  paddle::ParameterClient2::ParameterClient2()
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x75538f  paddle::SparseRemoteParameterUpdater::init()
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x74b437  _ZNSt6thread5_ImplISt12_Bind_resultIvFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @     0x7fcc36ea0462  execute_native_thread_routine
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @     0x7fcc37529d30  start_thread
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @     0x7fcc366f0afd  clone
@backyes (Contributor) commented Dec 30, 2016

@PseudoProgrammer

Can you check your model config to see whether sparse_update is enabled?

Fri Dec 30 01:25:41 2016[1,18]:F1230 01:25:41.543866 7156 BaseClient.cpp:25] Check failed: numPorts > 0 (0 vs. 0)

This probably means you are running sparse_update training. If so, please add the --ports_num_for_sparse option.
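
For reference, a minimal sketch of where the option goes (the port values here are illustrative; keep the rest of your launch arguments as in the log above, and pass the option consistently on both the pserver and the trainer side):

    # both sides need the sparse port count (values are illustrative)
    ./paddle_pserver2 --port=7164 --ports_num=1 --ports_num_for_sparse=1 ...
    ./paddle_trainer  --port=7164 --ports_num=1 --ports_num_for_sparse=1 ...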

backyes self-assigned this Dec 30, 2016
@PseudoProgrammer (Author)

I get another error after I add sparse_update.
My config:
server_arg="--port=7164 --ports_num=1 --ports_num_for_sparse=1 --pserver_num_threads=5"
test_server_arg="--ports_num_for_sparse=1 --pserver_num_threads=5 --port=17164 --ports_num=1"

train_arg="--saving_period=1 --port=7164 --ports_num=1 --local=0 --comment=$comment --dot_period=1000 --log_period=1000 --num_passes=100 --trainer_count=10 --ports_num_for_sparse=1 --use_sparse_updater=1 --use_old_updater=1 --enable_grad_sparse_update=50000000 --grad_sparse_update_max_sparse_rate=0.50"

test_arg="--port=7164 --ports_num=1 --distribute_test=0 --job=test --test_pass=0 --test_wait=1 --dot_period=1 --log_period=1000 --saving_period=1 --num_passes=500 --start_pserver=0"

Layer(inputs = [Input("input1", parameter_name = "_layer1_1.w", sparse_remote_update = True)],
      name = "layer1_1", bias = Bias(parameter_name = "_layer1_1.bias"),
      active_type = "tanh", type = "fc", size = 128)

train.log

Fri Dec 30 12:59:00 2016[1,0]:+ ./paddle_trainer --num_gradient_servers=40 --trainer_id=0 --pservers=10.90.163.38,10.90.163.37,10.90.136.26,10.90.136.24,10.90.136.25,10.90.163.32,10.90.163.31,10.90.136.23,10.90.163.30,10.90.163.44,10.90.163.41,10.90.163.40,10.90.163.43,10.90.163.42,10.90.163.19,10.90.163.18,10.90.163.17,10.90.163.16,10.90.163.15,10.90.163.14,10.90.163.13,10.90.163.12,10.90.163.11,10.90.163.27,10.90.163.26,10.90.163.29,10.90.163.28,10.90.163.23,10.90.163.22,10.90.163.25,10.90.163.24,10.90.163.20,10.90.139.20,10.90.139.21,10.90.139.13,10.90.139.14,10.90.139.15,10.90.139.16,10.90.139.19,10.90.148.44 --rdma_tcp=tcp --nics=xgbe0 --saving_period=1 --port=7164 --ports_num=1 --local=0 --comment=_job.16646.instances --dot_period=1000 --log_period=1000 --num_passes=100 --trainer_count=10 --ports_num_for_sparse=1 --use_sparse_updater=1 --use_old_updater=1 --enable_grad_sparse_update=50000000 --grad_sparse_update_max_sparse_rate=0.50 --config=conf/trainer_config.conf --save_dir=./output --python_path=./python-gcc345 --python_bin=python2.7 --use_gpu=0
Fri Dec 30 13:03:02 2016[1,5]:*** Aborted at 1483074182 (unix time) try "date -d @1483074182" if you are using GNU date ***
Fri Dec 30 13:03:02 2016[1,5]:PC: @ 0x7a3044 paddle::ProtoClient::recv()
Fri Dec 30 13:03:02 2016[1,5]:*** SIGSEGV (@0x8) received by PID 37076 (TID 0x7f7acf5fe700) from PID 8; stack trace: ***
Fri Dec 30 13:03:02 2016[1,5]: @ 0x7f7fb9d9c200 (unknown)
Fri Dec 30 13:03:02 2016[1,5]: @ 0x7a3044 paddle::ProtoClient::recv()
Fri Dec 30 13:03:02 2016[1,5]: @ 0x76e297 paddle::ParameterClient2::sendParallel()
Fri Dec 30 13:03:02 2016[1,5]: @ 0x74b437 _ZNSt6thread5_ImplISt12_Bind_resultIvFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
Fri Dec 30 13:03:02 2016[1,5]: @ 0x7f7fb970b462 execute_native_thread_routine
Fri Dec 30 13:03:02 2016[1,5]: @ 0x7f7fb9d94d30 start_thread
Fri Dec 30 13:03:02 2016[1,5]: @ 0x7f7fb8f5bafd clone
Fri Dec 30 13:03:02 2016[1,33]:*** Aborted at 1483074182 (unix time) try "date -d @1483074182" if you are using GNU date ***

server.log
Fri Dec 30 12:55:24 2016[1,31]:+ ./paddle_pserver2 --num_gradient_servers=40 --nics=xgbe0 --port=7164 --ports_num=1 --ports_num_for_sparse=1 --pserver_num_threads=5 --rdma_tcp=tcp --comment=_job.16646.instances
Fri Dec 30 13:03:02 2016[1,5]:F1230 13:03:02.362143 34494 SgdOptimizer.cpp:291] Check failed: sparseId == -1UL Sparse update is not supported
Fri Dec 30 13:03:02 2016[1,5]:*** Check failure stack trace: ***
Fri Dec 30 13:03:02 2016[1,5]:F1230 13:03:02.362349 34493 SgdOptimizer.cpp:291] Check failed: sparseId == -1UL Sparse update is not supportedF1230 13:03:02.362406 34492 SgdOptimizer.cpp:291] Check failed: sparseId == -1UL Sparse update is not supportedF1230 13:03:02.362432 34491 SgdOptimizer.cpp:291] Check failed: sparseId == -1UL Sparse update is not supported
Fri Dec 30 13:03:02 2016[1,5]:*** Check failure stack trace: ***

@PseudoProgrammer (Author)

I am using adamax:

Settings(
    algorithm='sgd',
    learning_rate=0.01,
    learning_method = 'adamax',
    adam_beta1 = 0.9,
    adam_beta2 = 0.999,
    ada_epsilon = 1e-6,
    ada_rou = 0.95,
    batch_size = 789,
    learning_rate_decay_a=0,
    learning_rate_decay_b=0,
    num_batches_per_send_parameter=1,
    num_batches_per_get_parameter=1,
)

@backyes (Contributor) commented Dec 30, 2016

Duplicate of #1042:

  • adamax does not support sparse update training.

If sparse update is not enabled, parameter-level L1 regularization is not available; we are working on it.

Please use an optimizer that supports sparse_update; then per-parameter L1 and L2 settings are available.
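
As an illustration only (not a confirmed fix), here is a minimal sketch that keeps the sparse layer config from above and swaps learning_method to adagrad, which the reporter said works; all values are copied from earlier in this thread and should be adapted to the actual model:

    Settings(
        algorithm='sgd',
        learning_rate=0.01,
        learning_method = 'adagrad',   # swapped from 'adamax'; pick a method with sparse update support
        ada_epsilon = 1e-6,
        ada_rou = 0.95,
        batch_size = 789,
        num_batches_per_send_parameter=1,
        num_batches_per_get_parameter=1,
    )

    Layer(inputs = [Input("input1", parameter_name = "_layer1_1.w", sparse_remote_update = True)],
          name = "layer1_1", bias = Bias(parameter_name = "_layer1_1.bias"),
          active_type = "tanh", type = "fc", size = 128)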

Related issues:
#273
#985

@PseudoProgrammer (Author)

Which of the three algorithms below supports sparse update?
adam
adadelta
rmsprop
