Adam learning algorithm error #1040

Closed
PseudoProgrammer opened this issue Dec 30, 2016 · 5 comments

@PseudoProgrammer commented Dec 30, 2016

The adam learning algorithm hits an error, but training works fine when I set adagrad.
adam settings:

Settings(
    algorithm='sgd',
    learning_rate=0.01,
    learning_method = 'adam',
    adam_beta1 = 0.9,
    adam_beta2 = 0.999,
    ada_epsilon = 1e-6,
    ada_rou = 0.95,
    batch_size = 789,
    learning_rate_decay_a=0,
    learning_rate_decay_b=0,
    num_batches_per_send_parameter=1,
    num_batches_per_get_parameter=1,
  )

adagrad settings:

Settings(
    algorithm='sgd',
    learning_rate=0.01,
    learning_method = 'adagrad',
    ada_epsilon = 1e-6,
    ada_rou = 0.95,
    batch_size = 789,
    learning_rate_decay_a=0,
    learning_rate_decay_b=0,
    num_batches_per_send_parameter=1,
    num_batches_per_get_parameter=1,
  )

train.log:

Fri Dec 30 01:25:25 2016[1,17]<stderr>:+ ./paddle_trainer --num_gradient_servers=40 --trainer_id=17 --pservers=10.90.165.41,10.90.165.39,10.90.165.37,10.90.165.38,10.90.165.35,10.90.165.36,10.90.165.33,10.90.165.34,10.90.165.31,10.90.165.32,10.90.165.30,10.90.168.20,10.90.168.21,10.90.168.22,10.90.168.23,10.90.168.24,10.90.168.25,10.90.168.26,10.90.168.27,10.90.168.28,10.90.168.29,10.90.168.32,10.90.168.33,10.90.168.30,10.90.168.31,10.90.168.36,10.90.168.37,10.90.168.34,10.90.168.35,10.90.168.38,10.90.168.39,10.90.168.40,10.90.168.41,10.90.168.42,10.90.168.43,10.90.168.44,10.90.102.42,10.90.102.41,10.90.102.44,10.90.102.43 --rdma_tcp=tcp --nics=xgbe0 --saving_period=5 --port=7164 --ports_num=1 --local=0 --comment=_job.16597.instances --dot_period=1000 --log_period=1000 --num_passes=5000 --trainer_count=10 --load_missing_parameter_strategy=rand --config=conf/trainer_config.conf --save_dir=./output --python_path=./python-gcc345 --python_bin=python2.7 --use_gpu=0
Fri Dec 30 01:25:41 2016[1,18]<stderr>:F1230 01:25:41.543866  7156 BaseClient.cpp:25] Check failed: numPorts > 0 (0 vs. 0) 
Fri Dec 30 01:25:41 2016[1,18]<stderr>:*** Check failure stack trace: ***
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x8d7788  google::LogMessage::Fail()
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x8d76e0  google::LogMessage::SendToLog()
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x8d7175  google::LogMessage::Flush()
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x8d9f36  google::LogMessageFatal::~LogMessageFatal()
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x766304  paddle::BaseClient::BaseClient()
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x76c909  paddle::ParameterClient2::ParameterClient2()
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x75538f  paddle::SparseRemoteParameterUpdater::init()
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @           0x74b437  _ZNSt6thread5_ImplISt12_Bind_resultIvFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @     0x7fcc36ea0462  execute_native_thread_routine
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @     0x7fcc37529d30  start_thread
Fri Dec 30 01:25:41 2016[1,18]<stderr>:    @     0x7fcc366f0afd  clone
@backyes (Contributor) commented Dec 30, 2016

@PseudoProgrammer

Can you check your model config to see whether sparse_update is enabled?

Fri Dec 30 01:25:41 2016[1,18]:F1230 01:25:41.543866 7156 BaseClient.cpp:25] Check failed: numPorts > 0 (0 vs. 0)

This probably means you are running sparse_update training. If so, please add the --ports_num_for_sparse option.
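
For reference, a minimal sketch of where the option goes (the port values here are illustrative; keep the rest of your launch arguments as in the log above, and pass the option consistently on both the pserver and the trainer side):

    # both sides need the sparse port count (values are illustrative)
    ./paddle_pserver2 --port=7164 --ports_num=1 --ports_num_for_sparse=1 ...
    ./paddle_trainer  --port=7164 --ports_num=1 --ports_num_for_sparse=1 ...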

backyes self-assigned this Dec 30, 2016
@PseudoProgrammer (Author)

I get another error after I add sparse_update.
My config:
server_arg="--port=7164 --ports_num=1 --ports_num_for_sparse=1 --pserver_num_threads=5"
test_server_arg="--ports_num_for_sparse=1 --pserver_num_threads=5 --port=17164 --ports_num=1"

train_arg="--saving_period=1 --port=7164 --ports_num=1 --local=0 --comment=$comment --dot_period=1000 --log_period=1000 --num_passes=100 --trainer_count=10 --ports_num_for_sparse=1 --use_sparse_updater=1 --use_old_updater=1 --enable_grad_sparse_update=50000000 --grad_sparse_update_max_sparse_rate=0.50"

test_arg="--port=7164 --ports_num=1 --distribute_test=0 --job=test --test_pass=0 --test_wait=1 --dot_period=1 --log_period=1000 --saving_period=1 --num_passes=500 --start_pserver=0"

Layer(inputs = [Input("input1", parameter_name = "_layer1_1.w", sparse_remote_update = True)],
      name = "layer1_1", bias = Bias(parameter_name = "_layer1_1.bias"),
      active_type = "tanh", type = "fc", size = 128)

train.log

Fri Dec 30 12:59:00 2016[1,0]:+ ./paddle_trainer --num_gradient_servers=40 --trainer_id=0 --pservers=10.90.163.38,10.90.163.37,10.90.136.26,10.90.136.24,10.90.136.25,10.90.163.32,10.90.163.31,10.90.136.23,10.90.163.30,10.90.163.44,10.90.163.41,10.90.163.40,10.90.163.43,10.90.163.42,10.90.163.19,10.90.163.18,10.90.163.17,10.90.163.16,10.90.163.15,10.90.163.14,10.90.163.13,10.90.163.12,10.90.163.11,10.90.163.27,10.90.163.26,10.90.163.29,10.90.163.28,10.90.163.23,10.90.163.22,10.90.163.25,10.90.163.24,10.90.163.20,10.90.139.20,10.90.139.21,10.90.139.13,10.90.139.14,10.90.139.15,10.90.139.16,10.90.139.19,10.90.148.44 --rdma_tcp=tcp --nics=xgbe0 --saving_period=1 --port=7164 --ports_num=1 --local=0 --comment=_job.16646.instances --dot_period=1000 --log_period=1000 --num_passes=100 --trainer_count=10 --ports_num_for_sparse=1 --use_sparse_updater=1 --use_old_updater=1 --enable_grad_sparse_update=50000000 --grad_sparse_update_max_sparse_rate=0.50 --config=conf/trainer_config.conf --save_dir=./output --python_path=./python-gcc345 --python_bin=python2.7 --use_gpu=0
Fri Dec 30 13:03:02 2016[1,5]:*** Aborted at 1483074182 (unix time) try "date -d @1483074182" if you are using GNU date ***
Fri Dec 30 13:03:02 2016[1,5]:PC: @ 0x7a3044 paddle::ProtoClient::recv()
Fri Dec 30 13:03:02 2016[1,5]:*** SIGSEGV (@0x8) received by PID 37076 (TID 0x7f7acf5fe700) from PID 8; stack trace: ***
Fri Dec 30 13:03:02 2016[1,5]: @ 0x7f7fb9d9c200 (unknown)
Fri Dec 30 13:03:02 2016[1,5]: @ 0x7a3044 paddle::ProtoClient::recv()
Fri Dec 30 13:03:02 2016[1,5]: @ 0x76e297 paddle::ParameterClient2::sendParallel()
Fri Dec 30 13:03:02 2016[1,5]: @ 0x74b437 _ZNSt6thread5_ImplISt12_Bind_resultIvFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
Fri Dec 30 13:03:02 2016[1,5]: @ 0x7f7fb970b462 execute_native_thread_routine
Fri Dec 30 13:03:02 2016[1,5]: @ 0x7f7fb9d94d30 start_thread
Fri Dec 30 13:03:02 2016[1,5]: @ 0x7f7fb8f5bafd clone
Fri Dec 30 13:03:02 2016[1,33]:*** Aborted at 1483074182 (unix time) try "date -d @1483074182" if you are using GNU date ***

server.log
Fri Dec 30 12:55:24 2016[1,31]:+ ./paddle_pserver2 --num_gradient_servers=40 --nics=xgbe0 --port=7164 --ports_num=1 --ports_num_for_sparse=1 --pserver_num_threads=5 --rdma_tcp=tcp --comment=_job.16646.instances
Fri Dec 30 13:03:02 2016[1,5]:F1230 13:03:02.362143 34494 SgdOptimizer.cpp:291] Check failed: sparseId == -1UL Sparse update is not supported
Fri Dec 30 13:03:02 2016[1,5]:*** Check failure stack trace: ***
Fri Dec 30 13:03:02 2016[1,5]:F1230 13:03:02.362349 34493 SgdOptimizer.cpp:291] Check failed: sparseId == -1UL Sparse update is not supportedF1230 13:03:02.362406 34492 SgdOptimizer.cpp:291] Check failed: sparseId == -1UL Sparse update is not supportedF1230 13:03:02.362432 34491 SgdOptimizer.cpp:291] Check failed: sparseId == -1UL Sparse update is not supported
Fri Dec 30 13:03:02 2016[1,5]:*** Check failure stack trace: ***

@PseudoProgrammer (Author)

I am using adamax:

Settings(
    algorithm='sgd',
    learning_rate=0.01,
    learning_method = 'adamax',
    adam_beta1 = 0.9,
    adam_beta2 = 0.999,
    ada_epsilon = 1e-6,
    ada_rou = 0.95,
    batch_size = 789,
    learning_rate_decay_a=0,
    learning_rate_decay_b=0,
    num_batches_per_send_parameter=1,
    num_batches_per_get_parameter=1,
)

@backyes (Contributor) commented Dec 30, 2016

Duplicate of #1042:

  • adamax does not support sparse update training.

If sparse update is not enabled, parameter-level L1 regularization is not available; we are working on it.

Please use an optimizer that supports sparse_update; then per-parameter L1 and L2 settings are available.
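
As an illustration only (not a confirmed fix), here is a minimal sketch that keeps the sparse layer config from above and swaps learning_method to adagrad, which the reporter said works; all values are copied from earlier in this thread and should be adapted to the actual model:

    Settings(
        algorithm='sgd',
        learning_rate=0.01,
        learning_method = 'adagrad',   # swapped from 'adamax'; pick a method with sparse update support
        ada_epsilon = 1e-6,
        ada_rou = 0.95,
        batch_size = 789,
        num_batches_per_send_parameter=1,
        num_batches_per_get_parameter=1,
    )

    Layer(inputs = [Input("input1", parameter_name = "_layer1_1.w", sparse_remote_update = True)],
          name = "layer1_1", bias = Bias(parameter_name = "_layer1_1.bias"),
          active_type = "tanh", type = "fc", size = 128)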

Related issues:
#273
#985

@PseudoProgrammer (Author)

Which of the three algorithms below supports sparse update?
adam
adadelta
rmsprop
