[deeplab] Training deeplab model with ADE20K dataset #3730

walkerlala · 2018-03-24T12:51:41Z

System information

What is the top-level directory of the model you are using: deeplab
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.6.0
Bazel version (if compiling from source):
CUDA/cuDNN version: 9.0/7.0.4
GPU model and memory: 1080Ti * 2 , 10Gb * 2
Exact command to reproduce:

Describe the problem

This is a feature request. I am trying to train the deeplab model with the ADE20K dataset (see this presentation). I have finished the data format conversion and "successfully" trained the model on a small subset of ADE20K. Below is the modification to the file research/deeplab/datasets/segmentation_dataset.py, which is used to extract segmentation data.

diff --git a/research/deeplab/datasets/segmentation_dataset.py b/research/deeplab/datasets/segmentation_dataset.py
index a777252..8648fb2 100644
--- a/research/deeplab/datasets/segmentation_dataset.py
+++ b/research/deeplab/datasets/segmentation_dataset.py
@@ -85,10 +85,22 @@ _PASCAL_VOC_SEG_INFORMATION = DatasetDescriptor(
     ignore_label=255,
 )
 
+_ADE20K_INFORMATION = DatasetDescriptor(
+    splits_to_sizes = {
+        'train': 40,
+        'val': 5,
+    },
+    # TODO temporarily change it to 21 otherwise dimension mismatch
+    num_classes=21,
+    ignore_label=255,
+)
+
 
 _DATASETS_INFORMATION = {
     'cityscapes': _CITYSCAPES_INFORMATION,
     'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
+    'ade20k': _ADE20K_INFORMATION,
 }
 
 # Default file pattern of TFRecord of TensorFlow Example.

The problem is that the ADE20K dataset has 150 classes, which differs from the VOC and Cityscapes datasets. That causes a problem with the checkpoint file: currently there are only pretrained models for the VOC and Cityscapes datasets. So we have two choices here:

1. Do not use the checkpoint file. In this case, there is an error:

absl.flags._exceptions.IllegalFlagValueError: flag --tf_initial_checkpoint=None: Flag --tf_initial_checkpoint must be specified.

2. Set num_classes=21 to use those two provided checkpoint files.

Are there any alternatives to these?

If anyone has a workable solution for the ADE20K dataset, it would be really appreciated.

aquariusjay · 2018-03-24T16:47:23Z

You could modify the code here so that the exclude_list only includes `_LOGITS_SCOPE_NAME`, and also set the flag initialize_last_layer = False. (Note that you still want to restore the variables in ASPP, the decoder and so on.) By doing so, only the weights in the last classification layer are left uninitialized (so you can then use a classification layer with 150 classes).
You need to explore min_resize_value and max_resize_value (set resize_factor = output_stride) for ADE20K, which contains images of widely varying scales (e.g., dimensions range from 50 to 2000). By setting min_resize_value and max_resize_value, you can resize the images on-the-fly to a similar range (or you could do that manually while pre-processing the dataset). Note, however, that these hyper-parameters may affect the performance, and we have not yet explored that carefully.
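
The on-the-fly resizing described above can be sketched in plain Python. This is only an illustration, not the DeepLab implementation: the function name `target_size` and the rounding rule (grow each side until `side - 1` is a multiple of `resize_factor`) are assumptions made for the example.

```python
import math


def target_size(height, width, min_resize_value, max_resize_value,
                resize_factor=None):
    """Return a (height, width) pair scaled into the requested range."""
    # Scale so that the shorter side reaches min_resize_value.
    scale = min_resize_value / min(height, width)
    # If that pushes the longer side past max_resize_value, scale by the
    # longer side instead.
    if max(height, width) * scale > max_resize_value:
        scale = max_resize_value / max(height, width)
    new_h = int(math.floor(height * scale + 0.5))
    new_w = int(math.floor(width * scale + 0.5))
    if resize_factor:
        # Assumed rule: grow each side until (side - 1) % resize_factor == 0.
        new_h += (resize_factor - (new_h - 1) % resize_factor) % resize_factor
        new_w += (resize_factor - (new_w - 1) % resize_factor) % resize_factor
    return new_h, new_w


# A very small and a very large image end up in a comparable size range.
print(target_size(50, 100, 256, 512, resize_factor=16))
print(target_size(2000, 1500, 256, 512, resize_factor=16))
```

Note that with this rounding rule a side can end up slightly above max_resize_value, so the crop size would need to leave room for that.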

walkerlala · 2018-03-25T13:07:52Z

@aquariusjay Thanks for the hints. Now I have started the training, using the provided VOC model checkpoint, setting fine_tune_batch_norm to False, using the mobilenet_v2 variant and a batch size of 8. Hopefully the loss will drop after several hours...

There are still two things confusing me:

The segmentation annotation images in the ADE20K dataset have three channels, but I am reading them with label_reader = build_data.ImageReader('png', channels=1), as we do for the VOC dataset (in datasets/build_voc2012_data.py). Will that be a problem?
Why do we have the resize_factor parameter?

walkerlala · 2018-03-25T13:08:19Z

Oh, will it be OK to prepare a pull request for the ADE20K dataset?

aquariusjay · 2018-03-26T18:28:04Z

Regarding your previous questions:

The groundtruth images should contain only 1 channel with values = semantic labels.
You could check the code for details.

We currently do not have any plan to prepare that.
However, note that one should be able to do that by using the provided code/model/script.
Also, any contributions of extra datasets to the codebase are welcome.
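
A quick sanity check for the "1 channel with values = semantic labels" requirement could look like the sketch below. It is a plain-Python illustration that assumes the label image has already been decoded into nested lists of pixel values; the helper name `invalid_label_values` is hypothetical and not part of the deeplab code.

```python
def invalid_label_values(label_rows, num_classes, ignore_label=255):
    """Return the sorted label values that are neither a class index
    in [0, num_classes) nor the ignore label."""
    bad = set()
    for row in label_rows:
        for value in row:
            # A tuple or list per pixel means the image still has several
            # channels and is not a valid single-channel label map.
            if isinstance(value, (tuple, list)):
                raise ValueError(
                    'expected one channel per pixel, got %d' % len(value))
            if value != ignore_label and not 0 <= value < num_classes:
                bad.add(value)
    return sorted(bad)


# Toy 2x3 label map for a 150-class dataset: the value 150 is out of range.
print(invalid_label_values([[0, 1, 255], [149, 150, 3]], num_classes=150))
```

An empty result means every pixel is either a valid class index or the ignore label.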

Cheers,

brett-whitford · 2018-03-30T17:01:57Z

@aquariusjay,

I'm currently having similar issues attempting to train with a custom dataset and was hoping you could offer some insight.

You could modify the code here so that the exclude_list only includes the `_LOGITS_SCOPE_NAME' and also set the flag initialize_last_layer = False.

The link you included "here" appears to need a Google SSO to login. I am assuming that was a link to the train_util.py script. Here are the changes I have currently made to implement your architecture on my custom dataset:

segmentation_dataset.py

I added the information for my "toy_dataset"

_TOY_DATASET_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 800,
        'trainval': 1000,
        'val': 200,
    },
    num_classes=10,
    ignore_label=255,
)

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'toy_dataset': _TOY_DATASET_INFORMATION,
}

train.py

I do not initialize the final layer of the network.
I point training to the directory containing my custom "toy_dataset"

flags.DEFINE_boolean('initialize_last_layer', False,
                     'Initialize the last layer.')

flags.DEFINE_string('dataset', 'toy_dataset',
                    'Name of the segmentation dataset.')

train_utils.py

I modify the code here so that the exclude_list only includes `_LOGITS_SCOPE_NAME`, as you stated above.

  exclude_list = ['_LOGITS_SCOPE_NAME']
  if not initialize_last_layer:
    exclude_list.extend(last_layers)

eval.py

I point evaluation to my custom "toy_dataset".

flags.DEFINE_string('dataset', 'toy_dataset',
                    'Name of the segmentation dataset.')

However, when I run this, my code appears to train successfully but then runs into an issue with the confusion matrix during evaluation (I include the traceback below for reference). Any tips/suggestions on how to fix this?

Thanks for your help!
Brett

Error Traceback:

~/brett/wss-python/models/research/deeplab$ sh local_test_custom.sh 
Converting toy dataset...
>> Converting image 50/200 shard 0
>> Converting image 100/200 shard 1
>> Converting image 150/200 shard 2
>> Converting image 200/200 shard 3
>> Converting image 250/1000 shard 0
>> Converting image 500/1000 shard 1
>> Converting image 750/1000 shard 2
>> Converting image 1000/1000 shard 3
>> Converting image 200/800 shard 0
>> Converting image 400/800 shard 1
>> Converting image 600/800 shard 2
>> Converting image 800/800 shard 3
--2018-03-30 12:33:03--  http://download.tensorflow.org/models/deeplabv3_pascal_train_aug_2018_01_04.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.8.176, 2607:f8b0:4009:80d::2010
Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.8.176|:80... connected.
HTTP request sent, awaiting response... 416 Requested range not satisfiable

    The file is already fully retrieved; nothing to do.

toy_dataset
INFO:tensorflow:Training on trainval set
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/losses/losses_impl.py:731: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
INFO:tensorflow:Ignoring initialization; other checkpoint exists
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py:736: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
INFO:tensorflow:Restoring parameters from /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt-11
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 11.
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
toy_dataset
INFO:tensorflow:Evaluating on val set
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Eval num images 200
INFO:tensorflow:Eval batch size 1 and num batch 200
INFO:tensorflow:Waiting for new checkpoint at /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train
INFO:tensorflow:Found new checkpoint at /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt-12
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/training/python/training/evaluation.py:303: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt-12
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting evaluation at 2018-03-30-16:35:58
Traceback (most recent call last):
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 175, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 168, in main
    eval_interval_secs=FLAGS.eval_interval_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 301, in evaluation_loop
    timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/training/python/training/evaluation.py", line 452, in evaluate_repeatedly
    session.run(eval_ops, feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 546, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1022, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1113, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1098, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1170, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 950, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [`predictions` out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency_1:0) = ] [255 255 255...] [y (mean_iou/ToInt64_2:0) = ] [10]
	 [[Node: mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_2)]]

Caused by op u'mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert', defined at:
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 175, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 142, in main
    predictions, labels, dataset.num_classes, weights=weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/metrics_impl.py", line 1009, in mean_iou
    num_classes, weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/metrics_impl.py", line 263, in _streaming_confusion_matrix
    labels, predictions, num_classes, weights=weights, dtype=dtypes.float64)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/confusion_matrix.py", line 183, in confusion_matrix
    message='`predictions` out of bound')],
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/check_ops.py", line 579, in assert_less
    return control_flow_ops.Assert(condition, data, summarize=summarize)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 118, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 177, in Assert
    guarded_assert = cond(condition, no_op, true_assert, name="AssertGuard")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2027, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1868, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 175, in true_assert
    condition, data, summarize, name="Assert")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 48, in _assert
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): assertion failed: [`predictions` out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency_1:0) = ] [255 255 255...] [y (mean_iou/ToInt64_2:0) = ] [10]
	 [[Node: mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_2)]]
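
The assertion above reports values of 255 being compared against num_classes = 10: a confusion matrix with 10 rows cannot index 255. The sketch below shows, in plain Python, how a mean IoU computation can skip pixels carrying an ignore label before counting. It is an illustration of the metric only, not the TensorFlow op, and it does not establish what caused the failure in this particular run.

```python
def confusion_matrix(labels, predictions, num_classes, ignore_label=255):
    """Count (label, prediction) pairs, skipping ignored pixels."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for label, prediction in zip(labels, predictions):
        if label == ignore_label:
            continue  # ignored pixels contribute nothing to the metric
        if not (0 <= label < num_classes and 0 <= prediction < num_classes):
            raise ValueError('value out of bound: label=%d prediction=%d'
                             % (label, prediction))
        matrix[label][prediction] += 1
    return matrix


def mean_iou(matrix):
    """Average intersection-over-union over classes that appear at all."""
    ious = []
    for c in range(len(matrix)):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp
        fp = sum(row[c] for row in matrix) - tp
        if tp + fp + fn:
            ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious) if ious else 0.0


labels = [0, 1, 1, 255]
predictions = [0, 1, 0, 1]
print(mean_iou(confusion_matrix(labels, predictions, num_classes=2)))
```

Passing ignore_label=None to `confusion_matrix` with the same data raises the out-of-bound error, which mirrors the kind of failure shown in the traceback.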

walkerlala · 2018-03-31T06:53:42Z

train_utils.py

I modify the code here so that the exclude_list only includes the `_LOGITS_SCOPE_NAME', as you stated above.

  exclude_list = ['_LOGITS_SCOPE_NAME']
  if not initialize_last_layer:
    exclude_list.extend(last_layers)

This should be exclude_list = [_LOGITS_SCOPE_NAME]. That is, _LOGITS_SCOPE_NAME is a variable defined elsewhere (search for it).

wonderit · 2018-04-03T01:03:50Z

@walkerlala

I am trying to train the deeplab model with the ADE20k datasets.
I'm having some problem with data format conversion.
Would you mind sharing the code for ADE20k datasets? It would be really appreciated.

shipengai · 2018-04-03T01:22:12Z

@brett-whitford When I use my own data, I get the same error as you. Can you share your solution?
Thank you very much. I'm looking forward to your reply.

walkerlala · 2018-04-03T02:20:43Z

@wonderit Of course. Please wait for a while until I have access to my GPU server.

walkerlala · 2018-04-03T10:49:18Z

@wonderit Here is the patch for converting training data and training deeplabv3 on ADE20K.

https://gist.github.com/walkerlala/82d978e68407e65158e8825cd470d7e1

(it can also be found at http://fastdrivers.org/misc/patch-for-ade20k.patch )

You can apply this patch on top of commit 1d38a22 or 5281c9a without conflict.

Note:

You need to manually adjust the path in train_ade20k.py for training, and supply the correct path of the training data when converting the data, as documented in the doc.
training data can be found at: http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip

I am also going to submit a PR to get these into the repo. However, I don't have enough GPUs to get a good pretrained model (I only have two Nvidia 1080s...). If you can obtain a decent pretrained model, please share!

walkerlala · 2018-04-03T12:07:34Z

Also, anyone interested in adding ADE20K to deeplabv3 can take a look at this PR I just created: #3853

shipengai · 2018-04-04T02:20:11Z

@walkerlala When using val.py, did you get the error 'predictions' out of bound? It is the same as @brett-whitford's question.
Thank you

shipengai · 2018-04-08T05:41:07Z

@walkerlala Can you share your eval script?

hhwxxx · 2018-04-08T06:39:22Z

@walkerlala @aquariusjay
Hi, I am confused about the exclude_list and initialize_last_layer.

I am not sure whether I understand it correctly:
If one wants to fine-tune deeplab-v3+ on another dataset, does only _LOGITS_SCOPE_NAME need to be excluded?

If so, following @aquariusjay 's suggestion, in "train_utils.py":

exclude_list = [_LOGITS_SCOPE_NAME]
if not initialize_last_layer:
    exclude_list.extend(last_layers)

If initialize_last_layer=false is set, then exclude_list will include the last_layers. In "train.py", last_layers is the list [_LOGITS_SCOPE_NAME, _IMAGE_POOLING_SCOPE, _ASPP_SCOPE, _CONCAT_PROJECTION_SCOPE, _DECODER_SCOPE].
So all variables in the list will be excluded. This seems inconsistent.

Shouldn't it be the following?
initialize_last_layer=true and exclude_list = [_LOGITS_SCOPE_NAME]

lydialixia · 2018-04-09T13:44:56Z

Hi, I'm training on my own dataset as well (only two classes).

When I set initialize_last_layer=false and

exclude_list = ['logits']
if not initialize_last_layer:
    exclude_list.extend(last_layers)

Then when I run vis.py, it gives me all-black images (not binary).

When I only set initialize_last_layer=false, I get binary images (the result is not good, but it at least shows some learning). But it gives me this when I run train.py:

INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 6390723.
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

when training_number_of_steps=100000

Does anyone know why this happens? Thanks!

hhwxxx · 2018-04-10T06:18:32Z

@lydialixia
Hello.
You should add 'global_step' in exclude_list:

exclude_list = ['global_step']
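
The reasoning, as a tiny worked example: if the step counter is restored from the initial checkpoint (an assumption about what happens when 'global_step' is not excluded), training starts at that step and may already be past training_number_of_steps. The helper name `steps_left` is hypothetical.

```python
def steps_left(restored_global_step, training_number_of_steps):
    """Number of training steps that will actually run."""
    return max(0, training_number_of_steps - restored_global_step)


# Step counter carried over from the checkpoint: nothing left to train.
print(steps_left(6390723, 100000))
# Step counter excluded from the restore, so it starts from 0.
print(steps_left(0, 100000))
```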

But I am still confused about whether one should set initialize_last_layer=false when fine-tuning deeplab-v3+ on another task.

aquariusjay · 2018-04-10T16:43:00Z

When you want to fine-tune DeepLab on other datasets, there are a few cases:

You want to re-use ALL the trained weights: set initialize_last_layer = True (last_layers_contain_logits_only does not matter in this case).
You want to re-use ONLY the network backbone (i.e., exclude ASPP, decoder and so on): set initialize_last_layer = False and last_layers_contain_logits_only = False.
You want to re-use ALL the trained weights EXCEPT the logits (since the num_classes may be different): set initialize_last_layer = False and last_layers_contain_logits_only = True.
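
The three cases can be summarized as a small function that builds the list of scopes left out of the checkpoint restore. This is a sketch of the logic described above rather than the code in train.py or train_utils.py; the scope name strings are assumed for illustration.

```python
def build_exclude_list(initialize_last_layer, last_layers_contain_logits_only):
    """Return the scopes that are NOT restored from the checkpoint."""
    # Assumed scope names for the layers between backbone and logits.
    extra_layers = ['image_pooling', 'aspp', 'concat_projection', 'decoder']
    exclude_list = ['global_step']
    if not initialize_last_layer:
        exclude_list.append('logits')
        if not last_layers_contain_logits_only:
            exclude_list.extend(extra_layers)
    return exclude_list


# Case 1: re-use all trained weights.
print(build_exclude_list(True, False))
# Case 2: re-use only the backbone.
print(build_exclude_list(False, False))
# Case 3: re-use everything except the logits (different num_classes).
print(build_exclude_list(False, True))
```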

georgosgeorgos · 2018-04-10T19:38:52Z

Hi @walkerlala: did you manage to finetune the ADE20K dataset?
I'm trying to finetune on a dataset of the same size, but without success: after the first ~2K iterations the loss stops decreasing and starts to oscillate (~20K iterations).
I tried different learning rates, removed the regularization, but for the moment no improvement.

walkerlala · 2018-04-12T09:43:15Z

@georgosgeorgos No, in the end I could not fine-tune the model on the ADE20K dataset. I don't have enough GPUs. Every time I try to fine-tune the batch normalization parameters, the model blows up with an out-of-memory error. So I froze the batch normalization layers when training. In the end I only got a model with "modest" performance:

Here is the original image (too large to display here): http://www.fastdrivers.org/misc/stuffseg-origin.jpg

Here is the segmentation result:

However I can get a satisfying result with PSPNet:

According to the slides from the 2017 Coco + Places Workshop, deeplabv3 should also be able to do that, but I haven't had any luck fine-tuning it. Hopefully Google can provide a fine-tuned pre-trained model in the future @aquariusjay.

cfosco · 2018-04-15T18:36:57Z

@brett-whitford - Hi Brett, I am having the exact same problem as you. How did you end up solving it?

cfosco · 2018-04-15T18:55:42Z

@shipeng-uestc - Hi shipeng, did you manage to solve the issue? I am currently using exclude_list=[_LOGITS_SCOPE_NAME] with _LOGITS_SCOPE_NAME imported from deeplab.model as @walkerlala suggested but I am still having the same error as Brett.

jiyongma · 2018-04-16T08:41:02Z

When I run

python deeplab/eval.py \
  --logtostderr \
  --eval_split="val" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --eval_crop_size=513 \
  --eval_crop_size=513 \
  --dataset="ade20k" \
  --checkpoint_dir="./deeplab/datasets/ADE20K/exp/train_on_train_set/train" \
  --eval_logdir="./deeplab/datasets/ADE20K/exp/train_on_train_set/eval" \
  --dataset_dir="./deeplab/datasets/ADE20K/tfrecord"

I get this error:

NotFoundError (see above for traceback): Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[Node: save/RestoreV2/_299 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_306_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Please help me! Thanks.

qmy612 · 2018-04-19T04:52:56Z

@hhwxxx Hello, in your answer to lydialixia, do you mean that in train_util.py the exclude_list should look like this:
exclude_list = ['global_step']
exclude_list = ['logits']

But I still can't start training; the output is:
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 30000.
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

I have also tried exclude_list = ['_LOGITS_SCOPE_NAME']; this doesn't work.
When I just set exclude_list = ['global_step'], the model achieves mean IoU = 0.93 after 10000 iterations; I don't know whether this is wrong.
Waiting online, thank you!

hhwxxx · 2018-04-19T15:42:33Z

@qmy612

Hello. Maybe you can try this:
exclude_list = ['global_step', 'logits']

As for _LOGITS_SCOPE_NAME, it is defined in "model.py", so you should use it like this: model._LOGITS_SCOPE_NAME.

And I have no idea about miou=0.93.

BeSlower · 2018-05-01T18:41:16Z

Just setting initialize_last_layer = False and last_layers_contain_logits_only = True works for me, if you want to train on your own dataset with a different number of classes.

holyprince · 2018-05-03T13:45:43Z

@BeSlower, yes, the solution works for me, but there is another problem: the result is all black with no other label, even though the loss decreases during training. Can anyone help me?

xianshunw · 2018-05-05T07:42:00Z

@qmy612 Did you get the problem solved? I am having the exact same problem as you.

qmy612 · 2018-05-06T08:01:33Z

@xiangjinwu Yes, hhwxxx's answer works.
exclude_list = ['global_step', 'logits']

ajinkya933 · 2019-04-16T12:04:07Z

@apolo74 Thanks I got the output now

apolo74 · 2019-04-16T12:27:39Z

@apolo74 Thanks I got the output now

Happy to hear that!

hakS07 · 2019-06-10T20:32:32Z

@BeSlower

Just set set initialize_last_layer = False and last_layers_contain_logits_only = True works for me, if you wanna train on your own dataset with different num classes.

Hi, I tried training on my own data (classes = 2 = 1 + background) with
initialize_last_layer = False
last_layers_contain_logits_only = True
label = gray-scale image (0, 1)
but the predicted mask I get at test time is a black mask.
Can you help me with this?

hakS07 · 2019-06-10T20:36:46Z

@holyprince

@BeSlower , yes, the solution is work for me but there is another problem that the result is all black and no other label , but during the training process , the loss is decrease. Can anyone help me ?

Hi, I have the same problem as you: the predicted mask is a black image.
Did you fix it?

hakS07 · 2019-06-10T20:50:08Z

@apolo74

Hi guys, sorry I've been disconnected from this thread... the black output is related to 2 very important settings:

Assuming that you are re-training on your own data that, for example, has 2 classes... in my toy case I mentioned I created a dataset with circles and squares. Then I have 2 classes BUT the parameter called "--num-classes" should be 4 because: 2 (own classes) + 1 (background) + 1 (ignore_label)
The pixel values in your "background" class are supposed to be 0, pixel values for your first class should be 1, for your second class should be 2 and so on... DON'T save your classes with other values like 100 or 224, you have to save your class images following the order from 1 to N
Hope this helps
/B

Hi, I tried training on my custom dataset
as you said:
classes = 3 = 1 (obj) + background + ignore_label
label = gray_scale image (0, 1)
In my labels there are two pixel values: 0 for background and 1 for the object.
So should I include the ignore_label when calculating the number of classes?
What I get as output is a black mask.
Can you help me fix it?

yougoforward · 2019-06-17T10:46:55Z

Hey guys! Have you ever evaluated the provided ADE20K pretrained models on the val set? I have tested them, but both mobilenetv2_ade20k_train and xception65_ade20k_train are about 3%-4% below the reported performance.
Here is my evaluation script:
python eval.py \
  --logtostderr \
  --eval_split="val" \
  --model_variant="xception_65" \
  --atrous_rates=12 \
  --atrous_rates=24 \
  --atrous_rates=36 \
  --output_stride=8 \
  --decoder_output_stride=4 \
  --eval_crop_size=513 \
  --eval_crop_size=513 \
  --min_resize_value=513 \
  --max_resize_value=513 \
  --resize_factor=8 \
  --aspp_with_batch_norm=true \
  --aspp_with_separable_conv=true \
  --decoder_use_separable_conv=true \
  --dataset="ade20k" \
  --checkpoint_dir="datasets/ADE20K/deeplabv3_xception_ade20k_train" \
  --eval_logdir="datasets/ADE20K/exp/v3plus/eval_ori" \
  --dataset_dir="datasets/ADE20K/tfrecord" \
  --max_number_of_evaluations=1 \
  --eval_scales=0.5 \
  --eval_scales=0.75 \
  --eval_scales=1.0 \
  --eval_scales=1.25 \
  --eval_scales=1.5 \
  --eval_scales=1.75 \
  --add_flipped_images=true
By the way, the pretrained models for pascal and cityscapes work well. Could someone help me verify the performance or give me some advice?

hakS07 · 2019-07-15T09:37:51Z

@apolo74

Hi guys, sorry I've been disconnected from this thread... the black output is related to 2 very important settings:
Assuming that you are re-training on your own data that, for example, has 2 classes... in my toy case I mentioned I created a dataset with circles and squares. Then I have 2 classes BUT the parameter called "--num-classes" should be 4 because: 2 (own classes) + 1 (background) + 1 (ignore_label)
The pixel values in your "background" class are supposed to be 0, pixel values for your first class should be 1, for your second class should be 2 and so on... DON'T save your classes with other values like 100 or 224, you have to save your class images following the order from 1 to N
Hope this helps
/B

Thanks to your descriptive comment, I was able to successfully train deeplab on my custom dataset (14000 images).
After 20000 iterations I tested the trained model with Python code and it detects fine, but when I put the model in an iOS application (after converting it to a tflite model) it gives bad and wrong segmentation.
Do you have any idea about using a deeplab MobileNet trained model on mobile?

ma8tsch · 2019-08-21T08:35:43Z

Hey guys,
does anyone know how one can freeze layers for training? Say I want to freeze the weights of the backbone and only train the rest. Is that possible?

I would really appreciate some help on this matter. Thanks in advance
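
For reference, the general idea of freezing by scope can be sketched without TensorFlow: split the variable names into a frozen group and a trainable group, and hand only the trainable group to the optimizer. The variable names and the 'MobilenetV2' backbone scope are hypothetical, and whether train.py offers a flag for this is not established in this thread.

```python
def split_trainable(variable_names, frozen_scopes):
    """Split names into (trainable, frozen) by scope prefix."""
    trainable, frozen = [], []
    for name in variable_names:
        if any(name.startswith(scope) for scope in frozen_scopes):
            frozen.append(name)
        else:
            trainable.append(name)
    return trainable, frozen


# Hypothetical variable names, for illustration only.
names = ['MobilenetV2/Conv/weights', 'aspp0/weights',
         'decoder/weights', 'logits/semantic/weights']
trainable, frozen = split_trainable(names, frozen_scopes=['MobilenetV2'])
print(trainable)
print(frozen)
```

In TF 1.x the trainable group would typically be passed as var_list when asking the optimizer for gradients, so the frozen variables receive no updates.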

lattard · 2019-10-23T15:40:07Z

@ma8tsch did you manage to freeze some layers eventually? If yes, can you please provide some details?

lolitsgab · 2019-10-29T06:31:00Z

Hi guys, sorry I've been disconnected from this thread... the black output is related to 2 very important settings:

Assuming that you are re-training on your own data that, for example, has 2 classes... in my toy case I mentioned I created a dataset with circles and squares. Then I have 2 classes BUT the parameter called "--num-classes" should be 4 because: 2 (own classes) + 1 (background) + 1 (ignore_label)

The pixel values in your "background" class are supposed to be 0, pixel values for your first class should be 1, for your second class should be 2 and so on... DON'T save your classes with other values like 100 or 224, you have to save your class images following the order from 1 to N
Hope this helps
/B

Thank you for your help :) setting

--initialize_last_layer=False\
--last_layers_contain_logits_only=True

allowed me to no longer have all black masks. I am now getting color-spotted, but not acceptable, masks after only 100 steps.

To phrase what you said more clearly (for me at least), you are saying that images should be labeled with only values from 1...N where N is the number of classes, and 0 is reserved for background, and possibly even N+1 because of the ignore label (I am not utilizing this).

In other words, if you have 2 classes (circle and triangle), you will have 4 labels/indexes in your image.

index 0 = bg
index 1 = class1, say circle
index 2 = class2, say triangle
index 3 (which by default in the other datasets is 255 instead of 3) = IGNORE_LABEL

How can I confirm that this is the case for my dataset?
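One way to check this, as a minimal sketch rather than anything taken from the DeepLab code: collect the distinct pixel values in a label image and compare them with the ids you expect. The helper name `find_unexpected_label_values` and its parameters are hypothetical. The sketch assumes single-channel labels whose valid ids run from 0 to num_labeled_ids - 1, plus an optional ignore value (255 in the stock datasets).

```python
def find_unexpected_label_values(label_rows, num_labeled_ids, ignore_label=255):
    """Return the sorted pixel values that are neither valid ids nor ignore_label.

    label_rows: a 2-D nested sequence of ints (one inner sequence per image row).
    num_labeled_ids: count of ids including background, so valid ids are
        0 .. num_labeled_ids - 1 (an assumption of this sketch).
    """
    allowed = set(range(num_labeled_ids)) | {ignore_label}
    seen = {value for row in label_rows for value in row}
    return sorted(seen - allowed)


# Toy 3x4 label image: background (0), circle (1), triangle (2), ignore (255),
# plus one stray value (224) of the kind warned about above.
toy_label = [
    [0, 0, 1, 1],
    [0, 2, 2, 255],
    [0, 224, 1, 0],
]
print(find_unexpected_label_values(toy_label, num_labeled_ids=3))  # prints [224]
```

For a real PNG, something like `numpy.array(PIL.Image.open(path)).tolist()` should give a suitable `label_rows`, provided the file is stored as a single-channel (indexed or grayscale) image. An empty result means every pixel is either a valid id or the ignore value.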

I'll report back tomorrow after 10,000 steps to confirm.

lolitsgab · 2019-10-29T16:45:13Z

How did y'all color index your images? It seems that my images ARE color indexed as @apolo74 specified.

Here is what my model got after 10000 steps:

This is what a color indexed image looks like in my dataset (not from same picture as above):

Any possible help?

Etheryramirezrs · 2019-11-28T00:58:21Z

Hi, I am trying to run DeepLab on my own dataset, but I get an error when running train.py. It is related to the number of classes: I have 5, but apparently the program is expecting 21, the number of classes in the VOC dataset:
Assign requires shapes of both tensors to match. lhs shape= [5] rhs shape= [21]

JinyuanShao · 2019-12-05T09:27:14Z

@aquariusjay
Hi there, regarding the class imbalance problem: the new version of train_utils.py in deeplab seems to have changed the code, so maybe I can no longer add variables like label1_weight to fix the class imbalance.
Could you give me some advice on how to edit the code to add class weights?
Thank you very much.

PallawiSinghal · 2019-12-11T12:42:06Z

When you want to fine-tune DeepLab on other datasets, there are a few cases:

You want to re-use ALL the trained weights: set initialize_last_layer = True (last_layers_contain_logits_only does not matter in this case).

You want to re-use ONLY the network backbone (i.e., exclude ASPP, decoder and so on): set initialize_last_layer = False and last_layers_contain_logits_only = False.

You want to re-use ALL the trained weights EXCEPT the logits (since the num_classes may be different): set initialize_last_layer = False and last_layers_contain_logits_only = True.
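The three cases above can be sketched as a small decision function. This is only an illustration of the logic described in the list, not the repository's code: the function name `scopes_not_restored` and the scope names are assumptions chosen for readability.

```python
# Scope names below are illustrative assumptions, not copied from the repository.
LOGITS_SCOPE = "logits"
OTHER_HEAD_SCOPES = ["image_pooling", "aspp", "concat_projection", "decoder"]


def scopes_not_restored(initialize_last_layer, last_layers_contain_logits_only):
    """Return the variable scopes to skip when loading the initial checkpoint."""
    # Assumed here: the step counter is never restored when fine-tuning.
    excluded = ["global_step"]
    if initialize_last_layer:
        # Case 1: re-use all trained weights; the other flag has no effect.
        return excluded
    if last_layers_contain_logits_only:
        # Case 3: re-use everything except the logits layer.
        return excluded + [LOGITS_SCOPE]
    # Case 2: re-use only the backbone, dropping ASPP, decoder and logits.
    return excluded + [LOGITS_SCOPE] + OTHER_HEAD_SCOPES


print(scopes_not_restored(False, True))  # prints ['global_step', 'logits']
```

Under this reading, a dataset with a different num_classes needs at least the third case, since the logits layer in the checkpoint has a shape tied to the original class count.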

Hi, my loss does not change; it has become stagnant. I have tried everything mentioned about DeepLabv3+ on every blog I could find.
I am training to detect roads. My images are 2000x2000.
My training data has 45k images.
I have created my images in the PASCAL VOC format. I have three kinds of pixels:
background = [0,0,0]
void class = [255,255,255]
road = [1,1,1]
so the number of classes = 3
I am using PASCAL VOC pre-trained weights.

Changes in train_utils.py are:

1.
    ignore_weight = 0
    label0_weight = 10
    label1_weight = 15
    not_ignore_mask = (
        tf.to_float(tf.equal(scaled_labels, 1)) * label0_weight
        + tf.to_float(tf.equal(scaled_labels, 2)) * label1_weight
        + tf.to_float(tf.equal(scaled_labels, ignore_label)) * ignore_weight)

    # Variables that will not be restored.
    exclude_list = ['global_step', 'logits']
    if not initialize_last_layer:
        exclude_list.extend(last_layers)

My train.py command:

nohup python deeplab/train.py \
  --logtostderr \
  --training_number_of_steps=65000 \
  --train_split="train" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --train_batch_size=2 \
  --initialize_last_layer=False \
  --last_layers_contain_logits_only=True \
  --dataset="pascal_voc_seg" \
  --tf_initial_checkpoint="/data/old_model/models/research/deeplabv3_pascal_trainval/model.ckpt" \
  --train_logdir="/data/old_model/models/research/deeplab/mycheckpoints" \
  --dataset_dir="/data/models/research/deeplab/datasets/tfrecord" > my_output.log &

Please help 👍
INFO:tensorflow:global step 700: loss = 0.1759 (0.449 sec/step)
INFO:tensorflow:global step 710: loss = 0.1695 (0.655 sec/step)
INFO:tensorflow:global step 720: loss = 0.1742 (0.689 sec/step)
INFO:tensorflow:global step 730: loss = 0.1710 (0.505 sec/step)
INFO:tensorflow:global step 740: loss = 0.1708 (0.868 sec/step)
INFO:tensorflow:global step 750: loss = 0.1683 (0.632 sec/step)
INFO:tensorflow:global step 760: loss = 0.1692 (0.442 sec/step)
INFO:tensorflow:global step 770: loss = 0.1693 (0.597 sec/step)
INFO:tensorflow:global step 780: loss = 0.1665 (0.441 sec/step)
INFO:tensorflow:global step 790: loss = 0.1680 (0.548 sec/step)
INFO:tensorflow:global step 800: loss = 0.1708 (0.372 sec/step)
INFO:tensorflow:global step 810: loss = 0.1674 (0.327 sec/step)
INFO:tensorflow:global step 820: loss = 0.1666 (0.951 sec/step)
INFO:tensorflow:global step 830: loss = 0.1651 (0.557 sec/step)
INFO:tensorflow:global step 840: loss = 0.1663 (0.506 sec/step)
INFO:tensorflow:global step 850: loss = 0.1646 (0.446 sec/step)
INFO:tensorflow:global step 860: loss = 0.1666 (0.424 sec/step)
INFO:tensorflow:global step 870: loss = 0.1654 (0.520 sec/step)
INFO:tensorflow:global step 880: loss = 0.1662 (0.675 sec/step)
INFO:tensorflow:global step 890: loss = 0.1673 (0.325 sec/step)
INFO:tensorflow:global step 900: loss = 0.1633 (0.548 sec/step)
INFO:tensorflow:global step 910: loss = 0.1659 (0.374 sec/step)
INFO:tensorflow:global step 920: loss = 0.1639 (0.663 sec/step)
INFO:tensorflow:global step 930: loss = 0.1658 (0.442 sec/step)
INFO:tensorflow:global step 940: loss = 0.1654 (0.568 sec/step)
.
.
.
INFO:tensorflow:global step 17850: loss = 0.1416 (0.555 sec/step)
INFO:tensorflow:global step 17860: loss = 0.1417 (0.684 sec/step)
INFO:tensorflow:global step 17870: loss = 0.1415 (0.572 sec/step)
INFO:tensorflow:global step 17880: loss = 0.1417 (0.569 sec/step)
INFO:tensorflow:global step 17890: loss = 0.1415 (0.535 sec/step)
INFO:tensorflow:global step 17900: loss = 0.1415 (0.541 sec/step)
INFO:tensorflow:global step 17910: loss = 0.1419 (0.459 sec/step)
INFO:tensorflow:global step 17920: loss = 0.1415 (0.800 sec/step)
INFO:tensorflow:global step 17930: loss = 0.1417 (0.647 sec/step)
INFO:tensorflow:global step 17940: loss = 0.1416 (0.509 sec/step)
INFO:tensorflow:global step 17950: loss = 0.1416 (0.755 sec/step)
INFO:tensorflow:global step 17960: loss = 0.1417 (0.495 sec/step)
INFO:tensorflow:global step 17970: loss = 0.1419 (0.556 sec/step)
INFO:tensorflow:global step 17980: loss = 0.1417 (0.492 sec/step)
INFO:tensorflow:global step 17990: loss = 0.1416 (0.878 sec/step)
INFO:tensorflow:global step 18000: loss = 0.1415 (0.803 sec/step)
INFO:tensorflow:global step 18010: loss = 0.1418 (0.695 sec/step)
INFO:tensorflow:global step 18020: loss = 0.1418 (0.449 sec/step)
INFO:tensorflow:global step 18030: loss = 0.1415 (0.678 sec/step)
INFO:tensorflow:global step 18040: loss = 0.1418 (0.449 sec/step)
INFO:tensorflow:global step 18050: loss = 0.1415 (0.681 sec/step)
INFO:tensorflow:global step 18060: loss = 0.1415 (0.866 sec/step)
INFO:tensorflow:global step 18070: loss = 0.1417 (0.534 sec/step)
INFO:tensorflow:global step 18080: loss = 0.1415 (0.939 sec/step)
INFO:tensorflow:global step 18090: loss = 0.1416 (0.349 sec/step)
INFO:tensorflow:global step 18100: loss = 0.1416 (0.576 sec/step)
INFO:tensorflow:global step 18110: loss = 0.1416 (0.626 sec/step)
INFO:tensorflow:global step 18120: loss = 0.1418 (0.951 sec/step)
INFO:tensorflow:global step 18130: loss = 0.1417 (0.386 sec/step)
INFO:tensorflow:global step 18140: loss = 0.1417 (0.375 sec/step)
@aquariusjay

PallawiSinghal · 2019-12-11T13:46:23Z

As I do not have access to your dataset, and it usually takes experimental experience to tune the hyper-parameters.

@aquariusjay Hi, may I know how we can analyze our dataset to work out these values?
ignore_weight
label0_weight
label1_weight
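One common heuristic for choosing such weights (an assumption here, not something prescribed in this thread) is to count how many pixels each label occupies across the training masks and weight each label inversely to its frequency. A minimal sketch, with the hypothetical helper `inverse_frequency_weights` scaling the most frequent class to 1.0 and leaving ignored pixels out of the count:

```python
from collections import Counter


def inverse_frequency_weights(label_images, num_classes, ignore_label=255):
    """Return one weight per class id, inversely proportional to its pixel count.

    label_images: iterable of 2-D nested sequences of int label ids.
    The most frequent class gets weight 1.0; classes that never occur get 0.0.
    """
    counts = Counter()
    for image in label_images:
        for row in image:
            counts.update(v for v in row if v != ignore_label)
    max_count = max((counts[c] for c in range(num_classes)), default=0)
    return [max_count / counts[c] if counts[c] else 0.0 for c in range(num_classes)]


# Toy dataset: one 2x4 mask with 5 background pixels, 2 road pixels, 1 ignored.
masks = [[[0, 0, 0, 1], [0, 0, 255, 1]]]
print(inverse_frequency_weights(masks, num_classes=2))  # prints [1.0, 2.5]
```

Weights found this way are only a starting point; as the quoted reply says, the final values usually need experimental tuning.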

HanChen-HUST · 2019-12-29T01:42:42Z

@PallawiSinghal did you solve it? I also want to change the loss_weight.

HanChen-HUST · 2019-12-29T01:43:45Z

@jinyuan30 did you solve it? I also want to change the loss_weight.

HanChen-HUST · 2019-12-29T01:45:01Z

@aquariusjay Hi there, regarding the class imbalance problem: the new version of train_utils.py in deeplab seems to have changed the code, so maybe I can no longer add variables like label1_weight to fix the class imbalance.
Could you give me some advice on how to edit the code to add class weights?
Thank you very much.

LightingX · 2020-03-17T12:20:58Z

@aquariusjay Hi there, regarding the class imbalance problem: the new version of train_utils.py in deeplab seems to have changed the code, so maybe I can no longer add variables like label1_weight to fix the class imbalance.
Could you give me some advice on how to edit the code to add class weights?
Thank you very much.

Hello, it seems I have run into the same problem. Have you solved it yet?

Alive1024 · 2020-04-03T17:06:26Z

@LightingX Hi, friend! Have you figured out how to adjust the loss weight in the new version of train_utils.py?
I tried to change label_weights from None to a Python list in common.py, but I got a ValueError: Subscripts with ellipses are not yet supported

claudiourbina · 2020-04-16T22:49:29Z

@aquariusjay Hi there, regarding the class imbalance problem: the new version of train_utils.py in deeplab seems to have changed the code, so maybe I can no longer add variables like label1_weight to fix the class imbalance.
Could you give me some advice on how to edit the code to add class weights?
Thank you very much.

Did you solve it? I have the same problem now :/.

LightingX · 2020-04-17T09:18:56Z

@Alive1024 @claudiourbina Hey guys, in the latest version it seems we can adjust the weights through a flag. When training, add the label_weights flag to the training command. For example, if I have 2 classes and their weights are 0.01 and 1, I can add this to the training flags:

--label_weights={0.01,1.0}
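To make the effect concrete, here is a minimal sketch of how such per-label weights would translate into per-pixel loss weights. It assumes the i-th entry of label_weights applies to pixels labeled i and that ignored pixels contribute nothing; the helper `pixel_weights` is hypothetical and is not the code in train_utils.py.

```python
def pixel_weights(labels, label_weights, ignore_label=255):
    """Map each label id to its loss weight; ignored pixels get weight 0.0."""
    return [0.0 if lab == ignore_label else label_weights[lab] for lab in labels]


# Flattened toy mask: two background pixels, two foreground pixels, one ignored.
print(pixel_weights([0, 0, 1, 255, 1], label_weights=[0.01, 1.0]))
# prints [0.01, 0.01, 1.0, 0.0, 1.0]
```

With weights of 0.01 and 1.0, a background pixel counts one hundredth as much as a foreground pixel in the loss, which is the usual motivation when background dominates the image.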

RostyslavBryiovskyi · 2021-03-29T16:58:22Z

@essalahsouad Hi! Did you solve the problem with black images? It is still an issue for me.

walkerlala mentioned this issue Apr 12, 2018

[deeplab] [feature request] update FAQ for 'train the model on other datasets' #3940

Closed

95xueqian mentioned this issue May 4, 2018

[deeplab v3+] I want to use the model on my own dataset, which just has 8 classes #4170

Closed

aptlin mentioned this issue Jun 10, 2019

[deeplab] Add Cityscapes and ADE20K to the notebook demo #6991

Closed

ravikyram added models:research models that come under research directory type:support labels Jul 10, 2020

ravikyram assigned gpapan, YknZhu and aquariusjay Jul 10, 2020

crouchjt mentioned this issue Jul 18, 2020

Extending the Pascal Model with more classes #8905

Open
