Image classification reference implementation is failing on Ubuntu 22.04 #643

Closed
arjunsuresh opened this issue May 8, 2023 · 3 comments

Comments

@arjunsuresh
Contributor

arjunsuresh commented May 8, 2023

I'm trying to run the image classification reference implementation on Ubuntu 22.04 with Python 3.10 and TensorFlow 2.12. It currently fails with the error below.

TypeError: PolynomialDecayWithWarmup.__call__() missing 1 required positional argument: 'step'
I0508 12:22:25.691070 140615434081856 coordinator.py:213] Error reported to Coordinator: PolynomialDecayWithWarmup.__call__() missing 1 required positional argument: 'step'

Detailed error

python3 ./resnet_ctl_imagenet_main.py --base_learning_rate=8.5 --batch_size=1024 --clean --data_dir=../../../imagenet/tf_records/ --datasets_num_private_threads=32 --dtype=fp32 --device_warmup_steps=1 --noenable_device_warmup --enable_eager --noenable_xla --epochs_between_evals=4 --noeval_dataset_cache --eval_offset_epochs=2 --eval_prefetch_batchs=192 --label_smoothing=0.1 --lars_epsilon=0 --log_steps=125 --lr_schedule=polynomial --model_dir=outputs --momentum=0.9 --num_accumulation_steps=2 --num_classes=1000 --num_gpus=1 --optimizer=LARS --noreport_accuracy_metrics --single_l2_loss_op --noskip_eval --steps_per_loop=1252 --target_accuracy=0.759 --notf_data_experimental_slack --tf_gpu_thread_mode=gpu_private --notrace_warmup --train_epochs=41 --notraining_dataset_cache --training_prefetch_batchs=128 --nouse_synthetic_data --warmup_epochs=5 --weight_decay=0.0002
2023-05-08 12:22:23.122980: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-05-08 12:22:23.137739: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
:::MLL 1683544944.150 cache_clear: {"value": true, "metadata": {"lineno": 114, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.149741 140628087959552 mlp_log.py:80] :::MLL 1683544944.150 cache_clear: {"value": true, "metadata": {"lineno": 114, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.150 init_start: {"value": null, "metadata": {"lineno": 115, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.149899 140628087959552 mlp_log.py:80] :::MLL 1683544944.150 init_start: {"value": null, "metadata": {"lineno": 115, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.150 submission_benchmark: {"value": "resnet", "metadata": {"lineno": 116, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.150007 140628087959552 mlp_log.py:80] :::MLL 1683544944.150 submission_benchmark: {"value": "resnet", "metadata": {"lineno": 116, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.150 submission_division: {"value": "closed", "metadata": {"lineno": 117, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.150108 140628087959552 mlp_log.py:80] :::MLL 1683544944.150 submission_division: {"value": "closed", "metadata": {"lineno": 117, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.150 submission_org: {"value": "google", "metadata": {"lineno": 118, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.150207 140628087959552 mlp_log.py:80] :::MLL 1683544944.150 submission_org: {"value": "google", "metadata": {"lineno": 118, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.150 submission_platform: {"value": "gpu-v100-1", "metadata": {"lineno": 119, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.150308 140628087959552 mlp_log.py:80] :::MLL 1683544944.150 submission_platform: {"value": "gpu-v100-1", "metadata": {"lineno": 119, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.150 submission_status: {"value": "cloud", "metadata": {"lineno": 122, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.150408 140628087959552 mlp_log.py:80] :::MLL 1683544944.150 submission_status: {"value": "cloud", "metadata": {"lineno": 122, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.150436 140628087959552 common.py:617] Module ./resnet_ctl_imagenet_main.py:
I0508 12:22:24.150570 140628087959552 common.py:620] 	 flags_obj.use_tf_function = True
I0508 12:22:24.150585 140628087959552 common.py:620] 	 flags_obj.single_l2_loss_op = True
I0508 12:22:24.150598 140628087959552 common.py:620] 	 flags_obj.cache_decoded_image = False
I0508 12:22:24.150610 140628087959552 common.py:620] 	 flags_obj.enable_device_warmup = False
I0508 12:22:24.150622 140628087959552 common.py:620] 	 flags_obj.device_warmup_steps = 1
I0508 12:22:24.150634 140628087959552 common.py:620] 	 flags_obj.num_replicas = 32
I0508 12:22:24.150646 140628087959552 common.py:617] Module absl.app:
I0508 12:22:24.150659 140628087959552 common.py:620] 	 flags_obj.run_with_pdb = False
I0508 12:22:24.150671 140628087959552 common.py:620] 	 flags_obj.pdb_post_mortem = False
I0508 12:22:24.150684 140628087959552 common.py:620] 	 flags_obj.pdb = False
I0508 12:22:24.150696 140628087959552 common.py:620] 	 flags_obj.run_with_profiling = False
I0508 12:22:24.150707 140628087959552 common.py:620] 	 flags_obj.profile_file = None
I0508 12:22:24.150719 140628087959552 common.py:620] 	 flags_obj.use_cprofile_for_profiling = True
I0508 12:22:24.150730 140628087959552 common.py:620] 	 flags_obj.only_check_args = False
I0508 12:22:24.150742 140628087959552 common.py:620] 	 flags_obj.help = False
I0508 12:22:24.150753 140628087959552 common.py:620] 	 flags_obj.helpshort = False
I0508 12:22:24.150764 140628087959552 common.py:620] 	 flags_obj.helpfull = False
I0508 12:22:24.150775 140628087959552 common.py:620] 	 flags_obj.helpxml = False
I0508 12:22:24.150787 140628087959552 common.py:617] Module absl.logging:
I0508 12:22:24.150799 140628087959552 common.py:620] 	 flags_obj.logtostderr = False
I0508 12:22:24.150810 140628087959552 common.py:620] 	 flags_obj.alsologtostderr = False
I0508 12:22:24.150823 140628087959552 common.py:620] 	 flags_obj.log_dir = 
I0508 12:22:24.150835 140628087959552 common.py:620] 	 flags_obj.verbosity = 0
I0508 12:22:24.150847 140628087959552 common.py:620] 	 flags_obj.logger_levels = {}
I0508 12:22:24.150860 140628087959552 common.py:620] 	 flags_obj.stderrthreshold = fatal
I0508 12:22:24.150871 140628087959552 common.py:620] 	 flags_obj.showprefixforinfo = True
I0508 12:22:24.150883 140628087959552 common.py:617] Module absl.testing.absltest:
I0508 12:22:24.150895 140628087959552 common.py:620] 	 flags_obj.test_srcdir = 
I0508 12:22:24.150907 140628087959552 common.py:620] 	 flags_obj.test_tmpdir = /tmp/absl_testing
I0508 12:22:24.150918 140628087959552 common.py:620] 	 flags_obj.test_random_seed = 301
I0508 12:22:24.150929 140628087959552 common.py:620] 	 flags_obj.test_randomize_ordering_seed = 
I0508 12:22:24.150941 140628087959552 common.py:620] 	 flags_obj.xml_output_file = 
I0508 12:22:24.150952 140628087959552 common.py:617] Module common:
I0508 12:22:24.150964 140628087959552 common.py:620] 	 flags_obj.enable_eager = True
I0508 12:22:24.150975 140628087959552 common.py:620] 	 flags_obj.skip_eval = False
I0508 12:22:24.150986 140628087959552 common.py:620] 	 flags_obj.set_learning_phase_to_train = True
I0508 12:22:24.150998 140628087959552 common.py:620] 	 flags_obj.explicit_gpu_placement = False
I0508 12:22:24.151009 140628087959552 common.py:620] 	 flags_obj.use_trivial_model = False
I0508 12:22:24.151023 140628087959552 common.py:620] 	 flags_obj.report_accuracy_metrics = False
I0508 12:22:24.151035 140628087959552 common.py:620] 	 flags_obj.lr_schedule = polynomial
I0508 12:22:24.151046 140628087959552 common.py:620] 	 flags_obj.enable_tensorboard = False
I0508 12:22:24.151057 140628087959552 common.py:620] 	 flags_obj.train_steps = None
I0508 12:22:24.151069 140628087959552 common.py:620] 	 flags_obj.profile_steps = None
I0508 12:22:24.151080 140628087959552 common.py:620] 	 flags_obj.batchnorm_spatial_persistent = True
I0508 12:22:24.151092 140628087959552 common.py:620] 	 flags_obj.enable_get_next_as_optional = False
I0508 12:22:24.151104 140628087959552 common.py:620] 	 flags_obj.enable_checkpoint_and_export = False
I0508 12:22:24.151115 140628087959552 common.py:620] 	 flags_obj.tpu = 
I0508 12:22:24.151127 140628087959552 common.py:620] 	 flags_obj.tpu_zone = 
I0508 12:22:24.151138 140628087959552 common.py:620] 	 flags_obj.steps_per_loop = 1252
I0508 12:22:24.151149 140628087959552 common.py:620] 	 flags_obj.use_tf_while_loop = True
I0508 12:22:24.151160 140628087959552 common.py:620] 	 flags_obj.use_tf_keras_layers = False
I0508 12:22:24.151171 140628087959552 common.py:620] 	 flags_obj.base_learning_rate = 8.5
I0508 12:22:24.151183 140628087959552 common.py:620] 	 flags_obj.optimizer = LARS
I0508 12:22:24.151194 140628087959552 common.py:620] 	 flags_obj.drop_train_remainder = True
I0508 12:22:24.151205 140628087959552 common.py:620] 	 flags_obj.drop_eval_remainder = False
I0508 12:22:24.151217 140628087959552 common.py:620] 	 flags_obj.label_smoothing = 0.1
I0508 12:22:24.151229 140628087959552 common.py:620] 	 flags_obj.num_classes = 1000
I0508 12:22:24.151244 140628087959552 common.py:620] 	 flags_obj.eval_offset_epochs = 2
I0508 12:22:24.151255 140628087959552 common.py:620] 	 flags_obj.target_accuracy = 0.759
I0508 12:22:24.151266 140628087959552 common.py:617] Module lars_util:
I0508 12:22:24.151278 140628087959552 common.py:620] 	 flags_obj.end_learning_rate = None
I0508 12:22:24.151288 140628087959552 common.py:620] 	 flags_obj.lars_epsilon = 0.0
I0508 12:22:24.151300 140628087959552 common.py:620] 	 flags_obj.warmup_epochs = 5.0
I0508 12:22:24.151311 140628087959552 common.py:620] 	 flags_obj.momentum = 0.9
I0508 12:22:24.151322 140628087959552 common.py:617] Module resnet_model:
I0508 12:22:24.151334 140628087959552 common.py:620] 	 flags_obj.weight_decay = 0.0002
I0508 12:22:24.151345 140628087959552 common.py:620] 	 flags_obj.num_accumulation_steps = 2
I0508 12:22:24.151357 140628087959552 common.py:617] Module resnet_runnable:
I0508 12:22:24.151368 140628087959552 common.py:620] 	 flags_obj.trace_warmup = False
I0508 12:22:24.151380 140628087959552 common.py:617] Module tensorflow.python.ops.parallel_for.pfor:
I0508 12:22:24.151391 140628087959552 common.py:620] 	 flags_obj.op_conversion_fallback_to_while_loop = True
I0508 12:22:24.151402 140628087959552 common.py:617] Module tensorflow.python.tpu.client.client:
I0508 12:22:24.151414 140628087959552 common.py:620] 	 flags_obj.runtime_oom_exit = True
I0508 12:22:24.151425 140628087959552 common.py:620] 	 flags_obj.hbm_oom_exit = True
I0508 12:22:24.151436 140628087959552 common.py:617] Module tensorflow.python.tpu.tensor_tracer_flags:
I0508 12:22:24.151448 140628087959552 common.py:620] 	 flags_obj.delta_threshold = 0.5
I0508 12:22:24.151459 140628087959552 common.py:620] 	 flags_obj.tt_check_filter = False
I0508 12:22:24.151470 140628087959552 common.py:620] 	 flags_obj.tt_single_core_summaries = False
I0508 12:22:24.151482 140628087959552 common.py:617] Module tf2_common.utils.flags._base:
I0508 12:22:24.151493 140628087959552 common.py:620] 	 flags_obj.data_dir = ../../../imagenet/tf_records/
I0508 12:22:24.151504 140628087959552 common.py:620] 	 flags_obj.model_dir = outputs
I0508 12:22:24.151515 140628087959552 common.py:620] 	 flags_obj.clean = True
I0508 12:22:24.151526 140628087959552 common.py:620] 	 flags_obj.train_epochs = 41
I0508 12:22:24.151538 140628087959552 common.py:620] 	 flags_obj.epochs_between_evals = 4
I0508 12:22:24.151549 140628087959552 common.py:620] 	 flags_obj.batch_size = 1024
I0508 12:22:24.151561 140628087959552 common.py:620] 	 flags_obj.num_gpus = 1
I0508 12:22:24.151572 140628087959552 common.py:620] 	 flags_obj.run_eagerly = False
I0508 12:22:24.151583 140628087959552 common.py:620] 	 flags_obj.distribution_strategy = mirrored
I0508 12:22:24.151594 140628087959552 common.py:617] Module tf2_common.utils.flags._benchmark:
I0508 12:22:24.151606 140628087959552 common.py:620] 	 flags_obj.benchmark_logger_type = BaseBenchmarkLogger
I0508 12:22:24.151617 140628087959552 common.py:620] 	 flags_obj.benchmark_test_id = None
I0508 12:22:24.151628 140628087959552 common.py:620] 	 flags_obj.log_steps = 125
I0508 12:22:24.151639 140628087959552 common.py:620] 	 flags_obj.benchmark_log_dir = None
I0508 12:22:24.151650 140628087959552 common.py:620] 	 flags_obj.gcp_project = None
I0508 12:22:24.151661 140628087959552 common.py:620] 	 flags_obj.bigquery_data_set = test_benchmark
I0508 12:22:24.151672 140628087959552 common.py:620] 	 flags_obj.bigquery_run_table = benchmark_run
I0508 12:22:24.151683 140628087959552 common.py:620] 	 flags_obj.bigquery_run_status_table = benchmark_run_status
I0508 12:22:24.151695 140628087959552 common.py:620] 	 flags_obj.bigquery_metric_table = benchmark_metric
I0508 12:22:24.151706 140628087959552 common.py:617] Module tf2_common.utils.flags._distribution:
I0508 12:22:24.151717 140628087959552 common.py:620] 	 flags_obj.worker_hosts = None
I0508 12:22:24.151728 140628087959552 common.py:620] 	 flags_obj.task_index = -1
I0508 12:22:24.151739 140628087959552 common.py:617] Module tf2_common.utils.flags._misc:
I0508 12:22:24.151751 140628087959552 common.py:620] 	 flags_obj.data_format = None
I0508 12:22:24.151762 140628087959552 common.py:617] Module tf2_common.utils.flags._performance:
I0508 12:22:24.151773 140628087959552 common.py:620] 	 flags_obj.use_synthetic_data = False
I0508 12:22:24.151784 140628087959552 common.py:620] 	 flags_obj.dtype = fp32
I0508 12:22:24.151796 140628087959552 common.py:620] 	 flags_obj.loss_scale = None
I0508 12:22:24.151807 140628087959552 common.py:620] 	 flags_obj.fp16_implementation = keras
I0508 12:22:24.151818 140628087959552 common.py:620] 	 flags_obj.all_reduce_alg = None
I0508 12:22:24.151829 140628087959552 common.py:620] 	 flags_obj.num_packs = 1
I0508 12:22:24.151840 140628087959552 common.py:620] 	 flags_obj.tf_gpu_thread_mode = gpu_private
I0508 12:22:24.151852 140628087959552 common.py:620] 	 flags_obj.per_gpu_thread_count = 0
I0508 12:22:24.151863 140628087959552 common.py:620] 	 flags_obj.datasets_num_private_threads = 32
I0508 12:22:24.151874 140628087959552 common.py:620] 	 flags_obj.training_dataset_cache = False
I0508 12:22:24.151886 140628087959552 common.py:620] 	 flags_obj.training_prefetch_batchs = 128
I0508 12:22:24.151897 140628087959552 common.py:620] 	 flags_obj.eval_dataset_cache = False
I0508 12:22:24.151908 140628087959552 common.py:620] 	 flags_obj.eval_prefetch_batchs = 192
I0508 12:22:24.151919 140628087959552 common.py:620] 	 flags_obj.tf_data_experimental_slack = False
I0508 12:22:24.151931 140628087959552 common.py:620] 	 flags_obj.enable_xla = False
I0508 12:22:24.151942 140628087959552 common.py:620] 	 flags_obj.force_v2_in_keras_compile = None
WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0
W0508 12:22:24.153517 140628087959552 cross_device_ops.py:1382] Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0
2023-05-08 12:22:24.159656: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1780] (One-time warning): Not using XLA:CPU for cluster.

If you want XLA:CPU, do one of the following:

 - set the TF_XLA_FLAGS to include "--tf_xla_cpu_global_jit", or
 - set cpu_global_jit to true on this session's OptimizerOptions, or
 - use experimental_jit_scope, or
 - use tf.function(jit_compile=True).

To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a
proper command-line flag, not via TF_XLA_FLAGS).
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0508 12:22:24.161878 140628087959552 mirrored_strategy.py:374] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
:::MLL 1683544944.162 global_batch_size: {"value": 1024, "metadata": {"lineno": 155, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.162150 140628087959552 mlp_log.py:80] :::MLL 1683544944.162 global_batch_size: {"value": 1024, "metadata": {"lineno": 155, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.162 train_samples: {"value": 1281167, "metadata": {"lineno": 156, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.162269 140628087959552 mlp_log.py:80] :::MLL 1683544944.162 train_samples: {"value": 1281167, "metadata": {"lineno": 156, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.162 eval_samples: {"value": 50000, "metadata": {"lineno": 158, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.162377 140628087959552 mlp_log.py:80] :::MLL 1683544944.162 eval_samples: {"value": 50000, "metadata": {"lineno": 158, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.162 model_bn_span: {"value": 1024, "metadata": {"lineno": 160, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.162482 140628087959552 mlp_log.py:80] :::MLL 1683544944.162 model_bn_span: {"value": 1024, "metadata": {"lineno": 160, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.162510 140628087959552 resnet_ctl_imagenet_main.py:169] Training 42 epochs, each epoch has 1251 steps, total steps: 52542; Eval 49 steps
:::MLL 1683544944.641 opt_name: {"value": "lars", "metadata": {"lineno": 101, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.640697 140628087959552 mlp_log.py:80] :::MLL 1683544944.641 opt_name: {"value": "lars", "metadata": {"lineno": 101, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.641 lars_epsilon: {"value": 0.0, "metadata": {"lineno": 103, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.640894 140628087959552 mlp_log.py:80] :::MLL 1683544944.641 lars_epsilon: {"value": 0.0, "metadata": {"lineno": 103, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.641 lars_opt_weight_decay: {"value": 0.0002, "metadata": {"lineno": 104, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.641026 140628087959552 mlp_log.py:80] :::MLL 1683544944.641 lars_opt_weight_decay: {"value": 0.0002, "metadata": {"lineno": 104, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.641 lars_opt_base_learning_rate: {"value": 8.5, "metadata": {"lineno": 106, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.641152 140628087959552 mlp_log.py:80] :::MLL 1683544944.641 lars_opt_base_learning_rate: {"value": 8.5, "metadata": {"lineno": 106, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.641 lars_opt_learning_rate_warmup_epochs: {"value": 5.0, "metadata": {"lineno": 108, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.641273 140628087959552 mlp_log.py:80] :::MLL 1683544944.641 lars_opt_learning_rate_warmup_epochs: {"value": 5.0, "metadata": {"lineno": 108, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.641 lars_opt_end_learning_rate: {"value": 0.0001, "metadata": {"lineno": 110, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.641392 140628087959552 mlp_log.py:80] :::MLL 1683544944.641 lars_opt_end_learning_rate: {"value": 0.0001, "metadata": {"lineno": 110, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.642 lars_opt_learning_rate_decay_steps: {"value": 45037, "metadata": {"lineno": 115, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.641583 140628087959552 mlp_log.py:80] :::MLL 1683544944.642 lars_opt_learning_rate_decay_steps: {"value": 45037, "metadata": {"lineno": 115, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.642 lars_opt_learning_rate_decay_poly_power: {"value": 2.0, "metadata": {"lineno": 117, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.641716 140628087959552 mlp_log.py:80] :::MLL 1683544944.642 lars_opt_learning_rate_decay_poly_power: {"value": 2.0, "metadata": {"lineno": 117, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.642 lars_opt_momentum: {"value": 0.9, "metadata": {"lineno": 119, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
I0508 12:22:24.641841 140628087959552 mlp_log.py:80] :::MLL 1683544944.642 lars_opt_momentum: {"value": 0.9, "metadata": {"lineno": 119, "file": "/mored/home/arjun/training/image_classification/tensorflow2/lars_util.py"}}
:::MLL 1683544944.687 init_stop: {"value": null, "metadata": {"lineno": 223, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.686844 140628087959552 mlp_log.py:80] :::MLL 1683544944.687 init_stop: {"value": null, "metadata": {"lineno": 223, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.687 run_start: {"value": null, "metadata": {"lineno": 232, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.687031 140628087959552 mlp_log.py:80] :::MLL 1683544944.687 run_start: {"value": null, "metadata": {"lineno": 232, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
:::MLL 1683544944.687 block_start: {"value": null, "metadata": {"first_epoch_num": 1, "epoch_count": 2, "lineno": 233, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.687142 140628087959552 mlp_log.py:80] :::MLL 1683544944.687 block_start: {"value": null, "metadata": {"first_epoch_num": 1, "epoch_count": 2, "lineno": 233, "file": "/mored/home/arjun/training/image_classification/tensorflow2/./resnet_ctl_imagenet_main.py"}}
I0508 12:22:24.689408 140628087959552 controller.py:247] Train at step 0 of 52542
I0508 12:22:24.689451 140628087959552 controller.py:251] Entering training loop with 1251 steps, at step 0 of 52542
WARNING:tensorflow:From /mored/home/arjun/training/image_classification/tensorflow2/tf2_common/training/utils.py:139: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W0508 12:22:24.689529 140628087959552 deprecation.py:364] From /mored/home/arjun/training/image_classification/tensorflow2/tf2_common/training/utils.py:139: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
I0508 12:22:24.691895 140628087959552 imagenet_preprocessing.py:338] Sharding the dataset: input_pipeline_id=0 num_input_pipelines=1
W0508 12:22:24.699523 140628087959552 options.py:599] options.experimental_threading is deprecated. Use options.threading instead.
I0508 12:22:24.700093 140628087959552 imagenet_preprocessing.py:104] datasets_num_private_threads: 32
I0508 12:22:24.700675 140628087959552 imagenet_preprocessing.py:118] Num classes: 1000
I0508 12:22:24.700706 140628087959552 imagenet_preprocessing.py:119] One hot: True
2023-05-08 12:22:24.965010: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [1024]
	 [[{{node Placeholder/_0}}]]
2023-05-08 12:22:24.965187: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [1024]
	 [[{{node Placeholder/_0}}]]
2023-05-08 12:22:25.004279: W tensorflow/core/framework/dataset.cc:807] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
2023-05-08 12:22:25.004420: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype variant
	 [[{{node Placeholder/_0}}]]
2023-05-08 12:22:25.028428: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'num_steps' with dtype int32
	 [[{{node num_steps}}]]
WARNING:tensorflow:From /home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py:458: calling function (from tensorflow.python.eager.polymorphic_function.polymorphic_function) with experimental_compile is deprecated and will be removed in a future version.
Instructions for updating:
experimental_compile is deprecated, use jit_compile instead
W0508 12:22:25.188507 140628087959552 deprecation.py:569] From /home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py:458: calling function (from tensorflow.python.eager.polymorphic_function.polymorphic_function) with experimental_compile is deprecated and will be removed in a future version.
Instructions for updating:
experimental_compile is deprecated, use jit_compile instead
INFO:tensorflow:Error reported to Coordinator: PolynomialDecayWithWarmup.__call__() missing 1 required positional argument: 'step'
Traceback (most recent call last):
  File "/home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/training/coordinator.py", line 293, in stop_on_exception
    yield
  File "/home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/distribute/mirrored_run.py", line 387, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/tmp/__autograph_generated_file0rl97hcp.py", line 144, in _apply_grads_and_clear_for_each_replica
    ag__.converted_call(ag__.ld(self).optimizer.apply_gradients, (ag__.converted_call(ag__.ld(zip), (ag__.ld(replica_accum_grads), ag__.ld(self).training_vars), None, fscope_3),), None, fscope_3)
  File "/home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
    return f(*args)
  File "/home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 665, in apply_gradients
    apply_state = self._prepare(var_list)
  File "/home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 947, in _prepare
    self._prepare_local(var_device, var_dtype, apply_state)
  File "/mored/home/arjun/training/image_classification/tensorflow2/lars_optimizer.py", line 114, in _prepare_local
    lr_t = self._get_hyper("learning_rate", var_dtype)
  File "/home/arjun/.local/lib/python3.10/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 804, in _get_hyper
    value = value()
TypeError: PolynomialDecayWithWarmup.__call__() missing 1 required positional argument: 'step'
I0508 12:22:25.691070 140615434081856 coordinator.py:213] Error reported to Coordinator: PolynomialDecayWithWarmup.__call__() missing 1 required positional argument: 'step'
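
The last two frames of the traceback show `_get_hyper` calling `value()` with no arguments, even though `PolynomialDecayWithWarmup.__call__` needs a `step`. One possible explanation, which is an assumption and not confirmed in this thread, is that the optimizer only treats the learning rate as a schedule when it is an instance of one particular base class, and calls any other callable with no arguments. If the schedule inherits from a same-named base class in a different package, that check would fail. The sketch below reproduces that failure mode in plain Python. `ScheduleBase`, `OtherScheduleBase`, `get_hyper`, and `reproduce` are hypothetical stand-ins, not TensorFlow code.

```python
class ScheduleBase:
    """Stand-in for the schedule base class the optimizer checks against."""


class OtherScheduleBase:
    """Stand-in for a same-named base class imported from a different package."""


class PolynomialDecayWithWarmup(OtherScheduleBase):
    """Toy schedule: only the call signature matters for this sketch."""

    def __call__(self, step):
        return 0.1 * step


def get_hyper(value):
    """Simplified imitation of the dispatch seen in the traceback (assumed logic)."""
    if isinstance(value, ScheduleBase):
        # Recognised schedule: returned as-is so it can be called with a step later.
        return value
    if callable(value):
        # Any other callable is invoked with no arguments.
        return value()
    return value


def reproduce():
    """Return the TypeError message raised when the schedule is not recognised."""
    try:
        get_hyper(PolynomialDecayWithWarmup())
    except TypeError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(reproduce())
```

If this is what is happening, the two base classes would need to come from the same package for the `isinstance` check to pass. That would need to be verified against the installed TensorFlow sources.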
@Daming-wang

Try downgrading your TensorFlow version to 2.4.x, as well as its corresponding CUDA and cuDNN versions.

@arjunsuresh
Contributor Author

Thank you @Daming-wang for the suggestion. We'll try that, but for the current submission we'll be going with the NVIDIA code.

For the reference implementations, should we document the version requirements somewhere? A lot of people will be trying to run them.
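
Besides documenting the versions, the script could fail early with a readable message instead of a deep traceback. Below is a minimal sketch of such a guard. The `2.4` series comes from the suggestion above and has not been verified here. `SUPPORTED_MAJOR_MINOR`, `parse_major_minor`, and `check_supported` are hypothetical names, not part of the reference code.

```python
import re

# Assumed supported series, taken from the 2.4.x suggestion in this thread.
SUPPORTED_MAJOR_MINOR = (2, 4)


def parse_major_minor(version):
    """Return (major, minor) from a version string such as '2.12.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_supported(version, supported=SUPPORTED_MAJOR_MINOR):
    """Raise a clear error if the version is outside the supported series."""
    found = parse_major_minor(version)
    if found != supported:
        raise RuntimeError(
            f"TensorFlow {version} is outside the series assumed for this "
            f"reference ({supported[0]}.{supported[1]}.x)"
        )
    return found
```

It could be called once at startup, for example `check_supported(tf.__version__)` before any flags are parsed.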

@hiwotadese
Contributor

Closing because the benchmark is retired.
