tf Softmax input dimension error #11
Hi Mo, Thanks for reaching out! Could you please clarify which gin configuration files you are using? E.g., please post the exact command you are running and whether you made any changes to the gin configurations.

Best
Hi Dr. Tim, Thank you so much for your help. I am running:

cd /home/47800/SocialNavigation_v2/human-scene-transformer/human_scene_transformer
export PYTHONPATH=/path/to/directory:/home/47800/SocialNavigation_v2/human-scene-transformer
python3 train.py --model_base_dir=./model/jrdb --gin_files=./config/jrdb/training_params.gin --gin_files=./config/jrdb/model_params.gin --gin_files=./config/jrdb/dataset_params.gin --gin_files=./config/jrdb/metrics.gin --dataset=JRDB

with the differences:

--- /home/47800/originalHST/human-scene-transformer/human_scene_transformer/config/jrdb/dataset_params.gin 2023-10-18 02:40:21.036342842 +0000
+++ /home/47800/SocialNavigation_v2/human-scene-transformer/human_scene_transformer/config/jrdb/dataset_params.gin 2023-10-15 21:01:53.859915432 +0000
@@ -55,7 +55,7 @@
     'tressider-2019-04-26_3_test']
-JRDBDatasetParams.path = '<dataset_path>'
+JRDBDatasetParams.path = '/home/47800/SocialNavigation_v2/pre_tf_dataset'
 JRDBDatasetParams.train_scenes = %TRAIN_SCENES
 JRDBDatasetParams.eval_scenes = %TEST_SCENES

In '/home/47800/SocialNavigation_v2/pre_tf_dataset' I have:

(base) 47800@instance-3:~/originalHST$ ls /home/47800/SocialNavigation_v2/pre_tf_dataset
bytes-cafe-2019-02-07_0 hewlett-class-2019-01-23_0_test outdoor-coupa-cafe-2019-02-06_0_test
clark-center-2019-02-28_0 hewlett-class-2019-01-23_1_test packard-poster-session-2019-03-20_0
clark-center-2019-02-28_1 hewlett-packard-intersection-2019-01-24_0 packard-poster-session-2019-03-20_1
clark-center-intersection-2019-02-28_0 huang-2-2019-01-25_0 packard-poster-session-2019-03-20_2
cubberly-auditorium-2019-04-22_0 huang-2-2019-01-25_1_test quarry-road-2019-02-28_0_test
cubberly-auditorium-2019-04-22_1_test huang-basement-2019-01-25_0 serra-street-2019-01-30_0_test
discovery-walk-2019-02-28_0_test huang-intersection-2019-01-22_0_test stlc-111-2019-04-19_0
discovery-walk-2019-02-28_1_test huang-lane-2019-02-12_0 stlc-111-2019-04-19_1_test
food-trucks-2019-02-12_0_test indoor-coupa-cafe-2019-02-06_0_test stlc-111-2019-04-19_2_test
forbes-cafe-2019-01-22_0 jordan-hall-2019-04-22_0 svl-meeting-gates-2-2019-04-08_0
gates-159-group-meeting-2019-04-03_0 lomita-serra-intersection-2019-01-30_0_test svl-meeting-gates-2-2019-04-08_1
gates-ai-lab-2019-02-08_0 memorial-court-2019-03-16_0 tressider-2019-03-16_0
gates-ai-lab-2019-04-17_0_test meyer-green-2019-03-16_0 tressider-2019-03-16_1
gates-basement-elevators-2019-01-17_0_test meyer-green-2019-03-16_1_test tressider-2019-03-16_2_test
gates-basement-elevators-2019-01-17_1 nvidia-aud-2019-01-25_0_test tressider-2019-04-26_0_test
gates-foyer-2019-01-17_0_test nvidia-aud-2019-04-18_0 tressider-2019-04-26_1_test
gates-to-clark-2019-02-28_0_test nvidia-aud-2019-04-18_1_test tressider-2019-04-26_2
gates-to-clark-2019-02-28_1 nvidia-aud-2019-04-18_2_test tressider-2019-04-26_3_test Updated: tf.print('\n**************ds_train:**************\n',ds_train)
tf.print('\n**************dist_train_dataset:**************\n',dist_train_dataset) got **************ds_train:**************
<_ShuffleDataset element_spec={'agents/position': TensorSpec(shape=(8, 19, 2), dtype=tf.float32, name=None), 'agents/orientation': TensorSpec(shape=(8, 19, 1), dtype=tf.float32, name=None), 'agents/keypoints': TensorSpec(shape=(8, 19, 99), dtype=tf.float32, name=None), 'robot/position': TensorSpec(shape=(19, 3), dtype=tf.float32, name=None), 'scene/id': TensorSpec(shape=(), dtype=tf.string, name=None), 'scene/timestamp': TensorSpec(shape=(), dtype=tf.int64, name=None), 'agents/gaze': TensorSpec(shape=(8, 19, 1), dtype=tf.float32, name=None)}>
**************dist_train_dataset:**************
<tensorflow.python.distribute.input_lib.DistributedDataset object at 0x7f39da9e83a0>

Q: Might there be a problem with 'scene/timestamp': TensorSpec(shape=(), dtype=tf.int64, name=None)?

I also ran

tf.print('samples:',
tf.data.experimental.sample_from_datasets(ds_train, weights=None, seed=None, stop_on_empty_dataset=False)
)

and got:

File "/home/47800/miniconda3/envs/hstpy310/lib/python3.10/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 526, in __len__
raise TypeError("The dataset is infinite.")
TypeError: The dataset is infinite.

Q: Is the dataset supposed to be infinite?
Hi Mo, Unfortunately, we are still struggling to reproduce the error. Does the error occur in (or even before) the very first training iteration? Does the same error occur when you are evaluating rather than training?

Regarding the shapes you are getting: the timestep is a scalar, so this is expected.

Yes! Once the train dataset reaches its end it will be repeated; see human-scene-transformer/human_scene_transformer/jrdb/input_fn.py, lines 696 to 697 at 7e9b927.
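This is standard tf.data behavior: a dataset that has been repeated without a count has infinite cardinality, so `len()` raises `TypeError`. A minimal pure-Python analogue (my illustration, not the HST code) of why a cycled stream has no defined length:

```python
import itertools

# A finite "dataset" of three elements, cycled endlessly -- analogous to
# tf.data.Dataset.from_tensor_slices([...]).repeat() in the training pipeline.
finite = [1, 2, 3]
infinite = itertools.cycle(finite)

# The repeated stream yields elements forever, wrapping around...
first_seven = [next(infinite) for _ in range(7)]
print(first_seven)  # [1, 2, 3, 1, 2, 3, 1]

# ...so asking for its length cannot succeed, just like calling
# len() on an infinite tf.data pipeline.
try:
    len(infinite)
except TypeError as err:
    print("no length:", err)
```

An infinite training dataset is intentional here: the training loop decides when to stop, not the dataset.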
Yes, at the first iteration.
/home/47800/miniconda3/envs/hstpy310/lib/python3.10/site-packages/tensorflow/python/data/ops/map_op.py:35: UserWarning: The `deterministic` argument has no effect unless the `num_parallel_calls` argument is specified.
warnings.warn("The `deterministic` argument has no effect unless the "
2023-10-18 18:13:54.933894: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at mkl_softmax_op.cc:252 : ABORTED: Input dims must be <= 5 and >=1
Traceback (most recent call last):
File "/home/47800/SocialNavigation_v2/human-scene-transformer/human_scene_transformer/jrdb/eval.py", line 166, in <module>
app.run(main)
File "/home/47800/miniconda3/envs/hstpy310/lib/python3.10/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/home/47800/miniconda3/envs/hstpy310/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/home/47800/SocialNavigation_v2/human-scene-transformer/human_scene_transformer/jrdb/eval.py", line 162, in main
evaluation(_CKPT_PATH.value)
File "/home/47800/SocialNavigation_v2/human-scene-transformer/human_scene_transformer/jrdb/eval.py", line 64, in evaluation
_, _ = model(next(iter(dataset.batch(1))), training=False)
File "/home/47800/miniconda3/envs/hstpy310/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/47800/SocialNavigation_v2/human-scene-transformer/human_scene_transformer/model/model.py", line 178, in call
input_batch = self.agent_encoding_layer(input_batch, training=training)
File "/home/47800/SocialNavigation_v2/human-scene-transformer/human_scene_transformer/model/agent_encoder.py", line 189, in call
attn_out, attn_score = self.attn_layer(
tensorflow.python.framework.errors_impl.AbortedError: Exception encountered when calling layer 'softmax' (type Softmax).
{{function_node __wrapped__Softmax_device_/job:localhost/replica:0/task:0/device:CPU:0}} Input dims must be <= 5 and >=1 [Op:Softmax] name:
Call arguments received by layer 'softmax' (type Softmax):
• inputs=tf.Tensor(shape=(1, 11, 19, 4, 1, 4), dtype=float32)
• mask=tf.Tensor(shape=(1, 11, 19, 1, 1, 4), dtype=bool)

The good news is that the dataset has been loaded as expected, but there might be an extra dimension in the input data.
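For context: the failure comes from the MKL softmax kernel, which (per the error message) only supports tensors of rank 1 to 5, while the attention logits here are rank 6. The softmax math itself works at any rank. A hedged numpy sketch of a masked softmax over the last axis with the exact shapes from the traceback (the meaning I assign to each axis is my assumption, not taken from the repo):

```python
import numpy as np

def masked_softmax(logits, mask, axis=-1):
    """Softmax over `axis`, giving masked-out positions ~0 weight."""
    logits = np.where(mask, logits, -1e9)                    # suppress masked slots
    logits = logits - logits.max(axis=axis, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=axis, keepdims=True)

# Rank-6 shapes from the traceback; the mask broadcasts over the size-4 axis.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(1, 11, 19, 4, 1, 4)).astype(np.float32)
mask = np.ones((1, 11, 19, 1, 1, 4), dtype=bool)

attn = masked_softmax(inputs, mask)
print(attn.shape)  # (1, 11, 19, 4, 1, 4) -- each last-axis slice sums to ~1
```

So the rank-6 input is not itself wrong; the MKL-backed `Softmax` op is just unable to handle it, which is why disabling MKL (below) resolves the error.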
Hi Mo, Could you please try to set the environment variable TF_DISABLE_MKL=1? Thanks
Hi Tim,
Now I am using the command lines, export TF_DISABLE_MKL=1
cd /home/47800/SocialNavigation_v2/human-scene-transformer/human_scene_transformer
export PYTHONPATH=/path/to/directory:/home/47800/SocialNavigation_v2/human-scene-transformer
python /home/47800/SocialNavigation_v2/human-scene-transformer/human_scene_transformer/jrdb/eval.py --model_path=./model/jrdb/ --checkpoint_path=./model/jrdb/ckpts/ckpt-30

but I still get the additional dim:

File "/home/47800/SocialNavigation_v2/human-scene-transformer/human_scene_transformer/jrdb/eval.py", line 64, in evaluation
_, _ = model(next(iter(dataset.batch(1))), training=False)
File "/home/47800/miniconda3/envs/hstpy310/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/47800/SocialNavigation_v2/human-scene-transformer/human_scene_transformer/model/model.py", line 178, in call
input_batch = self.agent_encoding_layer(input_batch, training=training)
File "/home/47800/SocialNavigation_v2/human-scene-transformer/human_scene_transformer/model/agent_encoder.py", line 189, in call
attn_out, attn_score = self.attn_layer(
tensorflow.python.framework.errors_impl.AbortedError: Exception encountered when calling layer 'softmax' (type Softmax).
{{function_node __wrapped__Softmax_device_/job:localhost/replica:0/task:0/device:CPU:0}} Input dims must be <= 5 and >=1 [Op:Softmax] name:
Call arguments received by layer 'softmax' (type Softmax):
• inputs=tf.Tensor(shape=(1, 11, 19, 4, 1, 4), dtype=float32)
• mask=tf.Tensor(shape=(1, 11, 19, 1, 1, 4), dtype=bool)

Do you know which dim is redundant? BTW, is the 'M1' chip in the runtime table the Apple M1 or the Google Cloud M1?
Hi Mo, Could you, in addition, also set

Could you please outline the procedure of how you installed tensorflow? Conda / pip?

Unfortunately, none of the dimensions is redundant.

This is the Apple M1.

Best
Hi Tim
Thank you so much. It works :)
I installed it similar to the commands below:

conda create -n env_name python=3.x
pip3 install -r human-scene-transformer/requirement.txt

Btw, could you tell me why it works with the new environment variable setting?

Q:

I1020 17:57:29.060910 140299564987008 train_model.py:256] Beginning training.
I1020 17:57:29.061050 140299564987008 train_model.py:259] 0
I1020 17:57:29.164841 140299564987008 api.py:460] train_step
I1020 17:57:34.702615 140299564987008 api.py:460] iter
I1020 17:57:35.291868 140299564987008 api.py:460] train_step
I1020 17:57:38.506911 140299564987008 api.py:460] iter
2023-10-20 17:57:51.474034: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 3825205248 exceeds 10% of free system memory.
2023-10-20 17:57:51.608297: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 3825205248 exceeds 10% of free system memory.
2023-10-20 17:57:52.013815: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 3825205248 exceeds 10% of free system memory.
2023-10-20 17:57:52.038450: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 3825205248 exceeds 10% of free system memory.
2023-10-20 17:58:01.058731: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 3825205248 exceeds 10% of free system memory.

But fortunately, eval.py runs well with the checkpoint:

3881it [21:54, 2.95it/s]
MinADE: 0.26
MinADE @ 1s: 0.12
MinADE @ 2s: 0.20
MinADE @ 3s: 0.28
MinADE @ 4s: 0.37
MLADE: 0.45
MLADE @ 1s: 0.21
MLADE @ 2s: 0.39
MLADE @ 3s: 0.56
MLADE @ 4s: 0.71
NLL: -0.59
NLL @ 1s: -0.90
NLL @ 2s: -0.65
NLL @ 3s: -0.08
NLL @ 4s: 0.32
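For reference, MinADE-style metrics take the best of K predicted trajectories: the minimum over modes of the mean Euclidean displacement from the ground truth (the "@ Ns" rows presumably restrict the average to the first N seconds of the horizon). A small numpy sketch of the core computation (my illustration, not the repo's metric code):

```python
import numpy as np

def min_ade(pred, gt):
    """MinADE for one agent.

    pred: (K, T, 2) K candidate trajectories; gt: (T, 2) ground truth.
    Returns the minimum over modes of the mean per-step L2 displacement.
    """
    dists = np.linalg.norm(pred - gt[None], axis=-1)  # (K, T) per-step errors
    ade_per_mode = dists.mean(axis=-1)                # (K,) average error per mode
    return ade_per_mode.min()

# Two candidate trajectories over a 3-step horizon.
gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = np.array([
    [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]],   # off by 1.0 m at every step
    [[0.0, 0.5], [1.0, 0.5], [2.0, 0.5]],   # off by 0.5 m at every step
])
print(min_ade(pred, gt))  # 0.5 -- the better of the two modes
```

MLADE and NLL instead score the most-likely mode and the log-likelihood of the predicted distribution, so they are not minima over modes.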
Hi Mo, Glad it works now!
MKL is a TensorFlow backend by Intel, optimized for their CPUs. Unfortunately, it does not support tensor ranks > 5 for some operations, making it incompatible with this codebase (we added a note in the README).
Not necessarily. We combine many training iterations in one

Feel free to close the issue should this have solved your problem.

Best
Hi Dr. Tim, It is solved perfectly. Thank you so much! Best regards,
System version: Gcloud debian 11
Cpu: C3 8vCPU
Memory: 64 GB
Software version 1:

Traceback:

When I try to run 'train.py', this error occurs. I think it might be caused by the data loading or by the preprocessed data itself. I am still trying to fix it...

Software version 2:

With this version the behavior changes.