
training details #7

Closed
kuaileqipaoshui opened this issue Mar 17, 2024 · 11 comments

Comments

@kuaileqipaoshui

RuntimeError: probability tensor contains either inf, nan or element < 0

I hit this error during evaluation. I suspect a problem with the installed package versions. Could you share the versions of your installed packages?
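For context on what the error means: `torch.multinomial`, which text generation uses for sampling, rejects any probability row that contains `nan`, `inf`, or a negative entry. A stdlib-only sketch of the same validity check (useful for debugging which rows are bad; this is my paraphrase of the constraint, not code from this repo):

```python
import math

def valid_probability_row(probs):
    """Rough mirror of the checks torch.multinomial performs before
    sampling: every entry must be finite and non-negative, and the
    row must carry positive total mass."""
    if any(math.isnan(p) or math.isinf(p) or p < 0 for p in probs):
        return False
    return sum(probs) > 0

print(valid_probability_row([0.7, 0.2, 0.1]))      # True
print(valid_probability_row([0.5, float("nan")]))  # False
```

A row failing this check usually traces back to `nan` logits produced earlier in the forward pass, which is why mismatched package versions can surface here.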

@ch3cook-fdu
Contributor

Please try building the environment in the following order:

  1. Set up the conda environment:
conda create -n ll3da python=3.8
conda activate ll3da
  2. Install PyTorch:
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
  3. Install the other packages:
pip install h5py scipy cython plyfile 'trimesh>=2.35.39,<2.35.40' 'transformers>=4.37.0'
  4. Build the pointnet++ and GIoU support:
cd third_party/pointnet2
python setup.py install
cd ../../utils
python cython_compile.py build_ext --inplace
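After these steps, a quick way to confirm the pins took effect is to query the installed distributions (a stdlib-only sketch; the package names are just the ones installed above):

```python
from importlib import metadata

def installed_version(pkg):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None

# After the steps above, these should report the pinned versions
# (e.g. torch -> 1.13.1+cu116):
for pkg in ("torch", "torchvision", "torchaudio", "transformers", "trimesh"):
    print(pkg, installed_version(pkg))
```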

If the issue still appears, please let me know.

@kuaileqipaoshui
Author

> conda activate ll3da

Thanks, I will try. I have a question: looking at the script, the evaluation also uses only the trained generalist checkpoint (--test_ckpt ./ckpts/opt-1.3b/ll3da-generalist/checkpoint.pth); the tuned checkpoints are not used. What are the tuned checkpoints for?

@ch3cook-fdu
Contributor

We train our model on the combination of Nr3D and ScanRefer for describing objects. However, these two datasets are annotated in different styles, so the model needs to be tuned on each dataset separately.


@ch3cook-fdu
Contributor

Since LL3DA is a 3D generalist, it can distinguish different tasks given human interactions. You can directly evaluate on ScanQA with the generalist checkpoint, or try fine-tuning it.

@kuaileqipaoshui
Author

> Since LL3DA is a 3D generalist, it can distinguish different tasks given human interactions. You can directly evaluate on ScanQA with the generalist checkpoint, or try fine-tuning it.

----------------------Evaluation-----------------------
INFO: iou@0.5 matched proposals: [1525 / 2068],
[BLEU-1] Mean: 0.6246, Max: 1.0000, Min: 0.0000
[BLEU-2] Mean: 0.5269, Max: 1.0000, Min: 0.0000
[BLEU-3] Mean: 0.4311, Max: 1.0000, Min: 0.0000
[BLEU-4] Mean: 0.3519, Max: 1.0000, Min: 0.0000
[CIDEr] Mean: 0.5911, Max: 5.4976, Min: 0.0000
[ROUGE-L] Mean: 0.5407, Max: 1.0000, Min: 0.1015
[METEOR] Mean: 0.2519, Max: 1.0000, Min: 0.0448

When I directly evaluate on ScanQA with the generalist checkpoint, I get the result above. The C@0.5 result is very different from the paper, while the other metrics are similar. Why is this?

@ch3cook-fdu
Contributor

It seems the result you listed comes from the ScanRefer dataset for 3D dense captioning.

The results differ mainly because of: (1) randomness in data pre-processing (point downsampling), (2) different PyTorch versions, and (3) randomness in training.

Please refer to: ch3cook-fdu/Vote2Cap-DETR#12 for more information.

Also, you are encouraged to check out the training log to see whether the performance aligns.

Additionally, the performance of 3D dense captioning might differ a little, since we do not distinguish ScanRefer from Nr3D during training. You may want to tune the model on each dataset for 3D dense captioning.
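For reference, metrics like C@0.5 in 3D dense captioning are commonly computed Scan2Cap-style: the captioning score is summed over proposals matched to ground-truth boxes at IoU >= 0.5 and divided by the total number of annotated boxes, so unmatched boxes count as zero (my reading of the usual convention, not code from this repo). This is why the matched-proposal ratio directly scales the reported number:

```python
def metric_at_iou(matched_scores, num_annotations):
    """m@kIoU for 3D dense captioning (Scan2Cap-style convention,
    assumed here): sum the captioning metric over matched proposals,
    then divide by the total number of annotated boxes, so every
    unmatched box contributes 0."""
    return sum(matched_scores) / num_annotations

# With 1525 of 2068 proposals matched at iou@0.5, even a high average
# score on the matched set is scaled down by the ~74% match rate:
print(round(metric_at_iou([0.8] * 1525, 2068), 4))  # 0.5899
```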

@matthewdm0816

Hi, I tried train.generalist.sh, but I can't reproduce performance close to that reported in the paper. The only change is a batch size of 24 instead of 4, to speed up training.

Here are the eval logs on ScanQA, Nr3D, and ScanRefer at the 20th epoch:
----------------------Evaluation-----------------------

[BLEU-1] Mean: 0.3028, Max: 1.0000, Min: 0.0000
[BLEU-2] Mean: 0.1904, Max: 1.0000, Min: 0.0000
[BLEU-3] Mean: 0.1283, Max: 1.0000, Min: 0.0000
[BLEU-4] Mean: 0.0875, Max: 1.0000, Min: 0.0000
[CIDEr] Mean: 0.4818, Max: 8.0511, Min: 0.0000
[ROUGE-L] Mean: 0.2636, Max: 1.0000, Min: 0.0000
[METEOR] Mean: 0.1058, Max: 1.0000, Min: 0.0000
Evaluate [19/32]; Batch [0/1]; Evaluating on iter: 12999; Iter time 261.13; Mem 70618.97MB

----------------------Evaluation-----------------------
INFO: iou@0.5 matched proposals: [712 / 1214],
[BLEU-1] Mean: 0.5626, Max: 1.0000, Min: 0.0006
[BLEU-2] Mean: 0.3753, Max: 0.8165, Min: 0.0000
[BLEU-3] Mean: 0.2223, Max: 0.6583, Min: 0.0000
[BLEU-4] Mean: 0.1339, Max: 0.5756, Min: 0.0000
[CIDEr] Mean: 0.0945, Max: 1.2465, Min: 0.0000
[ROUGE-L] Mean: 0.4495, Max: 0.8299, Min: 0.1843
[METEOR] Mean: 0.2157, Max: 0.5162, Min: 0.0783
Evaluate [19/32]; Batch [0/1]; Evaluating on iter: 12999; Iter time 262.18; Mem 70618.97MB

----------------------Evaluation-----------------------
INFO: iou@0.5 matched proposals: [1506 / 2068],
[BLEU-1] Mean: 0.6056, Max: 1.0000, Min: 0.0000
[BLEU-2] Mean: 0.4881, Max: 1.0000, Min: 0.0000
[BLEU-3] Mean: 0.3775, Max: 0.9410, Min: 0.0000
[BLEU-4] Mean: 0.2926, Max: 0.8654, Min: 0.0000
[CIDEr] Mean: 0.3024, Max: 3.1209, Min: 0.0000
[ROUGE-L] Mean: 0.4990, Max: 0.9412, Min: 0.1015
[METEOR] Mean: 0.2349, Max: 0.5416, Min: 0.0448

The training log is here.

It would be nice if the pre-trained checkpoints and pre-processed point clouds could be made available for download, to minimize the randomness.

@ch3cook-fdu
Contributor

ch3cook-fdu commented Mar 23, 2024

The actual batch size of our original configuration is 4 x 8 GPUs = 32 per iteration. To reproduce our results, we encourage you to train with the exact same config as provided.

Please track the training process by the number of iterations rather than the epoch number. In our experience, training LL3DA for only 13k iterations is far from convergence.

We are actively working on packing the pre-trained weights, please stay tuned.
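The arithmetic above can be sketched as follows (`grad_accum` is a hypothetical extra knob for single-GPU setups, not something this thread mentions):

```python
import math

def effective_batch_size(per_gpu_batch, num_gpus, grad_accum=1):
    """Samples consumed per optimizer iteration under data parallelism.
    grad_accum is an assumed extension for matching the effective batch
    on fewer GPUs via gradient accumulation."""
    return per_gpu_batch * num_gpus * grad_accum

def iterations_per_epoch(num_samples, eff_batch):
    """How many iterations one pass over the data takes."""
    return math.ceil(num_samples / eff_batch)

# The original config: 4 per GPU on 8 GPUs.
print(effective_batch_size(4, 8))  # 32

# A single GPU at batch 4 would need 8x gradient accumulation to match:
print(effective_batch_size(4, 1, grad_accum=8))  # 32
```

This also shows why iteration counts, not epoch numbers, are the comparable quantity: changing the effective batch size changes how many iterations an epoch contains.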

@kuaileqipaoshui
Author

> The actual batch size of our original configuration is 4 x 8 GPUs = 32 per iteration. To reproduce our results, we encourage you to train with the exact same config as provided.
>
> Please track the training process by the number of iterations rather than the epoch number. In our experience, training LL3DA for only 13k iterations is far from convergence.
>
> We are actively working on packing the pre-trained weights, please stay tuned.

When I use the actual batch size of the original configuration (4 x 8 GPUs = 32 per iteration), I find this in the training log:
Epoch [2/32]; Iter [11990/127936]; Loss 1.51; LR 9.79e-05; Iter time 0.46; ETA 14:48:34; Mem 18615.49MB
Loss in not finite. Skip this training step.
Loss in not finite. Skip this training step.
Loss in not finite. Skip this training step.
Loss in not finite. Skip this training step.
Loss in not finite. Skip this training step.
Loss in not finite. Skip this training step.
Loss in not finite. Skip this training step.
Loss in not finite. Skip this training step.
Epoch [3/32]; Iter [12000/127936]; Loss 1.51; LR 9.79e-05; Iter time 0.48; ETA 15:23:59; Mem 18615.49MB
What happened?

@ch3cook-fdu
Copy link
Contributor

ch3cook-fdu commented Mar 26, 2024

Because of mixed-precision training, the training process might not be entirely stable. As long as the model training continues, you can safely ignore this message.
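The guard producing that "Loss in not finite" log line can be sketched like this (a simplification; the actual loop also involves an AMP GradScaler, and the exact helper name here is hypothetical):

```python
import math

def maybe_step(loss, optimizer_step):
    """Skip the weight update when mixed-precision overflow makes the
    loss nan/inf, instead of corrupting the parameters. The message
    text matches the log quoted above (typo included)."""
    if not math.isfinite(loss):
        print("Loss in not finite. Skip this training step.")
        return False
    optimizer_step()
    return True
```

Occasional skipped steps are expected with fp16; they only become a problem when every step is skipped and the loss stops updating.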

ch3cook-fdu changed the title from "version" to "training details" on Mar 26, 2024