something wrong with "answer_vectors = default_collate(_answer_vectors)" in ansemb/dataset/data_utils.py #3

Closed
hackerchenzhuo opened this issue Aug 23, 2020 · 11 comments

Comments

@hackerchenzhuo

hackerchenzhuo commented Aug 23, 2020

train E000: 0% 0/3467 [00:02<?, ?it/s]
Traceback (most recent call last):
File "train_vqa_embedding.py", line 266, in <module>
main(args)
File "train_vqa_embedding.py", line 239, in main
train(context_net, answer_net, train_loader, optimizer, tracker, args, prefix='train', epoch=i)
File "train_vqa_embedding.py", line 108, in train
for v, q, avocab, a, labels, idx, q_len in tq:

File "/home/anaconda3/lib/python3.6/site-packages/tqdm/std.py", line 1130, in __iter__
for obj in iterable:
File "/home/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/home/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/home/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/home/anaconda3/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/anaconda3/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/anaconda3/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/home/answer_embedding-master/ansemb/dataset/data_utils.py", line 95, in collate_fn
answer_vectors = default_collate(_answer_vectors)

File "/home/anaconda3/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
TypeError: stack(): argument 'tensors' (position 1) must be tuple of Tensors, not Tensor

@hackerchenzhuo
Author

Hi :)
It seems that this error is caused by an update to torch.stack(). How can I fix it so that the code runs?

@hexiang-hu
Owner

Hi,

It seems that the variable ``batch'' here is not a tuple (or a list) of tensors.

I think the easiest way to debug this is to set the number of workers to 0 and then step through line 95 of collate_fn in data_utils.py to see what kind of data structure ``_answer_vectors'' is.

Best,
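
A minimal sketch of the inspection suggested above, assuming a small helper named `describe` (hypothetical, not part of the repository) that is called just before the `default_collate` line. Running the loader with `num_workers=0` keeps the call in the main process, so the printed summary and any traceback are not wrapped by the worker. The tensor below is only a stand-in for the real `_answer_vectors`.

```python
import torch


def describe(name, obj):
    """Return a one-line summary of an object's type and, for tensors, its shape."""
    if isinstance(obj, torch.Tensor):
        return f"{name}: Tensor shape={tuple(obj.shape)} dtype={obj.dtype}"
    if isinstance(obj, (list, tuple)):
        inner = type(obj[0]).__name__ if len(obj) else "empty"
        return f"{name}: {type(obj).__name__} of {len(obj)} x {inner}"
    return f"{name}: {type(obj).__name__}"


# Stand-in for the value built inside collate_fn (the shape here is an assumption).
_answer_vectors = torch.zeros(4, 6)
summary = describe("_answer_vectors", _answer_vectors)
print(summary)
```

If the summary reports a single Tensor rather than a list or tuple of tensors, the value is already batched, which is the situation `torch.stack` complains about in the traceback.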

@hackerchenzhuo
Author

hackerchenzhuo commented Aug 25, 2020

Thank you so much.
After setting the number of workers to 0, some errors disappear, but others remain the same, for example:

train E000: 0% 0/3467 [00:02<?, ?it/s]
Traceback (most recent call last):
File "train_vqa_embedding.py", line 266, in <module>
main(args)
File "train_vqa_embedding.py", line 239, in main
train(context_net, answer_net, train_loader, optimizer, tracker, args, prefix='train', epoch=i)
File "train_vqa_embedding.py", line 108, in train
for v, q, avocab, a, labels, idx, q_len in tq:
File "/home/chenzhuo/anaconda3/lib/python3.6/site-packages/tqdm/std.py", line 1130, in __iter__
for obj in iterable:
File "/home/chenzhuo/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/home/chenzhuo/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/chenzhuo/anaconda3/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/home/chenzhuo/answer_embedding-master/ansemb/dataset/data_utils.py", line 95, in collate_fn
answer_vectors = default_collate(_answer_vectors)
File "/home/chenzhuo/anaconda3/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
TypeError: stack(): argument 'tensors' (position 1) must be tuple of Tensors, not Tensor

I then added
print(" data structure of _answer_vectors:", type(_answer_vectors))
just before
answer_vectors = default_collate(_answer_vectors)
to see the data structure of ``_answer_vectors''.

It prints: data structure of _answer_vectors: <class 'torch.Tensor'>

@hackerchenzhuo
Author

hackerchenzhuo commented Aug 25, 2020

This is the output printed before the error:

@xxx:~/answer_embedding-master$ python train_vqa_embedding.py --gpu_id 1
{'gpu_id': 1, 'finetune': False, 'batch_size': 128, 'max_negative_answer': 12000, 'answer_batch_size': 3000, 'loss_temperature': 0.01, 'pretrained_model': None, 'context_embedding': 'SAN', 'answer_embedding': 'BoW', 'name': None}
{'cache_path': '/home/answer_embedding-master/.cache', 'output_path': '/home/answer_embedding-master/outputs', 'embedding_size': 1024, 'seed': 1618, 'question_vocab_path': '/home/answer_embedding-master/data/question.vocab.json', 'image_size': 448, 'output_size': 14, 'preprocess_batch_size': 100, 'output_features': 2048, 'central_fraction': 0.875, 'TRAIN': {'epochs': 50, 'batch_size': 128, 'base_lr': 0.001, 'lr_decay': 15, 'data_workers': 0, 'answer_batch_size': 3000, 'max_negative_answer': 8000}, 'TEST': {'max_answer_index': 3000}, 'VQA2': {'qa_path': '/home/answer_embedding-master/data/vqa2', 'feature_path': '/home/answer_embedding-master/features/vqa-resnet-14x14.h5', 'answer_vocab_path': '/home/answer_embedding-master/data/answer.vocab.vqa.json', 'train_img_path': '/home/answer_embedding-master/data/vqa2/images/train2014', 'val_img_path': '/home/answer_embedding-master/data/vqa2/images/val2014', 'test_img_path': '/home/answer_embedding-master/data/vqa2/images/test-dev2015', 'train_qa': 'train2014', 'val_qa': 'val2014', 'test_qa': 'test-dev2015', 'task': 'OpenEnded', 'dataset': 'mscoco'}, 'VG': {'qa_path': '/home/answer_embedding-master/data/vg', 'feature_path': '/home/answer_embedding-master/features/vg-resnet-14x14.h5', 'answer_vocab_path': '/home/answer_embedding-master/data/answer.vocab.vg.json', 'train_qa': 'VG_train_decoys.json', 'val_qa': 'VG_val_decoys.json', 'test_qa': 'VG_test_decoys.json', 'img_path': '/home/answer_embedding-master/data/vg/images'}, 'Visual7W': {'qa_path': '/home/answer_embedding-master/data/v7w', 'feature_path': '/home/answer_embedding-master/features/vg-resnet-14x14.h5', 'answer_vocab_path': '/home/answer_embedding-master/data/answer.vocab.v7w.json', 'train_qa': 'v7w_train_questions.json', 'val_qa': 'v7w_val_questions.json', 'test_qa': 'v7w_test_questions.json', 'train_v7w_decoys': 'v7w_train_decoys.json', 'val_v7w_decoys': 'v7w_val_decoys.json', 'test_v7w_decoys': 'v7w_test_decoys.json', 'img_path': 
'/home/answer_embedding-master/data/v7w/images'}}
Output data would be saved to /home/answer_embedding-master/outputs/SAN_BoW_vqa_batch_softmax_embedding_2020-08-25_12:45:17.pth

  • Loading vectors to .vector_cache/glove.840B.300d.txt.pt
    import answer vocabulary from: /home/answer_embedding-master/data/answer.vocab.vqa.json
    extracting answers...
    loading cache from: /home/answer_embedding-master/.cache/v2_OpenEnded_mscoco_train2014_questions.json.v2_mscoco_train2014_annotations.json.pt
    import answer vocabulary from: /home/answer_embedding-master/data/answer.vocab.vqa.json
    extracting answers...
    loading cache from: /home/answer_embedding-master/.cache/v2_OpenEnded_mscoco_val2014_questions.json.v2_mscoco_val2014_annotations.json.pt
    /home/answer_embedding-master/ansemb/models/layers.py:101: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
    init.xavier_uniform(w)
    /home/answer_embedding-master/ansemb/models/embedding.py:48: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
    init.xavier_uniform(m.weight)
    Context Model:
    StackedAttentionEmbedding(
    (embedding): Embedding(15419, 300, padding_idx=0)
    (drop): Dropout(p=0.5, inplace=False)
    (text): Seq2SeqRNN(
    (rnn): LSTM(300, 512, batch_first=True, bidirectional=True)
    )
    (attention): Attention(
    (v_conv): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (q_lin): Linear(in_features=1024, out_features=512, bias=True)
    (x_conv): Conv2d(512, 2, kernel_size=(1, 1), stride=(1, 1))
    (drop): Dropout(p=0.5, inplace=False)
    (relu): LeakyReLU(negative_slope=0.01, inplace=True)
    )
    (mlp): GroupMLP(
    (conv1): Conv1d(5120, 4096, kernel_size=(1,), stride=(1,))
    (drop): Dropout(p=0.5, inplace=False)
    (relu): LeakyReLU(negative_slope=0.01)
    (conv2): Conv1d(4096, 1024, kernel_size=(1,), stride=(1,), groups=64)
    )
    )
    Answer Model:
    MLPEmbedding(
    (mlp): GroupMLP(
    (conv1): Conv1d(300, 4096, kernel_size=(1,), stride=(1,))
    (drop): Dropout(p=0.5, inplace=False)
    (relu): LeakyReLU(negative_slope=0.01)
    (conv2): Conv1d(4096, 1024, kernel_size=(1,), stride=(1,), groups=64)
    )
    )
    train E000: 0% 0/3467 [00:00<?, ?it/s] data structure of _answer_vectors: <class 'torch.Tensor'>
    train E000: 0% 0/3467 [00:00<?, ?it/s]
    Traceback (most recent call last):
    ....error text ...

For the image feature file vqa-resnet-14x14.h5, I followed the issue How to generate "vqa-resnet-14x14.h5"?

I modified preprocess-images.py from the pytorch-vqa repo. I don't know whether that is what causes this error.

@hexiang-hu
Owner

Hi,

I think this error is very likely due to a change in PyTorch version. Unfortunately, I do not have the machine and data to re-run this experiment at the moment.

Can you check whether it works if you change line 95, ``answer_vectors = default_collate(_answer_vectors)'', to

answer_vectors = default_collate((_answer_vectors,))

If not, can you check the shape of ``_answer_vectors''?
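
A short sketch of what the tuple workaround does, using a small stand-in tensor rather than the real data. Wrapping an already-batched tensor in a one-element tuple lets `default_collate` stack it, but the stack adds a leading dimension of size 1, so downstream code sees one extra axis. The shapes are assumptions chosen for illustration.

```python
import torch
from torch.utils.data.dataloader import default_collate

# Stand-in for an already-batched tensor such as _answer_vectors.
batch = torch.zeros(4, 6)

# default_collate stacks the elements of the tuple along a new first axis.
wrapped = default_collate((batch,))
print(wrapped.shape)  # one extra leading dimension compared with `batch`

# Indexing the first element recovers the original batched tensor.
unwrapped = wrapped[0]
```

Whether the extra axis should be dropped right after collating or handled later depends on how the rest of the training loop consumes the tensor.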

@hackerchenzhuo
Author

hackerchenzhuo commented Aug 25, 2020

data structure of _answer_vectors: <class 'torch.Tensor'>
shape of _answer_vectors: torch.Size([128, 3000])

I changed line 95 from ``answer_vectors = default_collate(_answer_vectors)'' to
answer_vectors = default_collate((_answer_vectors,))
The original error is gone, but a new one is raised:

Traceback (most recent call last):
  File "train_vqa_embedding.py", line 266, in <module>
    main(args)
  File "train_vqa_embedding.py", line 239, in main
    train(context_net, answer_net, train_loader, optimizer, tracker, args, prefix='train', epoch=i)
  File "train_vqa_embedding.py", line 115, in train
    answer_var, answer_len = loader.dataset._get_answer_vectors(avocab)
  File "/home/answer_embedding-master/ansemb/dataset/base.py", line 93, in _get_answer_vectors
    vector[idx, :] = self._encode_answer_vector(self.index_to_answer[answer_id])
KeyError: tensor(0)

@hexiang-hu
Owner

hexiang-hu commented Aug 25, 2020

This is because ``answer_id'' is a tensor rather than an int. I think you can change it to:

vector[idx, :] = self._encode_answer_vector(self.index_to_answer[answer_id.item()])
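
The KeyError can be reproduced in isolation. In this sketch, `index_to_answer` is a hypothetical two-entry dictionary standing in for the dataset's real mapping: a zero-dimensional tensor does not hash like the Python int it holds, so the lookup misses until `.item()` converts it.

```python
import torch

# Hypothetical stand-in for the dataset's index-to-answer mapping.
index_to_answer = {0: "yes", 1: "no"}
answer_id = torch.tensor(0)

# Looking up with the tensor itself fails, because the key 0 (an int)
# and tensor(0) are different dictionary keys.
try:
    index_to_answer[answer_id]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# .item() returns the plain Python number, which matches the int key.
answer = index_to_answer[answer_id.item()]
print(lookup_failed, answer)
```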

@hackerchenzhuo
Author

train E000:   0% 0/3467 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "train_vqa_embedding.py", line 266, in <module>
    main(args)
  File "train_vqa_embedding.py", line 239, in main
    train(context_net, answer_net, train_loader, optimizer, tracker, args, prefix='train', epoch=i)
  File "train_vqa_embedding.py", line 133, in train
    acc = utils.batch_accuracy(predicts.data, a.data).cpu()
  File "/home/answer_embedding-master/ansemb/utils.py", line 27, in batch_accuracy
    agreeing = true.gather(dim=1, index=predicted_index)
RuntimeError: invalid argument 4: Index tensor must have same dimensions as input tensor at /pytorch/aten/src/THC/generic/THCTensorScatterGather.cu:16

It seems that an error is raised while computing acc.

@hexiang-hu
Owner

This one is also due to an API change in PyTorch.

You can check the shapes of ``true'' and ``predicted_index'' to make sure they have the same number of dimensions.

Please refer to https://pytorch.org/docs/stable/generated/torch.gather.html.

Looking at some examples of how to use ``torch.gather'' on Stack Overflow would also help you debug.

@hackerchenzhuo
Author

Yes, they don't have the same dimensionality:

shape of true: torch.Size([1, 128, 3000])
type of true: <class 'torch.Tensor'>
shape of predicted_index: torch.Size([128, 1])
type of predicted_index: <class 'torch.Tensor'>

@hackerchenzhuo
Author

It works after I change
agreeing = true.gather(dim=1, index=predicted_index)
into
agreeing = true[0].gather(dim=1, index=predicted_index)
and fix a few other small problems.

Thank you so much again! :)
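
For readers hitting the same error, here is a small sketch of why the `true[0]` change works, with toy shapes standing in for the reported `[1, 128, 3000]` and `[128, 1]`. `torch.gather` requires the index tensor to have the same number of dimensions as the input, so a 3-D `true` with a 2-D `predicted_index` is rejected, while the 2-D slice `true[0]` is accepted. The values below are made up for illustration.

```python
import torch

# Toy labels with an extra leading axis, shape (1, 4, 5).
true = torch.tensor([[[0, 0, 1, 0, 0],
                      [1, 0, 0, 0, 0],
                      [0, 0, 0, 1, 0],
                      [0, 1, 0, 0, 0]]], dtype=torch.float)

# One predicted answer index per example, shape (4, 1).
predicted_index = torch.tensor([[2], [0], [1], [1]])

# A 3-D input with a 2-D index is rejected by gather.
try:
    true.gather(dim=1, index=predicted_index)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

# Dropping the leading axis makes both tensors 2-D, so gather picks
# the label score at each predicted index.
agreeing = true[0].gather(dim=1, index=predicted_index)
print(mismatch_raised, agreeing.squeeze(1).tolist())
```

Removing the extra axis where it is introduced (at collate time) would be an alternative to indexing with `[0]` here, but that is a design choice the thread does not settle.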
