Hello! As described in the README, I am trying to run `!bash scripts/Retriever/pre_pipeline.sh` on Colab to obtain the LM-preferred and human-preferred documents for augmentation-adapted training.
I believe I have satisfied all the requirements, but I get the following error every time I run the script. Is there a known solution?
```
/content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever
torchrun --master_port 10874 --nproc_per_node 1 /content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/src/LM/Flan-T5/train_t0.py --model-config /content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/src/LM/Flan-T5/configs/model/t5_base_config.json --model-parallel-size 1 --batch-size 9 --dev-batch-size 9 --eval-batch-size 9 --save /content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/results/flan-t5-base/fp16/zs/marco_qa_msmarco_ra_ance --log-file /content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/results/flan-t5-base/fp16/zs/marco_qa_msmarco_ra_ance/log.txt --load /content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/checkpoints/flan-t5-base/t5-MP1/ --data-names marco_qa_msmarco_ra_ance --FiD --passage_num 10 --distributed-backend nccl --no-load-optim --lr-decay-style constant --weight-decay 1e-2 --clip-grad 1.0 --tokenizer-path /content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/src/LM/Flan-T5/vocab_en --checkpoint-activations --deepspeed-activation-checkpointing --fp16 --deepspeed --deepspeed_config /content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/src/LM/Flan-T5/configs/deepspeed/ds_fp16.json --do-eval --test-num 1000000 --seed 10 --eval-per-prompt
[2024-11-21 14:06:28,305] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-11-21 14:06:32.732308: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1732197992.753346 14546 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1732197992.759730 14546 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING: No training data specified
using world size: 1 and model-parallel size: 1
using dynamic loss scaling
[2024-11-21 14:06:36,428] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-21 14:06:36,429] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
initializing model parallel with size 1
[2024-11-21 14:06:36,435] [INFO] [config.py:733:init] Config mesh_device None world_size = 1
[2024-11-21 14:06:36,436] [INFO] [checkpointing.py:1002:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2024-11-21 14:06:36,436] [INFO] [checkpointing.py:1123:configure] Activation Checkpointing Information
[2024-11-21 14:06:36,436] [INFO] [checkpointing.py:1124:configure] ----Partition Activations False, CPU CHECKPOINTING False
[2024-11-21 14:06:36,436] [INFO] [checkpointing.py:1125:configure] ----contiguous Memory Checkpointing False with 24 total layers
[2024-11-21 14:06:36,436] [INFO] [checkpointing.py:1126:configure] ----Synchronization False
[2024-11-21 14:06:36,436] [INFO] [checkpointing.py:1127:configure] ----Profiling time in checkpointing False
Training Enc-Dec model
arguments:
model_config ................. /content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/src/LM/Flan-T5/configs/model/t5_base_config.json
cpu_optimizer ................ False
cpu_torch_adam ............... False
fp16 ......................... True
fp32_embedding ............... False
fp32_layernorm ............... False
fp32_tokentypes .............. False
fp32_allreduce ............... False
hysteresis ................... 2
loss_scale ................... None
loss_scale_window ............ 1000
min_scale .................... 1
do_train ..................... False
do_valid ..................... False
do_valid_and_eval ............ False
do_eval ...................... True
do_infer ..................... False
train_ratio .................. 1.0
train_num .................... -1
dev_ratio .................... 1.0
dev_num ...................... -1
test_ratio ................... 1.0
test_num ..................... 1000000
epochs ....................... 1
batch_size ................... 9
dev_batch_size ............... 9
gradient_accumulation_steps .. 1
weight_decay ................. 0.01
checkpoint_activations ....... True
checkpoint_num_layers ........ 1
num_checkpoints .............. 24
deepspeed_activation_checkpointing True
clip_grad .................... 1.0
train_iters .................. 1000000
log_interval ................. 100
max_save ..................... -1
seed ......................... 10
few_data_num ................. None
few_data_names ............... None
data_aug ..................... None
rand_real_label .............. False
rand_pseudo_label ............ False
lr_decay_iters ............... None
lr_decay_style ............... constant
lr ........................... 0.0001
warmup ....................... 0.0
warmup_iter .................. 0
save ......................... /content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/results/flan-t5-base/fp16/zs/marco_qa_msmarco_ra_ance
save_interval ................ 5000
no_save_optim ................ False
load ......................... /content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/checkpoints/flan-t5-base/t5-MP1/
load_oprimizer_states ........ False
load_lr_scheduler_states ..... False
no_load_optim ................ True
log_file ..................... /content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/results/flan-t5-base/fp16/zs/marco_qa_msmarco_ra_ance/log.txt
distributed_backend .......... nccl
local_rank ................... 0
eval_batch_size .............. 9
eval_iters ................... 100
eval_interval ................ 1000
eval_per_prompt .............. True
no_norm_cand_loss ............ False
model_parallel_size .......... 1
data_path .................... None
data_ext ..................... .json
data_name .................... None
data_names ................... marco_qa_msmarco_ra_ance
data_prefix .................. None
num_workers .................. 2
tokenizer_path ............... /content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/src/LM/Flan-T5/vocab_en
enc_seq_length ............... 512
dec_seq_length ............... 256
pad_token ....................
FiD .......................... True
passage_num .................. 10
load_prompt .................. None
prompt_tune .................. False
prompt_config ................ None
save_prompt_only ............. False
sampling ..................... False
temperature .................. 1.2
top_p ........................ 0.9
top_k ........................ 0
max_generation_length ........ 64
min_generation_length ........ 0
num_beams .................... 1
no_repeat_ngram_size ......... 0
repetition_penalty ........... 1
early_stopping ............... False
length_penalty ............... 0
rule_path .................... None
flan_sample .................. False
flan_sample_max .............. 1000000
debug_option ................. -1
shuff_cand_idx ............... False
deepspeed .................... True
deepspeed_config ............. /content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/src/LM/Flan-T5/configs/deepspeed/ds_fp16.json
deepscale .................... False
deepscale_config ............. None
cuda ......................... True
rank ......................... 0
world_size ................... 1
dynamic_loss_scale ........... True
[2024-11-21 14:06:36,464] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2728 and data parallel seed: 10
All Data group: ['marco_qa_msmarco_ra_ance'] All Data: ['marco_qa_msmarco_ra_ance']
Total train epochs 1 | Total train iters 10 |
building Enc-Dec model ...
number of parameters on model parallel rank 0: 247577856
DeepSpeed is enabled.
[2024-11-21 14:06:46,024] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.15.4, git-hash=unknown, git-branch=unknown
[rank0]: Traceback (most recent call last):
[rank0]: File "/content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/src/LM/Flan-T5/train_t0.py", line 1051, in
[rank0]: main()
[rank0]: File "/content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/src/LM/Flan-T5/train_t0.py", line 960, in main
[rank0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(
[rank0]: File "/content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/src/LM/Flan-T5/utils.py", line 216, in setup_model_and_optimizer
[rank0]: model, optimizer, _, lr_scheduler = deepspeed.initialize(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/init.py", line 175, in initialize
[rank0]: assert config is None, "Not sure how to proceed, we were given deepspeed configs in the deepspeed arguments and deepspeed.initialize() function call"
[rank0]: AssertionError: Not sure how to proceed, we were given deepspeed configs in the deepspeed arguments and deepspeed.initialize() function call
[rank0]:[W1121 14:06:47.435407457 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
E1121 14:06:49.215000 14534 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 14546) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/src/LM/Flan-T5/train_t0.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-21_14:06:49
  host      : 72200c38d88e
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 14546)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
/content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/tools/merge_qrels.py:15: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
scores = torch.load(args.scores_path)
Traceback (most recent call last):
File "/content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/tools/merge_qrels.py", line 15, in
scores = torch.load(args.scores_path)
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1319, in load
with _open_file_like(f, "rb") as opened_file:
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 659, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 640, in init
super().init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'results/flan-t5-base/fp16/zs/marco_qa_msmarco_ra_ance/stored_FiD_scores.pt'
/content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/tools/build_hn.py:55: DeprecationWarning: Seeding based on hashing is deprecated
since Python 3.9 and will be removed in a subsequent version. The only
supported seed types are: None, int, float, str, bytes, and bytearray.
random.seed(datetime.now())
Traceback (most recent call last):
File "/content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/tools/build_hn.py", line 76, in
qrel = TrainPreProcessor.read_qrel(args.qrels)
File "/content/drive/MyDrive/DL2-FinalProject/Augmentation-Adapted-Retriever/tools/utils.py", line 51, in read_qrel
with open(relevance_file, encoding='utf8') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/msmarco/qrels.marco_qa.tsv'
scripts/Retriever/pre_pipeline.sh: line 18: data/msmarco/t5-ance/marco_qa/train.hn.jsonl: No such file or directory
scripts/Retriever/pre_pipeline.sh: line 19: data/msmarco/t5-ance/marco_qa/val.hn.jsonl: No such file or directory
scripts/Retriever/pre_pipeline.sh: line 20: data/msmarco/t5-ance/marco_qa/train.new.hn.jsonl: No such file or directory
```
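From what I can tell, the first failure is the root cause: the assertion is raised inside `deepspeed.initialize()` (in `deepspeed/__init__.py`), which refuses to proceed when a DeepSpeed config is supplied twice — once through the `--deepspeed_config` flag (so `args.deepspeed_config` is set) and once through the `config`/`config_params` keyword that `setup_model_and_optimizer()` in `utils.py` apparently passes as well. Recent DeepSpeed releases assert on this instead of silently picking one. Below is a minimal sketch of one possible workaround; the surrounding function signature and the commented-out keyword are my guesses at what `utils.py` does, not the repo's exact source:

```python
import deepspeed

def setup_model_and_optimizer(args, model, optimizer, lr_scheduler):
    # Hand DeepSpeed the config exactly once. Since train_t0.py is
    # launched with --deepspeed_config, args.deepspeed_config already
    # points at ds_fp16.json, so an explicit config keyword would be
    # a duplicate and trips the assert in newer DeepSpeed versions.
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        args=args,  # args.deepspeed_config supplies the JSON path
        lr_scheduler=lr_scheduler,
        dist_init_required=False,
        # config=config_params,  # <- removing this duplicate avoids the assert
    )
    return model, optimizer, lr_scheduler
```

The inverse should also work (keep the keyword and set `args.deepspeed_config = None` before the call), as would installing the DeepSpeed version the repo was developed against. The remaining errors look like fallout rather than independent bugs: `merge_qrels.py` cannot load `stored_FiD_scores.pt` because the crashed evaluation never wrote it, and the `build_hn.py` failure plus the redirect errors on lines 18-20 of `pre_pipeline.sh` suggest that `data/msmarco/qrels.marco_qa.tsv` and the `data/msmarco/t5-ance/marco_qa/` output directory also need to exist before this stage runs.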