
Multi-gpu training with slurm times out #1832

Open
nightingal3 opened this issue Nov 19, 2024 · 6 comments
Labels: bug (Something isn't working)

Comments


nightingal3 commented Nov 19, 2024

Bug description

I was transferring some checkpoints from a cluster that didn't use Slurm to one that does. I trained the checkpoint using multiple GPUs/nodes, and I'm able to load it and resume training in an interactive job. However, when I submit the job with sbatch, it times out after a while.

I've seen this guide: https://lightning.ai/docs/fabric/2.4.0/guide/multi_node/slurm.html and added srun to my submission script. However, even though all 4 devices appear to be initialized, training still gets stuck before the first step and eventually times out.

A debug log and my submission script are included below. My sbatch script is a bit different in that it runs another shell script, which does some setup and then calls litgpt pretrain <...>, but I'm not sure whether this is the issue.

I also tried explicitly passing the number of nodes, devices, etc. to the Fabric initialization, like in the example in pretrain.py, but it didn't make a difference:

fabric = L.Fabric(
    accelerator="gpu", devices=4, num_nodes=1, strategy=strategy, precision=precision, loggers=[logger]
)
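
As a sanity check (a sketch, not part of my original script), one way to confirm that srun actually spawns one task per GPU with distinct ranks before litgpt starts:

# Sketch: print the Slurm rank variables from each task; with --ntasks-per-node=4
# this should produce four lines with procid 0-3 on the same host.
srun bash -c 'echo "host=$(hostname) procid=$SLURM_PROCID localid=$SLURM_LOCALID ntasks=$SLURM_NTASKS"'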

Details:
My script:

#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=slurm_logs/%j.out
#SBATCH --time=2-00:00:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:A6000:4
#SBATCH --ntasks-per-node=4
#SBATCH --mem=50G
#SBATCH --partition=general
#SBATCH --mail-user=<email>
#SBATCH --mail-type=ALL

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_DEBUG=INFO

# Check if training type is provided
if [ $# -eq 0 ]; then
    echo "Usage: $0 <sequential|mixed> [training args...]"
    exit 1
fi

# Get the training type and remove it from args
TRAIN_TYPE=$1
shift

case $TRAIN_TYPE in
    sequential)
        srun ./pretrain_then_finetune.sh "$@"
        ;;
    mixed)
        srun ./mixed_pretraining_fixed.sh "$@"
        ;;
    *)
        echo "Invalid training type. Use 'sequential' or 'mixed'"
        exit 1
        ;;
esac
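
For comparison, the linked Lightning SLURM guide has srun launch the training command directly rather than a wrapper script; a minimal sketch of that (litgpt arguments elided) would be:

# Sketch: srun starts litgpt itself, so each of the 4 tasks becomes one rank and
# Fabric picks up the world size and ranks from the Slurm environment.
srun litgpt pretrain "$model_name" <training args...>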

Example error from the debug log:

[...previous stuff]
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
[rank: 0] Seed set to 42
[rank: 0] Seed set to 42
[rank: 0] Seed set to 42
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

[rank: 0] Seed set to 42
babel-0-31:2324237:2324237 [0] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324237:2324237 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324237:2324237 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324237:2324237 [0] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324237:2324237 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.21.5+cuda12.4
/home/mengyan3/.local/lib/python3.9/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
babel-0-31:2324240:2324240 [1] NCCL INFO cudaDriverVersion 12060
babel-0-31:2324240:2324240 [1] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324240:2324240 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324240:2324240 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324240:2324240 [1] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324240:2324524 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs8:172.16.1.17<0>
babel-0-31:2324240:2324524 [1] NCCL INFO Using non-device net plugin version 0
babel-0-31:2324240:2324524 [1] NCCL INFO Using network IB
babel-0-31:2324240:2324524 [1] NCCL INFO ncclCommInitRank comm 0x555d6dc227b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 81000 commId 0xb886029f29f1a815 - Init START
babel-0-31:2324240:2324524 [1] NCCL INFO Setting affinity for GPU 1 to 2b,00000000,00000000,00000000,0000002b,00000000,00000000
babel-0-31:2324240:2324524 [1] NCCL INFO NVLS multicast support is not available on dev 1
babel-0-31:2324240:2324524 [1] NCCL INFO comm 0x555d6dc227b0 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
babel-0-31:2324240:2324524 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
babel-0-31:2324240:2324524 [1] NCCL INFO P2P Chunksize set to 524288
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Connected all rings
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Connected all trees
babel-0-31:2324240:2324524 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
babel-0-31:2324240:2324524 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-0-31:2324240:2324524 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
babel-0-31:2324240:2324524 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
babel-0-31:2324240:2324524 [1] NCCL INFO ncclCommInitRank comm 0x555d6dc227b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 81000 commId 0xb886029f29f1a815 - Init COMPLETE
[rank1]:[E1119 13:21:12.512786305 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
babel-0-31:2324239:2324239 [3] NCCL INFO cudaDriverVersion 12060
babel-0-31:2324239:2324239 [3] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324239:2324239 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324239:2324239 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324239:2324239 [3] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324239:2324522 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs8:172.16.1.17<0>
babel-0-31:2324239:2324522 [3] NCCL INFO Using non-device net plugin version 0
babel-0-31:2324239:2324522 [3] NCCL INFO Using network IB
babel-0-31:2324239:2324522 [3] NCCL INFO ncclCommInitRank comm 0x5584021f39f0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId e1000 commId 0xb886029f29f1a815 - Init START
babel-0-31:2324239:2324522 [3] NCCL INFO Setting affinity for GPU 3 to 2b,00000000,00000000,00000000,0000002b,00000000,00000000
babel-0-31:2324239:2324522 [3] NCCL INFO NVLS multicast support is not available on dev 3
babel-0-31:2324239:2324522 [3] NCCL INFO comm 0x5584021f39f0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
babel-0-31:2324239:2324522 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
babel-0-31:2324239:2324522 [3] NCCL INFO P2P Chunksize set to 524288
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Connected all rings
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Connected all trees
babel-0-31:2324239:2324522 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
babel-0-31:2324239:2324522 [3] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-0-31:2324239:2324522 [3] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
babel-0-31:2324239:2324522 [3] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
babel-0-31:2324239:2324522 [3] NCCL INFO ncclCommInitRank comm 0x5584021f39f0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId e1000 commId 0xb886029f29f1a815 - Init COMPLETE
[rank3]:[E1119 13:21:12.512781555 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
babel-0-31:2324238:2324238 [2] NCCL INFO cudaDriverVersion 12060
babel-0-31:2324238:2324238 [2] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324238:2324238 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324238:2324238 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324238:2324238 [2] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324238:2324523 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs8:172.16.1.17<0>
babel-0-31:2324238:2324523 [2] NCCL INFO Using non-device net plugin version 0
babel-0-31:2324238:2324523 [2] NCCL INFO Using network IB
babel-0-31:2324238:2324523 [2] NCCL INFO ncclCommInitRank comm 0x55880160e670 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId a1000 commId 0xb886029f29f1a815 - Init START
babel-0-31:2324238:2324523 [2] NCCL INFO Setting affinity for GPU 2 to 2b,00000000,00000000,00000000,0000002b,00000000,00000000
babel-0-31:2324238:2324523 [2] NCCL INFO NVLS multicast support is not available on dev 2
babel-0-31:2324238:2324523 [2] NCCL INFO comm 0x55880160e670 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
babel-0-31:2324238:2324523 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
babel-0-31:2324238:2324523 [2] NCCL INFO P2P Chunksize set to 524288
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Connected all rings
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Connected all trees
babel-0-31:2324238:2324523 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
babel-0-31:2324238:2324523 [2] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-0-31:2324238:2324523 [2] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
babel-0-31:2324238:2324523 [2] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
babel-0-31:2324238:2324523 [2] NCCL INFO ncclCommInitRank comm 0x55880160e670 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId a1000 commId 0xb886029f29f1a815 - Init COMPLETE
[rank2]:[E1119 13:21:12.525244336 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
[rank1]:[E1119 13:21:13.938073877 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1119 13:21:13.938095107 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1119 13:21:13.938100947 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1119 13:21:13.938104817 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1119 13:21:13.938073737 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1119 13:21:13.938094977 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1119 13:21:13.938100577 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1119 13:21:13.938104557 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E1119 13:21:13.938073817 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1119 13:21:13.938094667 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1119 13:21:13.938100907 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1119 13:21:13.938104757 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1119 13:21:13.092845528 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fdf89a24446 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fdf8ad37772 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fdf8ad3ebb3 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fdf8ad4061d in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fdfd36cd5c0 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x89c02 (0x7fdfe2c89c02 in /lib64/libc.so.6)
frame #6: <unknown function> + 0x10ec40 (0x7fdfe2d0ec40 in /lib64/libc.so.6)

[rank1]:[E1119 13:21:13.092997168 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b4e1d4446 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f3b4f4e7772 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b4f4eebb3 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f3b4f4f061d in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f3b97e7d5c0 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x89c02 (0x7f3ba7489c02 in /lib64/libc.so.6)
frame #6: <unknown function> + 0x10ec40 (0x7f3ba750ec40 in /lib64/libc.so.6)

[rank3]:[E1119 13:21:13.092988978 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe0c139c446 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fe0c26af772 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fe0c26b6bb3 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fe0c26b861d in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fe10b0455c0 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x89c02 (0x7fe11a689c02 in /lib64/libc.so.6)
frame #6: <unknown function> + 0x10ec40 (0x7fe11a70ec40 in /lib64/libc.so.6)

/data/tir/projects/tir3/users/mengyan3/all_in_one_pretraining/./pretrain_then_finetune.sh: line 184: 2324239 Aborted                 (core dumped) litgpt pretrain $model_name --resume "${checkpoint_dir}/step${step}/lit_model.pth" --tokenizer_dir "${checkpoint_dir}/step${step}" --data FineWebDataset --data.data_path $pretraining_data_dir --data.val_data_path /data/datasets/hf_cache/data/fineweb/sample-350BT/val/0 --data.num_workers $SLURM_GPUS_ON_NODE --train.micro_batch_size $micro_batch_size --train.max_seq_len $max_seq_len --train.min_lr 1e-6 --train.max_iters ${max_iters} --train.max_additional_steps $max_additional_steps --train.save_interval 500 --train.log_interval $log_interval --train.lr_warmup_fraction 0.01 --train.lr_scheduler $lr_scheduler --eval.interval 1000 --out_dir $out_dir --logs_dir $out_dir --logger_name tensorboard
/data/tir/projects/tir3/users/mengyan3/all_in_one_pretraining/./pretrain_then_finetune.sh: line 184: 2324238 Aborted                 (core dumped) litgpt pretrain $model_name --resume "${checkpoint_dir}/step${step}/lit_model.pth" --tokenizer_dir "${checkpoint_dir}/step${step}" --data FineWebDataset --data.data_path $pretraining_data_dir --data.val_data_path /data/datasets/hf_cache/data/fineweb/sample-350BT/val/0 --data.num_workers $SLURM_GPUS_ON_NODE --train.micro_batch_size $micro_batch_size --train.max_seq_len $max_seq_len --train.min_lr 1e-6 --train.max_iters ${max_iters} --train.max_additional_steps $max_additional_steps --train.save_interval 500 --train.log_interval $log_interval --train.lr_warmup_fraction 0.01 --train.lr_scheduler $lr_scheduler --eval.interval 1000 --out_dir $out_dir --logs_dir $out_dir --logger_name tensorboard
/data/tir/projects/tir3/users/mengyan3/all_in_one_pretraining/./pretrain_then_finetune.sh: line 184: 2324240 Aborted                 (core dumped) litgpt pretrain $model_name --resume "${checkpoint_dir}/step${step}/lit_model.pth" --tokenizer_dir "${checkpoint_dir}/step${step}" --data FineWebDataset --data.data_path $pretraining_data_dir --data.val_data_path /data/datasets/hf_cache/data/fineweb/sample-350BT/val/0 --data.num_workers $SLURM_GPUS_ON_NODE --train.micro_batch_size $micro_batch_size --train.max_seq_len $max_seq_len --train.min_lr 1e-6 --train.max_iters ${max_iters} --train.max_additional_steps $max_additional_steps --train.save_interval 500 --train.log_interval $log_interval --train.lr_warmup_fraction 0.01 --train.lr_scheduler $lr_scheduler --eval.interval 1000 --out_dir $out_dir --logs_dir $out_dir --logger_name tensorboard
srun: First task exited 60s ago
srun: StepId=3178163.0 task 0: running
srun: StepId=3178163.0 tasks 1-3: exited
srun: Terminating StepId=3178163.0
slurmstepd: error: *** STEP 3178163.0 ON babel-0-31 CANCELLED AT 2024-11-19T13:24:07 ***
srun: Job step aborted: Waiting up to 122 seconds for job step to finish.
slurmstepd: error: --task-epilog failed status=9

Node information:

=== Slurm Environment ===
SLURM_NTASKS: 4
SLURM_PROCID: 0
SLURM_LOCALID: 0
SLURM_JOB_ID: 3178163

=== GPU Information ===
Available GPUs:
GPU 0: NVIDIA RTX A6000 (UUID: GPU-b349d8f4-c2a8-bd4b-2ed8-4678cc3093ad)
GPU 1: NVIDIA RTX A6000 (UUID: GPU-6386b3c4-ba07-b55c-a8d8-1d7e38378b83)
GPU 2: NVIDIA RTX A6000 (UUID: GPU-8a6310cf-0811-4754-64db-8c4117d4be50)
GPU 3: NVIDIA RTX A6000 (UUID: GPU-99486ac3-a03a-16fa-0e46-8bade74f121a)

GPU Topology:
	GPU0	GPU1	GPU2	GPU3	NIC0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	SYS	SYS	SYS	SYS	1-2,7-8,129-130	0		N/A
GPU1	SYS	 X 	NV4	NODE	NODE	64-65,67,69	1		N/A
GPU2	SYS	NV4	 X 	NODE	NODE	64-65,67,69	1		N/A
GPU3	SYS	NODE	NODE	 X 	NODE	64-65,67,69	1		N/A
NIC0	SYS	NODE	NODE	NODE	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

What operating system are you using?

Linux

LitGPT Version

Version: 0.4.0
nightingal3 added the bug label on Nov 19, 2024

sadrafh commented Dec 2, 2024

Same error!

nightingal3 (Author) commented

I fixed this issue by downgrading litdata from the latest version to 0.2.17 and increasing the number of workers in the dataloader. There seems to be some compatibility issue with the newer versions of litdata (besides freezing, it also sometimes segfaults).
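
For reference, the downgrade is just a version pin (a sketch, assuming pip manages the environment):

# Pin litdata to 0.2.17; newer releases hung (and sometimes segfaulted) in my setup.
pip install "litdata==0.2.17"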


sadrafh commented Dec 2, 2024

Thank you for sharing that. How do you increase the number of workers in the dataloader?

nightingal3 (Author) commented

I set num_workers in the data config dataclass back to 8. I had somehow changed it to 1 at some point, which made things very slow.
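
The same setting can also be passed on the command line (a sketch; the other arguments are elided):

# Equivalent of setting num_workers=8 in the data config dataclass.
litgpt pretrain <model_name> --data FineWebDataset --data.num_workers 8 <other args...>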


sadrafh commented Dec 3, 2024

Thank you


sadrafh commented Dec 3, 2024

I used litdata 0.2.17 for my pretraining experiments and kept the default setting of num_workers=8. During execution, I encountered the following warning:

/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of workers in the current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might cause the DataLoader to run slowly or even freeze. Lower the worker number to avoid potential slowness/freeze if necessary.

The only modification I made to the pretraining code was adding time.perf_counter() to log the time before and after the final checkpoint save operation. Despite this warning, the DataLoader seemed to function, but the system still freezes.
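
My guess (an assumption on my part, not verified) is that the suggested maximum reflects the number of CPU cores visible to the process, which under Slurm comes from the CPU allocation, e.g.:

# Assumption: request enough cores per task for the dataloader workers; otherwise
# PyTorch only sees the allocated cores (apparently 1 here) and warns.
#SBATCH --cpus-per-task=8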

I use Llama3-70B, but the maximum workload I can see for my transaction data is around 700M. I want to increase this amount, and I'm not sure whether there is a limit imposed by litgpt.
