Megatron-DeepSpeed @ ALCF

Important

train_aGPT_7B.sh is the main entry point for launching distributed training on {Polaris, Aurora, Sunspot} @ ALCF.

🏃‍♂️ Running

To launch on {Polaris, Aurora, Sunspot} @ ALCF:

  1. ⏳ Request an interactive job with qsub -I:
    qsub -A <your-project> -q debug -l select=2 -l walltime=01:00:00,filesystems=eagle:home -I
    • Alternatively, you can submit train_aGPT_7B.sh directly as a batch script:

      cd Megatron-DeepSpeed
      qsub -A <your-project> -q debug -l select=2 -l walltime=01:00:00,filesystems=eagle:home train_aGPT_7B.sh
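
    • After submitting, you can check on the job with the standard PBS qstat command (a quick sanity check, unrelated to this repo's scripts):

      # list your queued / running PBS jobs
      qstat -u $USER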
  2. ⬇️ Clone repo + navigate into it:
    git clone "https://github.com/argonne-lcf/Megatron-DeepSpeed"
    cd Megatron-DeepSpeed
  3. 🐍 Setup Python:

    NOTE: The following commands should be run from Megatron-DeepSpeed, following the cd command from step 2.

    1. Load conda module and activate base environment:

      export PBS_O_WORKDIR=$(pwd) && source ALCF/helpers.sh && ezpz_setup
      • [output]:
        • [Polaris]:
          # [05:47:13 PM][foremans@x3001c0s13b1n0][/eagle/a/f/p/ar/Megatron-DeepSpeed-D/Megatron-DeepSpeed]
          $ PBS_O_WORKDIR=$(pwd) source ALCF/helpers.sh && setup_python
          Using WORKING_DIR: /eagle/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed
          No conda_prefix or virtual_env found in environment...
          Setting up conda...
          Running on Polaris !!
          
          Lmod is automatically replacing "nvhpc/23.9" with "gcc-native/12.3".
          
          
          Lmod is automatically replacing "PrgEnv-nvhpc/8.5.0" with "PrgEnv-gnu/8.5.0".
          
          
          Due to MODULEPATH changes, the following have been reloaded:
            1) cray-mpich/8.1.28
          
          Found conda at: /soft/applications/conda/2024-04-29/mconda3
          No VIRTUAL_ENV found in environment!
              - Trying to setup from /soft/applications/conda/2024-04-29/mconda3
              - Using VENV_DIR=/eagle/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/venvs/2024-04-29
              - Found existing venv, activating from /eagle/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/venvs/2024-04-29
          [python] Using: /eagle/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/venvs/2024-04-29/bin/python3
        • [Aurora]:
          # [10:04:02 PM][foremans@x4415c0s2b0n0][/gecko/A/fo/p/a/Megatron-DeepSpeed]
          $ PBS_O_WORKDIR=$(pwd) source ALCF/helpers.sh && setup_python
          Using WORKING_DIR: /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed
          No conda_prefix or virtual_env found in environment...
          Setting up conda...
          
          The following have been reloaded with a version change:
            1) intel_compute_runtime/release/821.36 => intel_compute_runtime/release/803.29     2) oneapi/eng-compiler/2024.04.15.002 => oneapi/release/2024.1
          
          Found conda at: /opt/aurora/24.086.0/frameworks/aurora_nre_models_frameworks-2024.1
          No VIRTUAL_ENV found in environment!
              - Trying to setup from /opt/aurora/24.086.0/frameworks/aurora_nre_models_frameworks-2024.1
              - Using VENV_DIR=/gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.1
              - Found existing venv, activating from /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.1
          [python] Using: /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.1/bin/python3
        • [Sunspot]:
          # [05:37:18 PM][foremans@x1921c0s0b0n0][/gila/A/fo/p/a/Megatron-DeepSpeed]
          $ PBS_O_WORKDIR=$(pwd) source ALCF/helpers.sh && setup_python
          Using WORKING_DIR: /gila/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed
          No conda_prefix or virtual_env found in environment...
          Setting up conda...
          Running on SunSpot !!
          
          Due to MODULEPATH changes, the following have been reloaded:
            1) gcc/12.2.0             5) mpich-config/collective-tuning/1024
            2) gmp/6.2.1-pcxzkau      6) mpich/icc-all-pmix-gpu/20231026
            3) mpc/1.3.1-dfagrna      7) oneapi/eng-compiler/2024.04.15.002
            4) mpfr/4.2.0-w7v7yjv
          
          The following have been reloaded with a version change:
            1) intel_compute_runtime/release/821.36 => intel_compute_runtime/release/775.20
            2) spack-pe-gcc/0.7.0-24.086.0 => spack-pe-gcc/0.6.1-23.275.2
               UMD: agama-ci-devel-803.29 successfully loaded:
               UMD: graphics-compute-runtime/agama-ci-devel-803.29
          
          The following have been reloaded with a version change:
            1) oneapi/eng-compiler/2024.04.15.002 => oneapi/release/2024.04.15.001
          
          Found conda at: /soft/datascience/aurora_nre_models_frameworks-2024.1_preview_u1
          No VIRTUAL_ENV found in environment!
              - Trying to setup from /soft/datascience/aurora_nre_models_frameworks-2024.1_preview_u1
              - Using VENV_DIR=/gila/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.1_preview_u1
              - Found existing venv, activating from /gila/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.1_preview_u1
          [python] Using: /lus/gila/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.1_preview_u1/bin/python3
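
      Once this completes, you can confirm that the virtual environment's Python is the one on your PATH (a quick check that should match the "[python] Using: ..." line printed above):

      which python3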
    2. 🍋 Install ezpz:

      mkdir deps && git clone https://github.com/saforem2/ezpz deps/ezpz
      python3 -m pip install -e deps/ezpz --require-virtualenv
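
      To confirm the editable install worked, a quick import check can be handy (a minimal sanity check, not part of the original instructions):

      python3 -c 'import ezpz; print(ezpz.__file__)'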
    3. Setup wandb:

      NOTE: wandb logging can be disabled by setting export WANDB_DISABLED=1.
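
      For example, to authenticate with W&B before launching, or to opt out entirely (a sketch, assuming the wandb package is available in the active environment):

      # log in to Weights & Biases (only needed once per user/machine)
      wandb login
      # ...or disable W&B logging for this run
      export WANDB_DISABLED=1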

  4. 🚀 Launch:

    In this example, we train a ~2B model (10 layers) for 1000 iterations using the data file list in:

    ALCF/data-lists/polaris/books.txt

    with a micro-batch size of 2 (MICRO_BATCH=2) and the torch.optim.AdamW optimizer (OPT=adamw).

    Note that any of the options in the setParams function from ALCF/helpers.sh can be overridden dynamically at runtime using this technique.

    # for systems other than Polaris, replace "polaris/books.txt" below with
    # "{aurora,sunspot}/books.txt"
    PBS_O_WORKDIR=$(pwd) DATA_FILE_LIST=./ALCF/data-lists/polaris/books.txt TRAIN_ITER=1000 NLAYERS=10 MICRO_BATCH=2 OPT=adamw bash train_aGPT_7B.sh
    • Note: If no additional options are specified, i.e.

      PBS_O_WORKDIR=$(pwd) bash train_aGPT_7B.sh

      then this will fall back to using the default AuroraGPT-7B architecture with the full Dolma (v1.7) dataset (see also the sketch below).

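    • To capture the full run output in a timestamped log file and run in the background (a minimal sketch, echoing the Aurora example shown below; the filename is only a suggestion):

      LOGFILE="train-$(date +%Y%m%d-%H%M%S).log"
      PBS_O_WORKDIR=$(pwd) DATA_FILE_LIST=./ALCF/data-lists/polaris/books.txt \
          bash train_aGPT_7B.sh > "${LOGFILE}" 2>&1 &
      tail -f "${LOGFILE}"
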
    [output]:

    The output should look something like the following, though YMMV (things change quickly):

    [Aurora]:
    #[🌌][10:45:59 AM][foremans@x4711c1s2b0n0][…/Megatron-DeepSpeed][🌱 main][$!?]
    $ export PBS_O_WORKDIR=$(pwd) && source ALCF/helpers.sh && setup_python
    
    #[🌌][10:46:57 AM][foremans@x4711c1s2b0n0][…/Megatron-DeepSpeed][🌱 main][$!?][aurora_nre_models_frameworks-2024.1]
    (aurora_nre_models_frameworks-2024.1) $ PBS_O_WORKDIR=$(pwd) DATA_FILE_LIST=./ALCF/data-lists/aurora/books.txt bash train_aGPT_7B.sh > train-log-$(tstamp).log 2>&1 &
    
    Using WORKING_DIR: /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed
    Running on: aurora
    Using virtual_env: /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.1 on top of conda from: /opt/aurora/24.086.0/frameworks/aurora_nre_models_frameworks-2024.1
    [python] Using: /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.1/bin/python3
    Ensuring all dependencies from /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/ALCF/requirements/requirements.txt installed...
    
    [notice] A new release of pip is available: 24.0 -> 24.1
    [notice] To update, run: pip install --upgrade pip
    ┌─────────────────────────────────────────────────────────────────────┐
    │ [savejobenv]:
    │     • Writing PBS vars to: /home/foremans/.pbsenv
    └─────────────────────────────────────────────────────────────────────┘
    ┌─────────────────────────────────────────────────────────────────────┐
    │ [HOSTS]:
    │     • [host:0] - x4711c1s2b0n0.hostmgmt2711.cm.aurora.alcf.anl.gov
    │     • [host:1] - x4711c1s3b0n0.hostmgmt2711.cm.aurora.alcf.anl.gov
    └─────────────────────────────────────────────────────────────────────┘
    ┌─────────────────────────────────────────────────────────────────────┐
    │ [DIST INFO]:
    │     • HOSTFILE=/var/spool/pbs/aux/684084.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    │     • NHOSTS=2
    │     • NGPU_PER_HOST=12
    │     • NGPUS=24
    └─────────────────────────────────────────────────────────────────────┘
    ┌─────────────────────────────────────────────────────────────────────┐
    │ [LAUNCH]:
    │     • To launch across all available GPUs, use:
    │       'launch' ( = mpiexec --verbose --envall -n 24 -ppn 12 --hostfile /var/spool/pbs/aux/684084.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov )
    └─────────────────────────────────────────────────────────────────────┘
    2024-06-21 10:47:09,771 - numexpr.utils - INFO - Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
    2024-06-21 10:47:09,772 - numexpr.utils - INFO - Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
    2024-06-21 10:47:09,772 - numexpr.utils - INFO - NumExpr defaulting to 8 threads.
    /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.1/lib/python3.9/site-packages/pandas/core/computation/expressions.py:21: UserWarning: Pandas requires version '2.8.4' or n>
      from pandas.core.computation.check import NUMEXPR_INSTALLED
    /opt/aurora/24.086.0/frameworks/aurora_nre_models_frameworks-2024.1/lib/python3.9/runpy.py:127: RuntimeWarning: 'ezpz.jobs' found in sys.modules after import of package 'ezpz', but prior to execution of 'ezpz.jobs'; this may result in u>
      warn(RuntimeWarning(msg))
    [2024-06-21 10:47:10][INFO][jobs:366] - Caught PBS_JOBID='684084.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov', pbsnf=PosixPath('/var/spool/pbs/aux/684084.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov') from env. Saving jobenv!
    [2024-06-21 10:47:10][WARNING][jobs:117] - /home/foremans/PBS-jobs/684084  already in /home/foremans/PBS-jobs.log,  not appending !!
    [2024-06-21 10:47:10][INFO][jobs:192] - Saving job env to /home/foremans/PBS-jobs/684084/jobenv.sh
    [2024-06-21 10:47:10][INFO][jobs:220] - Saving job env to /home/foremans/PBS-jobs/684084/jobenv.json
    [2024-06-21 10:47:10][INFO][jobs:233] - Saving job env to /home/foremans/PBS-jobs/684084/jobenv.yaml
    [2024-06-21 10:47:10][INFO][jobs:137] - Saving job env to .jobenv file in  /home/foremans/PBS-jobs/684084/.jobenv
    [2024-06-21 10:47:10][INFO][jobs:137] - Saving job env to .jobenv file in  /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/.jobenv
    [2024-06-21 10:47:10][WARNING][jobs:154] - To use launch alias, be sure to:  source /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/.jobenv
    [2024-06-21 10:47:10][INFO][jobs:277] - Writing PBS env vars to  /home/foremans/PBS-jobs/684084 / jobenv{.sh, .yaml, .json}
    [2024-06-21 10:47:10][WARNING][jobs:281] - Run: source ./.jobenv in your current shell to set job variables
    [2024-06-21 10:47:10][INFO][jobs:374] -
    [DIST_INFO]:
      • DEVICE=xpu
      • DEVICE_ID=xpu:0
      • DISTRIBUTED_BACKEND=ccl
      • GPUS_PER_NODE=12
      • HOSTS=['x4711c1s2b0n0', 'x4711c1s3b0n0']
      • HOSTFILE=/var/spool/pbs/aux/684084.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
      • HOSTNAME=x4711c1s2b0n0.hostmgmt2711.cm.aurora.alcf.anl.gov
      • LOCAL_RANK=0
      • MACHINE=Aurora
      • NUM_NODES=2
      • NGPUS=24
      • NODE_ID=0
      • RANK=0
      • SCHEDULER=PBS
      • WORLD_SIZE_TOTAL=24
      • WORLD_SIZE_IN_USE=1
    [2024-06-21 10:47:10][CRITICAL][jobs:245] - To launch across ALL GPUs in your job, use:
    LAUNCH_CMD=mpiexec --verbose --envall -n 24 -ppn 12 --hostfile /var/spool/pbs/aux/684084.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    creating alias launch=mpiexec --verbose --envall -n 24 -ppn 12 --hostfile /var/spool/pbs/aux/684084.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    Found ezpz!
    
    [notice] A new release of pip is available: 24.0 -> 24.1
    [notice] To update, run: pip install --upgrade pip
    Done with ezpz.
    Not using flash-attn!!
    LR_ARGS: --lr 0.0003 --lr-decay-style cosine --lr-warmup-fraction 0.05
    DS_CONFIG: /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/ds-configs/ds_stage1_mb4_gb768_pp1_bf16.json
    ZS: 1, MB: 4, GB: 768, PP: 1, DTYPE: bf16
     Please see logs at: logs/ws24_ds_stage1_nl32_hs4096_mb4_seq4096_gb768_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05/20240621-104713_24_x4711c1s2b0n0.hostmgmt2711.cm.aurora.alcf.anl.gov
    Checkpoints will be saved to: checkpoints/ws24_ds_stage1_nl32_hs4096_mb4_seq4096_gb768_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05
    !! Caught USE_ACTIVATION_CHECKPOINTING=1 !!
    !! Caught USE_ACTIVATION_CHECKPOINTING=1 !!
    Setting up tokenizer with Llama2
    Using data_file_list: ./ALCF/data-lists/aurora/books.txt
    Using tokenizer: Llama2. Setting up data with ./ALCF/data-lists/aurora/books.txt
    Calling:  setData() with ./ALCF/data-lists/aurora/books.txt
    --------------------
    Updated environment:
    DATA_FILE_LIST: ./ALCF/data-lists/aurora/books.txt
    NUM_DOCS: 3
     WEIGHT_SUM: 0.0072042092147565125
    DFL_STEM: books
    DATA_CACHE_PATH: .cache/books/index-cache
    DATA_FLAGS:  --data-file-list ./ALCF/data-lists/aurora/books.txt
    --------------------
    [setData] DATA_FLAGS:  --data-file-list ./ALCF/data-lists/aurora/books.txt
    [setData] TOKENIZER_FLAGS: --tokenizer-type Llama2Tokenizer --tokenizer-model /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/ALCF/tokenizer.model
    Requirement already satisfied: pybind11 in ./venvs/aurora_nre_models_frameworks-2024.1/lib/python3.9/site-packages (2.12.0)
    
    [notice] A new release of pip is available: 24.0 -> 24.1
    [notice] To update, run: pip install --upgrade pip
    make: Nothing to be done for 'default'.
    /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed
    ++++++++++++++++++++++++++++++++++++++++++++++++++
    - MPICH_DIR=/opt/aurora/24.086.0/CNDA/mpich/20231026/mpich-ofi-all-icc-default-pmix-gpu-drop20231026
    - Using /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.1/bin/python3
    - WORLD_SIZE:24
    - BACKEND: ccl
    - MODEL_TYPE: llama-seq4096-pp1-tp1-32layers-32heads-4096hidden
    - Using DATA_FILE_LIST: ./ALCF/data-lists/aurora/books.txt
    ++++++++++++++++++++++++++++++++++++++++++++++++++
    
    Currently Loaded Modules:
      1) mpich/icc-all-pmix-gpu/20231026       3) libfabric/1.15.2.0   5) cray-libpals/1.3.3            7) gmp/6.2.1-pcxzkau    9) mpc/1.3.1-dfagrna  11) intel_compute_runtime/release/803.29  13) frameworks/2024.1
      2) mpich-config/collective-tuning/1024   4) cray-pals/1.3.3      6) spack-pe-gcc/0.7.0-24.086.0   8) mpfr/4.2.0-w7v7yjv  10) gcc/12.2.0         12) oneapi/release/2024.1
    
    
    
    Saving environment to checkpoints/ws24_ds_stage1_nl32_hs4096_mb4_seq4096_gb768_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05/.env
    Not currently running. Continuing!
    Launching with: MPICH
     mpiexec --verbose --envall -n 24 -ppn 12 --hostfile /var/spool/pbs/aux/684084.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --genvall --cpu-bind depth -d 16 /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.1/bin/python3 -Wignore /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/pretrain_gpt_alcf.py
    Using data_cache_path: checkpoints/ws24_ds_stage1_nl32_hs4096_mb4_seq4096_gb768_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05/.cache/books/index-cache
    
            mpiexec --verbose --envall -n 24 -ppn 12 --hostfile /var/spool/pbs/aux/684084.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --genvall --cpu-bind depth -d 16 /gecko/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.1/bin/python3 -Wignore /lus/gecko/projects/Aurora_deployment/foremans/projects/argonne-lcf/Megatron-DeepSpeed/pretrain_gpt_alcf.py         --bf16                 --split 100,0,0         --log-interval 1         --no-bias-gelu-fusion         --no-bias-dropout-fusion         --no-masked-softmax-fusion         --no-gradient-accumulation-fusion        >
    
    [!! NOTE] View output at:
     logs/ws24_ds_stage1_nl32_hs4096_mb4_seq4096_gb768_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05/20240621-104713_24_x4711c1s2b0n0.hostmgmt2711.cm.aurora.alcf.anl.gov/output.log
    Connected to tcp://x4711c1s2b0n0.hostmgmt2711.cm.aurora.alcf.anl.gov:7919
    Launching application eafe3e80-ad2e-4cee-a3e4-d63af2a77c66
    [2024-06-21 10:47:31,610] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2024-06-21 10:47:31,610] [INFO] [comm.py:637:init_distributed] cdb=None
    [2024-06-21 10:47:31,610] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=15, local_rank=3, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=23, local_rank=11, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=12, local_rank=0, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=13, local_rank=1, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=3, local_rank=3, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=14, local_rank=2, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=16, local_rank=4, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=17, local_rank=5, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=18, local_rank=6, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=19, local_rank=7, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=20, local_rank=8, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=21, local_rank=9, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=22, local_rank=10, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=6, local_rank=6, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=8, local_rank=8, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=10, local_rank=10, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend ccl
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=1, local_rank=1, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=2, local_rank=2, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=4, local_rank=4, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=5, local_rank=5, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=7, local_rank=7, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=9, local_rank=9, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:31,611] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=11, local_rank=11, world_size=24, master_addr=10.115.79.12, master_port=29500
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=2/23][local_rank=2/11][node=0/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=6/23][local_rank=6/11][node=0/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=8/23][local_rank=8/11][node=0/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=1/23][local_rank=1/11][node=1/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=15/23][local_rank=3/11][node=1/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=17/23][local_rank=5/11][node=1/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=3/23][local_rank=3/11][node=1/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=13/23][local_rank=1/11][node=1/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=14/23][local_rank=2/11][node=0/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=4/23][local_rank=4/11][node=0/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=18/23][local_rank=6/11][node=0/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=20/23][local_rank=8/11][node=0/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=5/23][local_rank=5/11][node=1/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=21/23][local_rank=9/11][node=1/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=7/23][local_rank=7/11][node=1/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=23/23][local_rank=11/11][node=1/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=9/23][local_rank=9/11][node=1/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=12/23][local_rank=0/11][node=0/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=16/23][local_rank=4/11][node=0/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=10/23][local_rank=10/11][node=0/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=11/23][local_rank=11/11][node=1/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=19/23][local_rank=7/11][node=1/1]
    [2024-06-21 10:47:32][INFO][dist:291] - [device='xpu'][rank=22/23][local_rank=10/11][node=0/1]
    [2024-06-21 10:47:32][INFO][dist:240] - DistInfo={
        "DEVICE": "xpu",
        "DEVICE_ID": "xpu:0",
        "DISTRIBUTED_BACKEND": "ccl",
        "GPUS_PER_NODE": 12,
        "HOSTFILE": "/var/spool/pbs/aux/684084.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov",
        "HOSTNAME": "x4711c1s2b0n0.hostmgmt2711.cm.aurora.alcf.anl.gov",
        "HOSTS": "['x4711c1s2b0n0', 'x4711c1s3b0n0']",
        "LOCAL_RANK": 0,
        "MACHINE": "Aurora",
        "NGPUS": 24,
        "NODE_ID": 0,
        "NUM_NODES": 2,
        "RANK": 0,
        "SCHEDULER": "PBS",
        "WORLD_SIZE_IN_USE": 24,
        "WORLD_SIZE_TOTAL": 24
    }
    
    # [...clipped...]
    
    [2024-06-21 10:48:48][INFO][utils:307] - > elapsed time for building blendable dataset indices: 1.19 (sec)
    [2024-06-21 10:48:48][INFO][utils:307] -  > saving index map files
    [2024-06-21 10:48:51][INFO][utils:307] -  > finished saving index map files in 3.0829622745513916 seconds
    [2024-06-21 10:48:51][INFO][utils:307] - > loading blendable dataset index: checkpoints/ws24_ds_stage1_nl32_hs4096_mb4_seq4096_gb768_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05/.cache/books/index-cache/49e9529a32d0a98f1e40f4a82872b11c_index.npy
    [2024-06-21 10:48:52][INFO][utils:307] - > loading blendable dataset sample index: checkpoints/ws24_ds_stage1_nl32_hs4096_mb4_seq4096_gb768_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05/.cache/books/index-cache/49e9529a32d0a98f1e40f4a82872b11c_sample_index.npy
    [2024-06-21 10:48:52][INFO][utils:307] - > finished loading in 0.30188989639282227 seconds
    [2024-06-21 10:48:52][INFO][utils:307] -  >> building dataset for /gecko/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0002_text_document
    [2024-06-21 10:48:52][INFO][utils:307] -  > building dataset index ...
    [2024-06-21 10:48:52][INFO][utils:307] -     reading sizes...
    [2024-06-21 10:48:52][INFO][utils:307] -     reading pointers...
    [2024-06-21 10:48:52][INFO][utils:307] -     reading document index...
    [2024-06-21 10:48:52][INFO][utils:307] -     creating numpy buffer of mmap...
    [2024-06-21 10:48:52][INFO][utils:307] - /gecko/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0002_text_document.bin
    [2024-06-21 10:48:52][INFO][utils:307] -     creating memory view of numpy buffer...
    [2024-06-21 10:48:52][INFO][utils:307] -  > finished creating indexed dataset in 0.003112 seconds
    [2024-06-21 10:48:52][INFO][utils:307] -     number of documents: 7386
    [2024-06-21 10:48:52][INFO][utils:307] -  > dataset split:
    [2024-06-21 10:48:52][INFO][utils:307] -     train:
    [2024-06-21 10:48:52][INFO][utils:307] -      document indices in [0, 7386) total of 7386 documents
    [2024-06-21 10:48:52][INFO][utils:307] -     validation:
    [2024-06-21 10:48:52][INFO][utils:307] -      document indices in [7386, 7386) total of 0 documents
    [2024-06-21 10:48:52][INFO][utils:307] -     test:
    [2024-06-21 10:48:52][INFO][utils:307] -      document indices in [7386, 7386) total of 0 documents
    [2024-06-21 10:48:52][INFO][utils:307] -  > loading doc-idx mapping from checkpoints/ws24_ds_stage1_nl32_hs4096_mb4_seq4096_gb768_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05/.cache/books/index-cache/1fa7757ef8907da21e1e1326705e7f3f_doc_idx.npy
    [2024-06-21 10:48:52][INFO][utils:307] -  > loading sample-idx mapping from checkpoints/ws24_ds_stage1_nl32_hs4096_mb4_seq4096_gb768_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05/.cache/books/index-cache/1fa7757ef8907da21e1e1326705e7f3f_sample_idx.npy
    [2024-06-21 10:48:52][INFO][utils:307] -  > loading shuffle-idx mapping from checkpoints/ws24_ds_stage1_nl32_hs4096_mb4_seq4096_gb768_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05/.cache/books/index-cache/1fa7757ef8907da21e1e1326705e7f3f_shuffle_idx.npy
    [2024-06-21 10:48:52][INFO][utils:307] -     loaded indexed file in 0.008 seconds
    [2024-06-21 10:48:52][INFO][utils:307] -     total number of samples: 34196233
    [2024-06-21 10:48:52][INFO][utils:307] -     total number of epochs: 175
    [2024-06-21 10:48:52][INFO][utils:307] - > size of blendable dataset: 245361763 samples
    [2024-06-21 10:48:52][INFO][utils:307] -  >>> Finished building BlendableDataset in 4.613574266433716 seconds
    [2024-06-21 10:48:52][INFO][pretrain_gpt_alcf:579] - > finished creating GPT datasets. Took: 45730179865763.24219s
    [2024-06-21 10:48:53][INFO][training:88] - [after dataloaders are built] datetime=2024-06-21 10:48:53
    [2024-06-21 10:48:53][INFO][training:307] - done with setup ...
    [2024-06-21 10:48:53][INFO][training:313] - training ...
    (min, max) time across ranks (ms):
        model-and-optimizer-setup ......................: (63763.34, 63857.25)
        train/valid/test-data-iterators-setup ..........: (12936.53, 13432.64)
    [2024-06-21 10:48:53][INFO][training:88] - [before the start of training step] datetime=2024-06-21 10:48:53
    [2024-06-21 10:48:53,396] [INFO] [checkpointing.py:541:forward] Activation Checkpointing Information
    [2024-06-21 10:48:53,396] [INFO] [checkpointing.py:542:forward] ----Partition Activations False, CPU CHECKPOINTING False
    [2024-06-21 10:48:53,396] [INFO] [checkpointing.py:543:forward] ----contiguous Memory Checkpointing False with 32 total layers
    [2024-06-21 10:48:53,396] [INFO] [checkpointing.py:545:forward] ----Synchronization False
    [2024-06-21 10:48:53,396] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
    [2024-06-21 10:50:42,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1867.64 | optimizer_gradients: 19.65 | optimizer_step: 46.07
    [2024-06-21 10:50:42,167] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[1.887433467970254e-08, 1.887433467970254e-08], mom=[(0.9, 0.999), (0.9, 0.999)]
    [2024-06-21 10:50:42,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 25341.72 | bwd_microstep: 77707.38 | bwd_inner_microstep: 75751.84 | bwd_allreduce_microstep: 1955.54 | step_microstep: 2218.38
    [2024-06-21 10:50:42,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 25341.72 | bwd: 77707.38 | bwd_inner: 75751.84 | bwd_allreduce: 1955.54 | step: 2218.38
    [2024-06-21 10:50:42][INFO][training:1609] -  iteration=       1/  317892 | consumed_samples=         768 | consumed_tokens=     3145728 | elapsed_time_per_iteration_ms=108893.2 | learning_rate=1.88743e-08 | global_batch_size=  768 | lm loss=11.133188 | loss_scale=1.0 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=7.053 | tokens_per_gpu_per_second_tgs=1203.674 | [LM]-TFLOPs=49.66 | [DS]-TFLOPs=73.32 |
    [2024-06-21 10:50:42][INFO][utils:190] - [Rank 0] (after 1 iterations) memory (MB) | allocated: 18243.64111328125 | max allocated: 50664.2548828125 | reserved: 54556.0 | max reserved: 54556.0
    (min, max) time across ranks (ms):
        forward-backward ...............................: (106622.81, 106624.28)
        optimizer ......................................: (2221.02, 2234.98)
    [Sunspot]:
    # [09:07:32 AM][foremans@x1921c0s0b0n0][~/q/llm.devkit/Megatron-DeepSpeed][🌱 main][$!?]
    $ PBS_O_WORKDIR=$(pwd) DATA_FILE_LIST=./ALCF/data-lists/polaris/books.txt bash train_aGPT_7B.sh
    source-ing /lus/gila/projects/Aurora_deployment/foremans/q4-drop_sunspot/llm.devkit/Megatron-DeepSpeed/ALCF/helpers.sh
    Sourcing /home/foremans/q4-drop_sunspot/llm.devkit/setenv.sh...
         UMD: agama-ci-devel-736.9 successfully loaded:
         UMD: graphics-compute-runtime/agama-ci-devel-736.9 
    Lmod has detected the following error: The following module(s) are unknown: "gcc/12.1.0"
    
    Please check the spelling or version number. Also try "module spider ..."
    It is also possible your cache file is out-of-date; it may help to try:
      $ module --ignore_cache load "gcc/12.1.0"
    
    Also make sure that all modulefiles written in TCL start with the string #%Module
    
    Note: the module "intel_compute_runtime/release/agama-devel-647" cannot be unloaded because it was not loaded.
    
    Running on SunSpot !!
    [python] Using: /home/foremans/miniconda3/envs/q4-drop/bin/python3
    Saving {PATH, LD_LIBRARY_PATH, htt{p,ps}_proxy, CFLAGS, PYTHONUSERBASE} to .deepspeed_env
    Found ezpz!
    /lus/gila/projects/Aurora_deployment/foremans/locations/sunspot/projects/saforem2/ezpz/src/ezpz/__init__.py
    Has ezpz installed. Nothing to do.
    Done with ezpz.
    ┌───────────────────────────────────────────────────────────────────
    │ Writing PBS vars to /home/foremans/.pbsenv
    │ HOSTFILE: /var/spool/pbs/aux/8988430.amn-0001
    │ NHOSTS: 2
    │ NGPU_PER_HOST: 12 GPUs per host
    │ NGPUS: 24 GPUs total
    └───────────────────────────────────────────────────────────────────
    ┌──────────────────────────────────────────────────────────────────
    │ [Hosts]: 
    │     • [host:0] - x1921c0s0b0n0.hostmgmt2000.cm.americas.sgi.com
    │     • [host:1] - x1921c0s1b0n0.hostmgmt2000.cm.americas.sgi.com
    └──────────────────────────────────────────────────────────────────
    ┌──────────────────────────────────────────────────────────────────
    │ [DIST INFO]: 
    │     • Loading job env from: /home/foremans/.pbsenv
    │     • HOSTFILE: /var/spool/pbs/aux/8988430.amn-0001
    │     • NHOSTS: 2
    │     • NGPU_PER_HOST: 12
    │     • NGPUS (NHOSTS x NGPU_PER_HOST): 24
    │     • WORLD_SIZE: 24
    │     • DIST_LAUNCH: mpiexec --verbose --envall -n 24 -ppn 12 --hostfile /var/spool/pbs/aux/8988430.amn-0001
    └──────────────────────────────────────────────────────────────────
    ┌──────────────────────────────────────────────────────────────────
    │ [Launch]:
    │     • Use: 'launch' (=mpiexec --verbose --envall -n 24 -ppn 12 --hostfile /var/spool/pbs/aux/8988430.amn-0001)
    │       to launch job
    └──────────────────────────────────────────────────────────────────
    DS_CONFIG: ds_stage2_mb4_gb96_pp1_bf16.json
    ZS: 2, CPU_OPTIMIZER: , MB: 4, GB: 96, PP: 1, DTYPE: bf16!!!Please see logs at logs/ds_stage2_nl32_hs4096_mb4_seq4096_gb96_pp1_tp1_bf16/0404090742_x1921c0s0b0n0
    !! Caught USE_ACTIVATION_CHECKPOINTING=1 !!
    !! Caught USE_ACTIVATION_CHECKPOINTING=1 !!
    Calling:  setData() with ./convergence_debug_small.txt
    --------------------
    Updated environment:
    DATA_FILE_LIST: ./convergence_debug_small.txt
    NUM_DOCS: 15
     WEIGHT_SUM: 15.0
    DFL_STEM: convergence_debug_small
    DATA_CACHE_PATH: /lus/gila/projects/Aurora_deployment/foremans/q4-drop_sunspot/llm.devkit/Megatron-DeepSpeed/.cache/convergence_debug_small/index-cache
    --------------------
    ++++++++++++++++++++++++++++++++++++++++++++++++++
    - MPICH_DIR=
    - Using /home/foremans/miniconda3/envs/q4-drop/bin/python3
    - WORLD_SIZE:24
    - NCCL: nccl
    - MODEL_TYPE: llama-seq4096-pp1-tp1-32layers-32heads-4096hidden
    - Using DATA_FILE_LIST: ./convergence_debug_small.txt
    ++++++++++++++++++++++++++++++++++++++++++++++++++
    ! Using /home/foremans/miniconda3/envs/q4-drop/bin/deepspeed
    /home/foremans/miniconda3/envs/q4-drop/bin/ds_report:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
      __import__('pkg_resources').require('deepspeed==0.12.3+6ea44d02')
    /home/foremans/miniconda3/envs/q4-drop/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you dont plan on using image function
    ality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torch
    vision` from source?
      warn(
    [2024-04-04 09:07:45,585] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to xpu (auto detect)
    [2024-04-04 09:07:45,818] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to xpu (auto detect)
    --------------------------------------------------
    DeepSpeed C++/CUDA extension op report
    --------------------------------------------------
    NOTE: Ops not installed will be just-in-time (JIT) compiled at
          runtime if needed. Op compatibility means that your system
          meet the required dependencies to JIT install the op.
    --------------------------------------------------
    JIT compiled ops requires ninja
    ninja .................. [OKAY]
    --------------------------------------------------
    op name ................ installed .. compatible
    --------------------------------------------------
    async_io ............... [NO] ....... [OKAY]
    cpu_adagrad ............ [NO] ....... [OKAY]
    cpu_adam ............... [NO] ....... [OKAY]
    flash_attn ............. [NO] ....... [OKAY]
    fused_adam ............. [NO] ....... [OKAY]
    quantizer .............. [NO] ....... [OKAY]
    transformer ............ [NO] ....... [OKAY]
    transformer_inference .. [NO] ....... [OKAY]
    utils .................. [NO] ....... [OKAY]
    --------------------------------------------------
    DeepSpeed general environment info:
    torch install path ............... ['/home/foremans/miniconda3/envs/q4-drop/lib/python3.9/site-packages/torch']
    torch version .................... 2.1.0a0+cxx11.abi
    deepspeed install path ........... ['/lus/gila/projects/Aurora_deployment/foremans/q4-drop_sunspot/llm.devkit/DeepSpeed/deepspeed']
    deepspeed info ................... 0.12.3+6ea44d02, 6ea44d02, HEAD
    deepspeed wheel compiled w. ...... torch 2.1 
    shared memory (/dev/shm) size .... 503.18 GB
    
        deepspeed --hostfile /lus/gila/projects/Aurora_deployment/foremans/q4-drop_sunspot/llm.devkit/Megatron-DeepSpeed/hostfile_deepspeed --launcher MPICH /lus/gila/projects/Aurora_deployment/
    foremans/q4-drop_sunspot/llm.devkit/Megatron-DeepSpeed/pretrain_gpt_alcf.py     --bf16     --optimizer adamw     --split 100,0,0     --log-interval 1     --no-bias-gelu-fusion     --lr-decay
    -style cosine     --no-bias-dropout-fusion     --no-masked-softmax-fusion     --tokenizer-type Llama2Tokenizer     --no-gradient-accumulation-fusion     --accumulate-allreduce-grads-in-fp32 
        --use-checkpoint-opt_param-scheduler     --tensorboard-dir checkpoints/ds_stage2_nl32_hs4096_mb4_seq4096_gb96_pp1_tp1_bf16/tensorboard     --log-timers-to-tensorboard     --log-optimizer
    -states-to-tensorboard     --lr 0.0003     --save checkpoints/ds_stage2_nl32_hs4096_mb4_seq4096_gb96_pp1_tp1_bf16     --load checkpoints/ds_stage2_nl32_hs4096_mb4_seq4096_gb96_pp1_tp1_bf16  
       --seq-length 4096     --num-layers 32     --hidden-size 4096     --train-iters 317892     --eval-iters 10     --distributed-backend ccl     --num-attention-heads 32     --save-interval 20
    0     --eval-interval 50000     --max-position-embeddings 4096     --micro-batch-size 4     --data-file-list ./convergence_debug_small.txt     --tensor-model-parallel-size 1     --global-bat
    ch-size 96     --pipeline-model-parallel-size 1     --num-key-value-heads 8     --data-cache-path /lus/gila/projects/Aurora_deployment/foremans/q4-drop_sunspot/llm.devkit/Megatron-DeepSpeed/
    .cache/convergence_debug_small/index-cache     --ffn-hidden-size 11008     --tokenizer-model /home/foremans/q4-drop_sunspot/llm.devkit/Megatron-DeepSpeed/ALCF/tokenizer.model     --no-query-
    key-layer-scaling --use-rotary-position-embeddings --untie-embeddings-and-output-weights --swiglu --normalization rmsnorm --disable-bias-linear      --deepspeed-activation-checkpointing  --z
    ero-stage=2  --deepspeed_config=ds_stage2_mb4_gb96_pp1_bf16.json  --no-pipeline-parallel  --deepspeed       --checkpoint-activations --checkpoint-num-layers 1           |& tee logs/ds_stage2
    _nl32_hs4096_mb4_seq4096_gb96_pp1_tp1_bf16/0404090742_x1921c0s0b0n0/output.log
    
    [!! NOTE] View output at:
    logs/ds_stage2_nl32_hs4096_mb4_seq4096_gb96_pp1_tp1_bf16/0404090742_x1921c0s0b0n0/output.log
    
    # ...
    
    /gila/Aurora_deployment/AuroraGPT/datasets/dolma/data_Llama2Tokenizer/common-crawl/cc_en_middle/cc_en_middle-0051_text_document.bin
        creating memory view of numpy buffer...
     > finished creating indexed dataset in 0.010017 seconds
        number of documents: 1498927
     > dataset split:
        train:
         document indices in [0, 1498927) total of 1498927 documents
        validation:
         document indices in [1498927, 1498927) total of 0 documents
        test:
         document indices in [1498927, 1498927) total of 0 documents
     > loading doc-idx mapping from /lus/gila/projects/Aurora_deployment/foremans/q4-drop_sunspot/llm.devkit/Megatron-DeepSpeed/.cache/convergence_debug_small/index-cache/bf90c74a625ac2ee4de6e1d6f7f84fbb_doc_idx.npy
     > loading sample-idx mapping from /lus/gila/projects/Aurora_deployment/foremans/q4-drop_sunspot/llm.devkit/Megatron-DeepSpeed/.cache/convergence_debug_small/index-cache/bf90c74a625ac2ee4de6e1d6f7f84fbb_sample_idx.npy
     > loading shuffle-idx mapping from /lus/gila/projects/Aurora_deployment/foremans/q4-drop_sunspot/llm.devkit/Megatron-DeepSpeed/.cache/convergence_debug_small/index-cache/bf90c74a625ac2ee4de6e1d6f7f84fbb_shuffle_idx.npy
        loaded indexed file in 0.056 seconds
        total number of samples: 2318461
        total number of epochs: 8
    > loading blendable dataset index: /lus/gila/projects/Aurora_deployment/foremans/q4-drop_sunspot/llm.devkit/Megatron-DeepSpeed/.cache/convergence_debug_small/index-cache/3a426af74008c22f9db24db811aad6b7_index.npy
    > loading blendable dataset sample index: /lus/gila/projects/Aurora_deployment/foremans/q4-drop_sunspot/llm.devkit/Megatron-DeepSpeed/.cache/convergence_debug_small/index-cache/3a426af74008c22f9db24db811aad6b7_sample_index.npy
    /home/foremans/miniconda3/envs/q4-drop/lib/python3.9/site-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 2 worker processes in total. Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
    
    [after dataloaders are built] datetime: 2024-04-04 09:09:27
    done with setup ...
    (min, max) time across ranks (ms):
        model-and-optimizer-setup ......................: (64818.18, 64858.22)
        train/valid/test-data-iterators-setup ..........: (1968.10, 2288.56)
    training ...
    [before the start of training step] datetime: 2024-04-04 09:09:27
    [2024-04-04 09:09:27,718] [INFO] [checkpointing.py:540:forward] Activation Checkpointing Information
    [2024-04-04 09:09:27,719] [INFO] [checkpointing.py:541:forward] ----Partition Activations False, CPU CHECKPOINTING False
    [2024-04-04 09:09:27,719] [INFO] [checkpointing.py:542:forward] ----contiguous Memory Checkpointing False with 32 total layers
    [2024-04-04 09:09:27,719] [INFO] [checkpointing.py:544:forward] ----Synchronization False
    [2024-04-04 09:09:27,719] [INFO] [checkpointing.py:545:forward] ----Profiling time in checkpointing False
    [2024-04-04 09:09:33][INFO][utils:145] - Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
    [2024-04-04 09:09:33][INFO][utils:148] - Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
    [2024-04-04 09:09:33][INFO][utils:160] - NumExpr defaulting to 8 threads.
    [2024-04-04 09:09:53,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 884.11 | optimizer_gradients: 6.43 | optimizer_step: 23.44
    [2024-04-04 09:09:53,312] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[0.00029999999999267505, 0.00029999999999267505], mom=[(0.9, 0.999), (0.9, 0.999)]
    [2024-04-04 09:09:53,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 6567.68 | bwd_microstep: 17950.36 | bwd_inner_microstep: 17711.20 | bwd_allreduce_microstep: 239.11 | step_microstep: 1139.27
    [2024-04-04 09:09:53,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6567.66 | bwd: 17950.35 | bwd_inner: 17711.19 | bwd_allreduce: 239.11 | step: 1139.29
    [Rank 0] (after 1 iterations) memory (MB) | allocated: 18244.640625 | max allocated: 41299.50146484375 | reserved: 46764.0 | max reserved: 46764.0
     iteration        1/  317892 | consumed samples:           96 | consumed tokens:       393216 | elapsed time per iteration (ms): 25849.1 | learning rate: 3.000E-04 | global batch size:    96 | lm loss: 1.117136E+01 | loss scale: 1.0 | actual seqlen:  4096 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 3.714 | tokens per gpu per second(tgs): 633.832 | TFLOPs: 38.61 |
    [2024-04-04 09:10:13,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 327.85 | optimizer_gradients: 6.26 | optimizer_step: 23.60
    [2024-04-04 09:10:13,619] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[0.00029999999997070033, 0.00029999999997070033], mom=[(0.9, 0.999), (0.9, 0.999)]
    [2024-04-04 09:10:13,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 4022.74 | bwd_microstep: 15738.67 | bwd_inner_microstep: 15556.80 | bwd_allreduce_microstep: 181.82 | step_microstep: 371.01
    [2024-04-04 09:10:13,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4022.73 | bwd: 15738.66 | bwd_inner: 15556.62 | bwd_allreduce: 181.81 | step: 371.02
     iteration        2/  317892 | consumed samples:          192 | consumed tokens:       786432 | elapsed time per iteration (ms): 20298.3 | learning rate: 3.000E-04 | global batch size:    96 | lm loss: 2.537718E+01 | loss scale: 1.0 | actual seqlen:  4096 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 4.729 | tokens per gpu per second(tgs): 807.159 | TFLOPs: 49.17 |
    [Polaris]:
    # [09:31:35 AM][foremans@x3112c0s13b0n0][~/pol/p/a/Megatron-DeepSpeed][🌱 main][$!?]
    $ PBS_O_WORKDIR=$(pwd) DATA_FILE_LIST=./ALCF/data-lists/polaris/books.txt OPT=adamw bash train_aGPT_7B.sh
    source-ing /lus/eagle/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/ALCF/helpers.sh
    Running on Polaris !!
    
    [python] Using: /eagle/datascience/foremans/miniconda3/envs/cu118-pt221/bin/python3
    Saving {PATH, LD_LIBRARY_PATH, htt{p,ps}_proxy, CFLAGS, PYTHONUSERBASE} to .deepspeed_env
    Found ezpz!
    /lus/eagle/projects/datascience/foremans/tmp/Megatron-DeepSpeed/ezpz/src/ezpz/__init__.py
    Has ezpz installed. Nothing to do.
    Done with ezpz.
    ┌───────────────────────────────────────────────────────────────────
    │ Writing PBS vars to /home/foremans/.pbsenv
    │ HOSTFILE: /var/spool/pbs/aux/1822297.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
    │ NHOSTS: 2
    │ NGPU_PER_HOST: 4 GPUs per host
    │ NGPUS: 8 GPUs total
    └───────────────────────────────────────────────────────────────────
    ┌──────────────────────────────────────────────────────────────────
    │ [Hosts]: 
    │     • [host:0] - x3112c0s13b0n0.hsn.cm.polaris.alcf.anl.gov
    │     • [host:1] - x3112c0s13b1n0.hsn.cm.polaris.alcf.anl.gov
    └──────────────────────────────────────────────────────────────────
    ┌──────────────────────────────────────────────────────────────────
    │ [DIST INFO]: 
    │     • Loading job env from: /home/foremans/.pbsenv
    │     • HOSTFILE: /var/spool/pbs/aux/1822297.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
    │     • NHOSTS: 2
    │     • NGPU_PER_HOST: 4
    │     • NGPUS (NHOSTS x NGPU_PER_HOST): 8
    │     • WORLD_SIZE: 8
    │     • DIST_LAUNCH: mpiexec --verbose --envall -n 8 -ppn 4 --hostfile /var/spool/pbs/aux/1822297.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
    └──────────────────────────────────────────────────────────────────
    ┌──────────────────────────────────────────────────────────────────
    │ [Launch]:
    │     • Use: 'launch' (=mpiexec --verbose --envall -n 8 -ppn 4 --hostfile /var/spool/pbs/aux/1822297.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov)
    │       to launch job
    └──────────────────────────────────────────────────────────────────
    DS_CONFIG: ds_stage2_mb8_gb32_pp1_bf16.json
    ZS: 2, CPU_OPTIMIZER: , MB: 8, GB: 32, PP: 1, DTYPE: bf16!!!Please see logs at logs/ds_stage2_nl32_hs4096_mb8_seq4096_gb32_pp1_tp2_bf16/0404093534_x3112c0s13b0n0
    !! Caught USE_ACTIVATION_CHECKPOINTING=1 !!
    !! Caught USE_ACTIVATION_CHECKPOINTING=1 !!
    Calling:  setData() with "./convergence_debug_small.txt"
    --------------------
    Updated environment:
    DATA_FILE_LIST: ./convergence_debug_small.txt
    NUM_DOCS: 15
     WEIGHT_SUM: 15.0
    DFL_STEM: convergence_debug_small
    DATA_CACHE_PATH: /lus/eagle/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/.cache/convergence_debug_small/index-cache
    --------------------
    ++++++++++++++++++++++++++++++++++++++++++++++++++
    - MPICH_DIR=/opt/cray/pe/mpich/8.1.25/ofi/gnu/9.1
    - Using /eagle/datascience/foremans/miniconda3/envs/cu118-pt221/bin/python3
    - WORLD_SIZE:8
    - NCCL: nccl
    - MODEL_TYPE: llama-seq4096-pp1-tp2-32layers-32heads-4096hidden
    - Using DATA_FILE_LIST: ./convergence_debug_small.txt
    ++++++++++++++++++++++++++++++++++++++++++++++++++
    ! Using /eagle/datascience/foremans/miniconda3/envs/cu118-pt221/bin/deepspeed
    [2024-04-04 09:35:35,959] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda [auto detect]
    --------------------------------------------------
    DeepSpeed C++/CUDA extension op report
    --------------------------------------------------
    NOTE: Ops not installed will be just-in-time (JIT) compiled at
          runtime if needed. Op compatibility means that your system
          meet the required dependencies to JIT install the op.
    --------------------------------------------------
    JIT compiled ops requires ninja
    ninja .................. [OKAY]
    --------------------------------------------------
    op name ................ installed .. compatible
    --------------------------------------------------
    async_io ............... [NO] ....... [OKAY]
    fused_adam ............. [NO] ....... [OKAY]
    cpu_adam ............... [NO] ....... [OKAY]
    cpu_adagrad ............ [NO] ....... [OKAY]
    cpu_lion ............... [NO] ....... [OKAY]
     [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
    evoformer_attn ......... [NO] ....... [NO]
    fused_lamb ............. [NO] ....... [OKAY]
    fused_lion ............. [NO] ....... [OKAY]
    inference_core_ops ..... [NO] ....... [OKAY]
    cutlass_ops ............ [NO] ....... [OKAY]
    transformer_inference .. [NO] ....... [OKAY]
    quantizer .............. [NO] ....... [OKAY]
    ragged_device_ops ...... [NO] ....... [OKAY]
    ragged_ops ............. [NO] ....... [OKAY]
    random_ltd ............. [NO] ....... [OKAY]
     [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
     [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
    sparse_attn ............ [NO] ....... [NO]
    spatial_inference ...... [NO] ....... [OKAY]
    transformer ............ [NO] ....... [OKAY]
    stochastic_transformer . [NO] ....... [OKAY]
    --------------------------------------------------
    DeepSpeed general environment info:
    torch install path ............... ['/eagle/datascience/foremans/miniconda3/envs/cu118-pt221/lib/python3.12/site-packages/torch']
    torch version .................... 2.2.1
    deepspeed install path ........... ['/eagle/datascience/foremans/miniconda3/envs/cu118-pt221/lib/python3.12/site-packages/deepspeed']
    deepspeed info ................... 0.14.0, unknown, unknown
    torch cuda version ............... 11.8
    torch hip version ................ None
    nvcc version ..................... 11.8
    deepspeed wheel compiled w. ...... torch 2.2, cuda 11.8
    shared memory (/dev/shm) size .... 251.61 GB
    
        deepspeed --hostfile /lus/eagle/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/hostfile_deepspeed --launcher MPICH /lus/eagle/projects/datascienc
    e/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/pretrain_gpt_alcf.py     --bf16     --optimizer adamw     --split 100,0,0     --log-interval 1     --no-bias-gelu-fusion 
        --lr-decay-style cosine     --no-bias-dropout-fusion     --no-masked-softmax-fusion     --tokenizer-type Llama2Tokenizer     --no-gradient-accumulation-fusion     --accumulate-allreduce-
    grads-in-fp32     --use-checkpoint-opt_param-scheduler     --tensorboard-dir checkpoints/ds_stage2_nl32_hs4096_mb8_seq4096_gb32_pp1_tp2_bf16/tensorboard     --log-timers-to-tensorboard     -
    -log-optimizer-states-to-tensorboard     --lr 0.0003     --save checkpoints/ds_stage2_nl32_hs4096_mb8_seq4096_gb32_pp1_tp2_bf16     --load checkpoints/ds_stage2_nl32_hs4096_mb8_seq4096_gb32_
    pp1_tp2_bf16     --seq-length 4096     --num-layers 32     --hidden-size 4096     --train-iters 317892     --eval-iters 10     --distributed-backend nccl     --num-attention-heads 32     --s
    ave-interval 200     --eval-interval 50000     --max-position-embeddings 4096     --micro-batch-size 8     --data-file-list ./convergence_debug_small.txt     --tensor-model-parallel-size 2  
       --global-batch-size 32     --pipeline-model-parallel-size 1     --num-key-value-heads 8     --data-cache-path /lus/eagle/projects/datascience/foremans/locations/polaris/projects/argonne-l
    cf/Megatron-DeepSpeed/.cache/convergence_debug_small/index-cache     --ffn-hidden-size 11008     --tokenizer-model /home/foremans/polaris/projects/argonne-lcf/Megatron-DeepSpeed/ALCF/tokeniz
    er.model     --no-query-key-layer-scaling --use-rotary-position-embeddings --untie-embeddings-and-output-weights --swiglu --normalization rmsnorm --disable-bias-linear --use-flash-attn-v2   
       --deepspeed-activation-checkpointing  --zero-stage=2  --deepspeed_config=ds_stage2_mb8_gb32_pp1_bf16.json  --no-pipeline-parallel  --deepspeed       --checkpoint-activations --checkpoint-
    num-layers 1           |& tee logs/ds_stage2_nl32_hs4096_mb8_seq4096_gb32_pp1_tp2_bf16/0404093534_x3112c0s13b0n0/output.log
    
    [!! NOTE] View output at:
    logs/ds_stage2_nl32_hs4096_mb8_seq4096_gb32_pp1_tp2_bf16/0404093534_x3112c0s13b0n0/output.log
    
    # ...
    
    /eagle/datasets/dolma/data_Llama2Tokenizer/common-crawl/cc_en_middle/cc_en_middle-0051_text_document.bin
        creating memory view of numpy buffer...
     > finished creating indexed dataset in 0.001280 seconds
        number of documents: 1498927
     > dataset split:
        train:
         document indices in [0, 1498927) total of 1498927 documents
        validation:
         document indices in [1498927, 1498927) total of 0 documents
        test:
         document indices in [1498927, 1498927) total of 0 documents
     > loading doc-idx mapping from /lus/eagle/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/.cache/convergence_debug_small/index-cache/9217d94f3290abc2fddf9e87bff236d6_doc_idx.npy
     > loading sample-idx mapping from /lus/eagle/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/.cache/convergence_debug_small/index-cache/9217d94f3290abc2fddf9e87bff236d6_sample_idx.npy
     > loading shuffle-idx mapping from /lus/eagle/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/.cache/convergence_debug_small/index-cache/9217d94f3290abc2fddf9e87bff236d6_shuffle_idx.npy
        loaded indexed file in 0.004 seconds
        total number of samples: 869423
        total number of epochs: 3
    > loading blendable dataset index: /lus/eagle/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/.cache/convergence_debug_small/index-cache/a815d51f6752c6f486d94194ce95fb87_index.npy
    > loading blendable dataset sample index: /lus/eagle/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/.cache/convergence_debug_small/index-cache/a815d51f6752c6f486d94194ce95fb87_sample_index.npy
    > size of blendable dataset: 10223415 samples
    > finished creating GPT datasets ...
    [after dataloaders are built] datetime: 2024-04-04 09:36:07
    done with setup ...
    (min, max) time across ranks (ms):
        model-and-optimizer-setup ......................: (4794.78, 4795.23)
        train/valid/test-data-iterators-setup ..........: (589.69, 721.20)
    training ...
    [before the start of training step] datetime: 2024-04-04 09:36:07
    [2024-04-04 09:36:07,407] [INFO] [checkpointing.py:539:forward] Activation Checkpointing Information
    [2024-04-04 09:36:07,407] [INFO] [checkpointing.py:540:forward] ----Partition Activations False, CPU CHECKPOINTING False
    [2024-04-04 09:36:07,407] [INFO] [checkpointing.py:541:forward] ----contiguous Memory Checkpointing False with 32 total layers
    [2024-04-04 09:36:07,407] [INFO] [checkpointing.py:543:forward] ----Synchronization False
    [2024-04-04 09:36:07,407] [INFO] [checkpointing.py:544:forward] ----Profiling time in checkpointing False
    [2024-04-04 09:36:28,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1626.54 | optimizer_gradients: 19.29 | optimizer_step: 419.48
    [2024-04-04 09:36:28,430] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[0.00029999999999267505, 0.00029999999999267505], mom=[(0.9, 0.999), (0.9, 0.999)]
    [2024-04-04 09:36:28,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 11336.34 | bwd_microstep: 7134.73 | bwd_inner_microstep: 7090.02 | bwd_allreduce_microstep: 44.65 | step_microstep: 2564.02
    [2024-04-04 09:36:28,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 11336.33 | bwd: 7134.75 | bwd_inner: 7090.01 | bwd_allreduce: 44.66 | step: 2564.02
     iteration        1/  317892 | consumed samples:           32 | consumed tokens:       131072 | elapsed time per iteration (ms): 21133.8 | learning rate: 3.000E-04 | global batch size:    32 | lm loss: 1.119983E+01 | loss scale: 1.0 | actual seqlen:  4096 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.514 | tokens per gpu per second(tgs): 775.250 | TFLOPs: 47.23 |
    [Rank 1] (after 1 iterations) memory (MB) | allocated: 14165.525390625 | max allocated: 22332.37255859375 | reserved: 24642.0 | max reserved: 35824.0
    [Rank 0] (after 1 iterations) memory (MB) | allocated: 14165.525390625 | max allocated: 22332.37255859375 | reserved: 24642.0 | max reserved: 32994.0
    [2024-04-04 09:36:38,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1605.55 | optimizer_gradients: 11.56 | optimizer_step: 50.92
    [2024-04-04 09:36:38,623] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[0.00029999999997070033, 0.00029999999997070033], mom=[(0.9, 0.999), (0.9, 0.999)]
    [2024-04-04 09:36:38,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1395.17 | bwd_microstep: 6832.48 | bwd_inner_microstep: 6789.73 | bwd_allreduce_microstep: 42.70 | step_microstep: 1867.64
    [2024-04-04 09:36:38,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1395.15 | bwd: 6832.49 | bwd_inner: 6789.73 | bwd_allreduce: 42.71 | step: 1867.65
     iteration        2/  317892 | consumed samples:           64 | consumed tokens:       262144 | elapsed time per iteration (ms): 10154.3 | learning rate: 3.000E-04 | global batch size:    32 | lm loss: 1.766422E+01 | loss scale: 1.0 | actual seqlen:  4096 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 3.151 | tokens per gpu per second(tgs): 1613.503 | TFLOPs: 98.29 |
    
    # ...
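A quick sanity check on the numbers in the log above (a minimal sketch: it assumes the job ran on 2 Polaris nodes × 4 GPUs = 8 ranks, consistent with the `qsub -l select=2` request, and reads all other values off the printed command line). The data-parallel degree, the gradient-accumulation steps, and the reported `tokens per gpu per second (tgs)` can be reproduced by hand:

```python
# Back-of-the-envelope check of the logged throughput above.
# ASSUMPTION: 2 Polaris nodes x 4 GPUs = 8 ranks; flag values from the printed command line.
world_size = 8
tp, pp = 2, 1                       # --tensor-model-parallel-size, --pipeline-model-parallel-size
micro_batch, global_batch = 8, 32   # --micro-batch-size, --global-batch-size
seq_len = 4096                      # --seq-length

dp = world_size // (tp * pp)                          # data-parallel degree -> 4
grad_acc_steps = global_batch // (micro_batch * dp)   # -> 1

def tokens_per_gpu_per_sec(elapsed_ms: float) -> float:
    """Tokens processed per GPU per second for one iteration."""
    return (global_batch * seq_len) / (world_size * elapsed_ms / 1000.0)

print(dp, grad_acc_steps)                          # 4 1
print(round(tokens_per_gpu_per_sec(21133.8), 2))   # ~775.25  (iteration 1)
print(round(tokens_per_gpu_per_sec(10154.3), 2))   # ~1613.50 (iteration 2)
```

The roughly 2× jump in tgs between iterations 1 and 2 largely reflects one-time startup costs (kernel/JIT warm-up and caching) that inflate the first iteration's elapsed time.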

🚀 Submit as a batch job

$ cd Megatron-DeepSpeed
$ qsub -A <your-project> -q debug -l select=2 -l walltime=01:00:00,filesystems=eagle:home train_aGPT_7B.sh

📝 Data Preprocessing


AuroraGPT is trained on the Dolma dataset (initially v0; we are now in the process of moving to v6). For more details on the dataset, refer to https://huggingface.co/datasets/allenai/dolma. The downloaded Dolma dataset has already been preprocessed to remove duplicates (dedup) and to filter and mix the data (mixing). For more details, refer to https://github.com/allenai/dolma/tree/main/docs and https://github.com/vksastry/dolma_alcf/blob/main/ALCF/Readme.md.

Before training, the Dolma data must be tokenized with the model's tokenizer (LlamaTokenizer is what we currently use). Use the script below to tokenize the entire dataset; the example shown is for Polaris, and a sketch of what the tokenization step does follows the script.

cd /eagle/datasets/dolma/utils
./tokenization.sh
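Conceptually, this step maps each Dolma document (a JSON line with a `text` field) to a sequence of token ids using the Llama-2 SentencePiece model (`ALCF/tokenizer.model`), which are then packed into Megatron's indexed `.bin`/`.idx` dataset format. The sketch below illustrates only the encoding step and is not the actual pipeline (that is driven by `tokenization.sh`); the input/output file names are placeholders.

```python
# Minimal sketch of the tokenization step (NOT the actual pipeline in
# /eagle/datasets/dolma/utils/tokenization.sh): encode each Dolma document's
# "text" field with the Llama-2 SentencePiece model used in this repo.
import json
import sentencepiece as spm

# Placeholder paths for illustration only.
sp = spm.SentencePieceProcessor(model_file="ALCF/tokenizer.model")

with open("docs.jsonl") as f_in, open("docs.tokens.jsonl", "w") as f_out:
    for line in f_in:
        doc = json.loads(line)
        ids = sp.encode(doc["text"], out_type=int)  # list[int] of token ids
        f_out.write(json.dumps({"id": doc.get("id"), "tokens": ids}) + "\n")
```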

✅ TODOs

TODOs:
  • Ensure / double-check that the optimizer settings from ds_config.json aren't being silently overwritten by defaults in megatron/arguments.py
    • specifically momentum and beta{1, 2}, etc.; a quick cross-check sketch follows below
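
A starting point for that first TODO (a sketch, not part of the repo): dump the `optimizer` block from the DeepSpeed config used in the run above so it can be compared by eye against what Megatron logs at runtime (the `step=1 ... lr=[...], mom=[(beta1, beta2), ...]` lines) and against the argparse defaults in `megatron/arguments.py` (`--adam-beta1`, `--adam-beta2`, `--lr`). The config file name below is taken from the run log; adjust as needed, and note the config may not define an `optimizer` block at all, in which case Megatron's arguments are the only source of these settings.

```python
# Sketch: print the optimizer settings DeepSpeed would read from the config,
# to compare against Megatron's runtime log lines and arguments.py defaults.
import json

with open("ds_stage2_mb8_gb32_pp1_bf16.json") as f:  # file name from the run log above
    ds_cfg = json.load(f)

opt = ds_cfg.get("optimizer")
if opt is None:
    print("no `optimizer` block in the DeepSpeed config -> Megatron's args apply")
else:
    print("type  :", opt.get("type"))
    print("params:", json.dumps(opt.get("params", {}), indent=2))
```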
Completed
  • Continue runs on Polaris @

    • 48 Nodes
    • 32 Nodes
    • 16 Nodes
    • 8 Nodes
    • 4 Nodes
  • Then, try re-creating ( / fixing) conda with cuda==12.1

    • 😔, failed.
  • ‼️ Unable to save checkpoints with torch==2.1 + cuda==11.8:

    🐛 Bug
    • Training progresses OK:

      [2024-03-07 15:27:02,646] [INFO] [timer.py:260:stop] epoch=0/micro_step=199/global_step=199, RunningAvgSamplesPerSec=58.730622229657506, CurrSamplesPerSec=61.35304005128382, MemAllocated=6.01GB, MaxMemAllocated=19.52GB
      iteration      199/  317892 | consumed samples:       152832 | consumed tokens:    625999872 | elapsed time per iteration (ms): 14287.5 | learning rate: 2.407E-04 | global batch size:   768 | lm loss: 5.905366E+00 | loss scale: 8192.0 | actual seqlen:  4096 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 53.753 | tokens per gpu per second (tgs): 1146.733 | TFLOPs: 69.85 |
      [2024-03-07 15:27:15,063] [INFO] [logging.py:96:log_dist] [Rank 0] step=200, skipped=4, lr=[0.000240653265864008, 0.000240653265864008], mom=[(0.9, 0.999), (0.9, 0.999)]
      [2024-03-07 15:27:17,188] [INFO] [timer.py:260:stop] epoch=0/micro_step=200/global_step=200, RunningAvgSamplesPerSec=58.730745476291396, CurrSamplesPerSec=58.75503515561452, MemAllocated=6.01GB, MaxMemAllocated=19.52GB
      iteration      200/  317892 | consumed samples:       153600 | consumed tokens:    629145600 | elapsed time per iteration (ms): 14541.4 | learning rate: 2.407E-04 | global batch size:   768 | lm loss: 5.897035E+00 | loss scale: 8192.0 | actual seqlen:  4096 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 52.815 | tokens per gpu per second (tgs): 1126.713 | TFLOPs: 68.63 |
      saving checkpoint at iteration     200 to checkpoints/ds_stage2_nl32_hs4096_mb8_seq4096_gb768_pp1_tp2_fp16
      # ...
    • Then crashes with:

      Traceback (most recent call last):
      Traceback (most recent call last):
        File "/lus/eagle/projects/datascience/foremans/tmp/Megatron-DeepSpeed/pretrain_gpt_alcf.py", line 575, in <module>
          model = main()
        File "/lus/eagle/projects/datascience/foremans/tmp/Megatron-DeepSpeed/pretrain_gpt_alcf.py", line 554, in main
          model = pretrain(
        File "/lus/eagle/projects/datascience/foremans/tmp/Megatron-DeepSpeed/megatron/training.py", line 226, in pretrain
          iteration = train(forward_step_func,
        File "/lus/eagle/projects/datascience/foremans/tmp/Megatron-DeepSpeed/megatron/training.py", line 1290, in train
          save_checkpoint_and_time(iteration, model, optimizer,
        File "/lus/eagle/projects/datascience/foremans/tmp/Megatron-DeepSpeed/megatron/training.py", line 1151, in save_checkpoint_and_time
          save_checkpoint(iteration, model, optimizer, opt_param_scheduler)
        File "/lus/eagle/projects/datascience/foremans/tmp/Megatron-DeepSpeed/megatron/checkpointing.py", line 259, in save_checkpoint
          state_dict[UNIVERSAL_CHECKPOINT_INFO] = _universal_checkpoint_info(model)
        File "/lus/eagle/projects/datascience/foremans/tmp/Megatron-DeepSpeed/megatron/checkpointing.py", line 783, in _universal_checkpoint_info
          info.update(model[0].universal_checkpoint_info())
        File "/lus/eagle/projects/datascience/foremans/tmp/Megatron-DeepSpeed/megatron/model/gpt_model.py", line 203, in universal_checkpoint_info
          info[TP_REPLICATED_PARAMETER_PATTERNS] = self._get_tp_replicated_param_patterns()
        File "/lus/eagle/projects/datascience/foremans/miniconda3/envs/polaris/2024-03-06/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
          raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
      AttributeError: 'GPTModel' object has no attribute '_get_tp_replicated_param_patterns'

      🤔