
Pretraining an OLMo model on the SlimPajama dataset #1837

Open · aflah02 opened this issue Nov 26, 2024 · 16 comments
Labels: question (Further information is requested)

Comments

aflah02 (Contributor) commented Nov 26, 2024

Hi!
I am planning to test pretraining an OLMo 1B model on the SlimPajama dataset. I was trying to follow the TinyLlama tutorial, but one of the dataset-preparation steps uses the litgpt/data/prepare_slimpajama.py file, which seems to be missing from the repo. Any workarounds for this?

aflah02 added the question (Further information is requested) label on Nov 26, 2024
aflah02 (Contributor, Author) commented Nov 27, 2024

CC: @rasbt @Andrei-Aksionov
Just bumping this onto your radar, as this is a continuation of the OLMo PR.

Andrei-Aksionov (Collaborator) commented

Hello @aflah02
Good catch!
It looks like this file was accidentally deleted in one of the recent PRs: https://github.com/Lightning-AI/litgpt/pull/1821/files#diff-2646bbbf72cb6e84cfc29a226b4446985b6904dc04b6228ef8a69d9fcb4a2951

Could you bring it back in a PR?

aflah02 (Contributor, Author) commented Nov 27, 2024

Sure, I'll do that

aflah02 (Contributor, Author) commented Nov 29, 2024

I tried using the code to process the dataset; however, it doesn't seem to work for the train set due to disk-space issues. Is there a way to reduce how much data is copied to or kept in the tmp dir?

Error -

OSError: [Errno 28] No space left on device: '/NS/llm-1/static00/data/slimpajama-raw/train/chunk2/example_train_4825.jsonl.zst' -> '/tmp/data/chunk2/example_train_4825.jsonl.zst'
OSError: [Errno 28] No space left on device: '/NS/llm-1/static00/data/slimpajama-raw/train/chunk5/example_train_2785.jsonl.zst' -> '/tmp/data/chunk5/example_train_2785.jsonl.zst'
File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litdata/processing/data_processor.py", line 167, in _download_data_target
shutil.copyfile(path, local_path)
File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/shutil.py", line 269, in copyfile
_fastcopy_sendfile(fsrc, fdst)
File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/shutil.py", line 158, in _fastcopy_sendfile
raise err from None
File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/shutil.py", line 144, in _fastcopy_sendfile
sent = os.sendfile(outfd, infd, offset, blocksize)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 28] No space left on device: '/NS/llm-1/static00/data/slimpajama-raw/train/chunk4/example_train_401.jsonl.zst' -> '/tmp/data/chunk4/example_train_401.jsonl.zst'
Progress: 2%|██▎ | 1157/59166 [24:37<20:34:42, 1.28s/it]

aflah02 (Contributor, Author) commented Nov 29, 2024

A simple fix I'm using is to symlink /tmp/data to a directory on my NFS, where I have more storage, and then rerun the job. It seems to run for now (still in progress).
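
Roughly, the workaround looks like this (a Python sketch equivalent to a plain `ln -s`; the NFS path below is a placeholder):

```python
# Sketch of the workaround: point /tmp/data at an NFS directory with more space
# before running the preparation script. The NFS path is a placeholder.
import os

nfs_scratch = "/path/to/nfs/tmp-data"  # hypothetical directory with enough free space
os.makedirs(nfs_scratch, exist_ok=True)

if not os.path.exists("/tmp/data"):
    os.symlink(nfs_scratch, "/tmp/data")  # litdata workers will now write here instead
```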

aflah02 (Contributor, Author) commented Dec 4, 2024

Hey @Andrei-Aksionov @rasbt

I was trying to set up a multi-node run via SLURM and was testing it on 2 nodes with an Ethernet-based interconnect; however, the initialization fails -

/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python3.11 /NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_li ...
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/16
[W1204 13:17:19.038836275 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/16
[W1204 13:18:00.218036449 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 6] Seed set to 42
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/16
[W1204 13:18:00.580590861 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 1] Seed set to 42
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/16
[W1204 13:18:00.633244692 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 7] Seed set to 42
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/16
[W1204 13:18:01.815471680 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 3] Seed set to 42
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/16
[W1204 13:18:01.876939030 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 4] Seed set to 42
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/16
[W1204 13:18:01.918607039 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/16
[rank: 5] Seed set to 42
[W1204 13:18:01.934432434 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 2] Seed set to 42
Traceback (most recent call last):
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
           ^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 149, in setup
    fabric.launch()
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 843, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 928, in _wrap_and_launch
    return launcher.launch(to_run, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/strategies/launchers/subprocess_script.py", line 107, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 932, in _wrap_with_setup
    self._strategy.setup_environment()
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/strategies/fsdp.py", line 260, in setup_environment
    self._setup_distributed()
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/strategies/fsdp.py", line 671, in _setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 297, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1361, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 258, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 185, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
torch.distributed.DistStoreError: Timed out after 1801 seconds waiting for clients. 8/16 clients joined.
[rank4]: Traceback (most recent call last):
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
[rank4]:     sys.exit(main())
[rank4]:              ^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
[rank4]:     CLI(parser_data)
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
[rank4]:     return _run_component(component, init.get(subcommand))
[rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank4]:     return component(**cfg)
[rank4]:            ^^^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 155, in setup
[rank4]:     main(
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 215, in main
[rank4]:     train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank4]:                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 422, in get_dataloaders
[rank4]:     with fabric.rank_zero_first():
[rank4]:   File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
[rank4]:     return next(self.gen)
[rank4]:            ^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 635, in rank_zero_first
[rank4]:     with _InfiniteBarrier() as barrier:
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 425, in __enter__
[rank4]:     self.group = torch.distributed.new_group(backend="gloo", timeout=timedelta(days=10000))
[rank4]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank4]:     func_return = func(*args, **kwargs)
[rank4]:                   ^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank4]:     return _new_group_with_tag(
[rank4]:            ^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank4]:     pg, pg_store = _new_process_group_helper(
[rank4]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank4]:     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank4]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: torch.distributed.DistNetworkError: Connection reset by peer
[rank6]: Traceback (most recent call last):
[rank6]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
[rank6]:     sys.exit(main())
[rank6]:              ^^^^^^
[rank6]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
[rank6]:     CLI(parser_data)
[rank6]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
[rank6]:     return _run_component(component, init.get(subcommand))
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank6]:     return component(**cfg)
[rank6]:            ^^^^^^^^^^^^^^^^
[rank6]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 155, in setup
[rank6]:     main(
[rank6]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 215, in main
[rank6]:     train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank6]:                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 422, in get_dataloaders
[rank6]:     with fabric.rank_zero_first():
[rank6]:   File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
[rank6]:     return next(self.gen)
[rank6]:            ^^^^^^^^^^^^^^
[rank6]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 635, in rank_zero_first
[rank6]:     with _InfiniteBarrier() as barrier:
[rank6]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 425, in __enter__
[rank6]:     self.group = torch.distributed.new_group(backend="gloo", timeout=timedelta(days=10000))
[rank6]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank6]:     func_return = func(*args, **kwargs)
[rank6]:                   ^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank6]:     return _new_group_with_tag(
[rank6]:            ^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank6]:     pg, pg_store = _new_process_group_helper(
[rank6]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank6]:     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank6]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: torch.distributed.DistNetworkError: Connection reset by peer
[rank2]: Traceback (most recent call last):
[rank2]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
[rank2]:     sys.exit(main())
[rank2]:              ^^^^^^
[rank2]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
[rank2]:     CLI(parser_data)
[rank2]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
[rank2]:     return _run_component(component, init.get(subcommand))
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank2]:     return component(**cfg)
[rank2]:            ^^^^^^^^^^^^^^^^
[rank2]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 155, in setup
[rank2]:     main(
[rank2]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 215, in main
[rank2]:     train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank2]:                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 422, in get_dataloaders
[rank2]:     with fabric.rank_zero_first():
[rank2]:   File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
[rank2]:     return next(self.gen)
[rank2]:            ^^^^^^^^^^^^^^
[rank2]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 635, in rank_zero_first
[rank2]:     with _InfiniteBarrier() as barrier:
[rank2]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 425, in __enter__
[rank2]:     self.group = torch.distributed.new_group(backend="gloo", timeout=timedelta(days=10000))
[rank2]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank2]:     func_return = func(*args, **kwargs)
[rank2]:                   ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank2]:     return _new_group_with_tag(
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank2]:     pg, pg_store = _new_process_group_helper(
[rank2]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank2]:     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank2]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: torch.distributed.DistNetworkError: Connection reset by peer
[rank7]: Traceback (most recent call last):
[rank7]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
[rank7]:     sys.exit(main())
[rank7]:              ^^^^^^
[rank7]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
[rank7]:     CLI(parser_data)
[rank7]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
[rank7]:     return _run_component(component, init.get(subcommand))
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank7]:     return component(**cfg)
[rank7]:            ^^^^^^^^^^^^^^^^
[rank7]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 155, in setup
[rank7]:     main(
[rank7]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 215, in main
[rank7]:     train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank7]:                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 422, in get_dataloaders
[rank7]:     with fabric.rank_zero_first():
[rank7]:   File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
[rank7]:     return next(self.gen)
[rank7]:            ^^^^^^^^^^^^^^
[rank7]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 635, in rank_zero_first
[rank7]:     with _InfiniteBarrier() as barrier:
[rank7]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 425, in __enter__
[rank7]:     self.group = torch.distributed.new_group(backend="gloo", timeout=timedelta(days=10000))
[rank7]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank7]:     func_return = func(*args, **kwargs)
[rank7]:                   ^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank7]:     return _new_group_with_tag(
[rank7]:            ^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank7]:     pg, pg_store = _new_process_group_helper(
[rank7]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank7]:     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank7]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: torch.distributed.DistNetworkError: Connection reset by peer
[rank1]: Traceback (most recent call last):
[rank1]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
[rank1]:     sys.exit(main())
[rank1]:              ^^^^^^
[rank1]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
[rank1]:     CLI(parser_data)
[rank1]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
[rank1]:     return _run_component(component, init.get(subcommand))
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank1]:     return component(**cfg)
[rank1]:            ^^^^^^^^^^^^^^^^
[rank1]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 155, in setup
[rank1]:     main(
[rank1]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 215, in main
[rank1]:     train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank1]:                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 422, in get_dataloaders
[rank1]:     with fabric.rank_zero_first():
[rank1]:   File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
[rank1]:     return next(self.gen)
[rank1]:            ^^^^^^^^^^^^^^
[rank1]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 635, in rank_zero_first
[rank1]:     with _InfiniteBarrier() as barrier:
[rank1]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 425, in __enter__
[rank1]:     self.group = torch.distributed.new_group(backend="gloo", timeout=timedelta(days=10000))
[rank1]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank1]:     func_return = func(*args, **kwargs)
[rank1]:                   ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank1]:     return _new_group_with_tag(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank1]:     pg, pg_store = _new_process_group_helper(
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank1]:     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank1]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistNetworkError: Connection reset by peer
[rank5]: Traceback (most recent call last):
[rank5]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
[rank5]:     sys.exit(main())
[rank5]:              ^^^^^^
[rank5]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
[rank5]:     CLI(parser_data)
[rank5]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
[rank5]:     return _run_component(component, init.get(subcommand))
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank5]:     return component(**cfg)
[rank5]:            ^^^^^^^^^^^^^^^^
[rank5]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 155, in setup
[rank5]:     main(
[rank5]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 215, in main
[rank5]:     train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank5]:                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 422, in get_dataloaders
[rank5]:     with fabric.rank_zero_first():
[rank5]:   File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
[rank5]:     return next(self.gen)
[rank5]:            ^^^^^^^^^^^^^^
[rank5]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 635, in rank_zero_first
[rank5]:     with _InfiniteBarrier() as barrier:
[rank5]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 425, in __enter__
[rank5]:     self.group = torch.distributed.new_group(backend="gloo", timeout=timedelta(days=10000))
[rank5]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank5]:     func_return = func(*args, **kwargs)
[rank5]:                   ^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank5]:     return _new_group_with_tag(
[rank5]:            ^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank5]:     pg, pg_store = _new_process_group_helper(
[rank5]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank5]:     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank5]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]: torch.distributed.DistNetworkError: Connection reset by peer
[rank3]: Traceback (most recent call last):
[rank3]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
[rank3]:     sys.exit(main())
[rank3]:              ^^^^^^
[rank3]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
[rank3]:     CLI(parser_data)
[rank3]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
[rank3]:     return _run_component(component, init.get(subcommand))
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank3]:     return component(**cfg)
[rank3]:            ^^^^^^^^^^^^^^^^
[rank3]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 155, in setup
[rank3]:     main(
[rank3]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 215, in main
[rank3]:     train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank3]:                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 422, in get_dataloaders
[rank3]:     with fabric.rank_zero_first():
[rank3]:   File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
[rank3]:     return next(self.gen)
[rank3]:            ^^^^^^^^^^^^^^
[rank3]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 635, in rank_zero_first
[rank3]:     with _InfiniteBarrier() as barrier:
[rank3]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 425, in __enter__
[rank3]:     self.group = torch.distributed.new_group(backend="gloo", timeout=timedelta(days=10000))
[rank3]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank3]:     func_return = func(*args, **kwargs)
[rank3]:                   ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank3]:     return _new_group_with_tag(
[rank3]:            ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank3]:     pg, pg_store = _new_process_group_helper(
[rank3]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank3]:     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank3]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.distributed.DistNetworkError: Connection reset by peer

I also see this warning -

Warning: Not all GPUs are fully connected via NVLink. Some GPUs are connected via slower interfaces. It is recommended to switch to a different machine with faster GPU connections for optimal multi-GPU training performance.

Here's the config -


# The name of the model to pretrain. Choose from names in ``litgpt.config``. Mutually exclusive with
# ``model_config``. (type: Optional[str], default: null)
model_name: allenai/OLMo-1B-hf

# A ``litgpt.Config`` object to define the model architecture. Mutually exclusive with
# ``model_name``. (type: Optional[Config], default: null)
model_config:

# Directory in which to save checkpoints and logs. If running in a Lightning Studio Job, look for it in
# /teamspace/jobs/<job-name>/share. (type: <class 'Path'>, default: out/pretrain)
out_dir: out/pretrain/slim-olmo-2x8xH100-GBS-192

# The precision to use for pretraining. Possible choices: "bf16-true", "bf16-mixed", "32-true". (type: Optional[str], default: null)
precision: bf16-mixed

# Optional path to a checkpoint directory to initialize the model from.
# Useful for continued pretraining. Mutually exclusive with ``resume``. (type: Optional[Path], default: null)
initial_checkpoint_dir:

# Path to a checkpoint directory to resume from in case training was interrupted, or ``True`` to resume
# from the latest checkpoint in ``out_dir``. An error will be raised if no checkpoint is found. Passing
# ``'auto'`` will resume from the latest checkpoint but not error if no checkpoint exists.
# (type: Union[bool, Literal["auto"], Path], default: False)
resume: false

# Data-related arguments. If not provided, the default is ``litgpt.data.TinyLlama``.
data: MicroLlama
# Path - /NS/llm-1/static00/data/slimpajama

# Training-related arguments. See ``litgpt.args.TrainArgs`` for details
train:

  # Number of optimizer steps between saving checkpoints (type: Optional[int], default: 1000)
  save_interval: 100000

  # Number of iterations between logging calls (type: int, default: 1)
  log_interval: 1

  # Number of samples between optimizer steps across data-parallel ranks (type: int, default: 48)
  # Scale this number according to the number of GPU and memory size per GPU
  # For example, we used 48 for 4 x 24G 4090 
  global_batch_size: 192

  # Number of samples per data-parallel rank (type: int, default: 12)
  # Scale this number according to the memory size per GPU
  # For example, we used 12 for 24G 4090
  micro_batch_size: 12

  # Number of iterations with learning rate warmup active (type: int, default: 2000)
  lr_warmup_steps: 2000

  # Number of epochs to train on (type: Optional[int], default: null)
  epochs:

  # Total number of tokens to train on (type: Optional[int], default: 3000000000000)
  max_tokens: 3000000000000

  # Limits the number of optimizer steps to run. (type: Optional[int], default: null)
  max_steps:

  # Limits the length of samples. Off by default (type: Optional[int], default: null)
  max_seq_length: 2048

  # Whether to tie the embedding weights with the language modeling head weights. (type: Optional[bool], default: False)
  tie_embeddings:

  #   (type: Optional[float], default: 1.0)
  max_norm: 1.0

  #   (type: float, default: 4e-05)
  min_lr: 4.0e-05

# Evaluation-related arguments. See ``litgpt.args.EvalArgs`` for details
eval:

  # Number of optimizer steps between evaluation calls (type: int, default: 1000)
  interval: 1000

  # Number of tokens to generate (type: Optional[int], default: null)
  max_new_tokens:

  # Number of iterations (type: int, default: 100)
  max_iters: 100

  # Whether to evaluate on the validation set at the beginning of the training
  initial_validation: false

# Optimizer-related arguments
optimizer:

  class_path: torch.optim.AdamW
  
  init_args:
    
    #   (type: float, default: 0.001)
    lr: 4e-4
    
    #   (type: float, default: 0.01)
    weight_decay: 0.1
    
    #   (type: tuple, default: (0.9,0.999))
    betas:
      - 0.9
      - 0.95

# How many devices/GPUs to use. Uses all GPUs by default. (type: Union[int, str], default: auto)
devices: auto

# How many nodes to use. (type: int, default: 1)
num_nodes: 2

# Optional path to the tokenizer dir that was used for preprocessing the dataset. Only some data
# module require this. (type: Optional[Path], default: null)
tokenizer_dir: checkpoints/allenai/OLMo-1B-hf

# The name of the logger to send metrics to. (type: Literal['wandb', 'tensorboard', 'csv'], default: tensorboard)
logger_name: wandb

# The random seed to use for reproducibility. (type: int, default: 42)
seed: 42

This is my run command -

sbatch --partition=a100 --nodes=2 --gres=gpu:8 --cpus-per-task=32 --mem=244G --exclude=sws-3a100grid-01 --time=8-00:00 --output=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xH100-GBS-192.out --error=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xH100-GBS-192.err --job-name=litgpt-olmo-pretrain-slimpajama-2x8xH100-GBS-192 --wrap "litgpt pretrain --config config_hub/pretrain/slimolmo.yaml --data.data_path /NS/llm-1/static00/data/"

The code works when running on 1 node.

Any clue what might be going wrong? I am using SLURM, by the way.

aflah02 (Contributor, Author) commented Dec 4, 2024

nvidia-smi on the nodes (before the timeout-based crash) -

[nvidia-smi screenshots of both nodes]

So one of the nodes doesn't really load anything

aflah02 (Contributor, Author) commented Dec 4, 2024

I just realized that the error message and this tutorial (https://lightning.ai/docs/fabric/stable/guide/multi_node/slurm.html) imply I should use srun. Running with this now -

sbatch --partition=a100 --nodes=2 --gres=gpu:8 --cpus-per-task=32 --mem=244G --exclude=sws-3a100grid-01 --time=8-00:00 --output=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192.out --error=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192.err --job-name=litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192 --wrap "srun litgpt pretrain --config config_hub/pretrain/slimolmo.yaml --data.data_path /NS/llm-1/static00/data/"

aflah02 (Contributor, Author) commented Dec 4, 2024

This command works -

sbatch --partition=a100 --nodes=2 --gres=gpu:8 --ntasks-per-node=8 --mem=244G --exclude=sws-3a100grid-01 --time=8-00:00 --output=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192.out --error=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192.err --job-name=litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192 --wrap "srun litgpt pretrain --config config_hub/pretrain/slimolmo.yaml --data.data_path /NS/llm-1/static00/data/"

But when I look at W&B, I only see logs for one node (even though the loss is aggregated prior to backprop, I don't see any device stats for the other node).
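
As a sanity check, a small script run with srun inside the same allocation should show whether all 16 ranks across both nodes are actually being launched (just a sketch; the filename is arbitrary):

```python
# check_slurm_env.py (hypothetical name): run as `srun python check_slurm_env.py`
# inside the sbatch allocation; each of the 16 tasks should print its own rank info.
import os
import socket

keys = [
    "SLURM_JOB_ID", "SLURM_NNODES", "SLURM_NTASKS",
    "SLURM_NODEID", "SLURM_PROCID", "SLURM_LOCALID",
]
print(socket.gethostname(), {k: os.environ.get(k) for k in keys})
```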

aflah02 (Contributor, Author) commented Dec 16, 2024

Hi @Andrei-Aksionov @rasbt
I was trying to figure out the best batch size for pretraining OLMo 1B on A100 machines. I tried a lot of different batch sizes, but everything OOMs except for batch size 12, which is quite surprising, as that is the recommended batch size for TinyLlama on a 24 GB 4090, while I'm testing on an 80 GB A100. Any ideas what could be going wrong?

Here is my config -


# The name of the model to pretrain. Choose from names in ``litgpt.config``. Mutually exclusive with
# ``model_config``. (type: Optional[str], default: null)
model_name: allenai/OLMo-1B-hf

# A ``litgpt.Config`` object to define the model architecture. Mutually exclusive with
# ``model_name``. (type: Optional[Config], default: null)
model_config:

# Directory in which to save checkpoints and logs. If running in a Lightning Studio Job, look for it in
# /teamspace/jobs/<job-name>/share. (type: <class 'Path'>, default: out/pretrain)
out_dir: out/pretrain/slim-olmo-1x1xA100-GBS-24

# The precision to use for pretraining. Possible choices: "bf16-true", "bf16-mixed", "32-true". (type: Optional[str], default: null)
precision: bf16-mixed

# Optional path to a checkpoint directory to initialize the model from.
# Useful for continued pretraining. Mutually exclusive with ``resume``. (type: Optional[Path], default: null)
initial_checkpoint_dir:

# Path to a checkpoint directory to resume from in case training was interrupted, or ``True`` to resume
# from the latest checkpoint in ``out_dir``. An error will be raised if no checkpoint is found. Passing
# ``'auto'`` will resume from the latest checkpoint but not error if no checkpoint exists.
# (type: Union[bool, Literal["auto"], Path], default: False)
resume: false

# Data-related arguments. If not provided, the default is ``litgpt.data.TinyLlama``.
data: MicroLlama
# Path - /NS/llm-1/static00/data/slimpajama

# Training-related arguments. See ``litgpt.args.TrainArgs`` for details
train:

  # Number of optimizer steps between saving checkpoints (type: Optional[int], default: 1000)
  save_interval: 100000

  # Number of iterations between logging calls (type: int, default: 1)
  log_interval: 1

  # Number of samples between optimizer steps across data-parallel ranks (type: int, default: 48)
  # Scale this number according to the number of GPU and memory size per GPU
  # For example, we used 48 for 4 x 24G 4090 
  global_batch_size: 24

  # Number of samples per data-parallel rank (type: int, default: 12)
  # Scale this number according to the memory size per GPU
  # For example, we used 12 for 24G 4090
  micro_batch_size: 24

  # Number of iterations with learning rate warmup active (type: int, default: 2000)
  lr_warmup_steps: 2000

  # Number of epochs to train on (type: Optional[int], default: null)
  epochs:

  # Total number of tokens to train on (type: Optional[int], default: 3000000000000)
  max_tokens: 3000000000000

  # Limits the number of optimizer steps to run. (type: Optional[int], default: null)
  max_steps:

  # Limits the length of samples. Off by default (type: Optional[int], default: null)
  max_seq_length: 2048

  # Whether to tie the embedding weights with the language modeling head weights. (type: Optional[bool], default: False)
  tie_embeddings:

  #   (type: Optional[float], default: 1.0)
  max_norm: 1.0

  #   (type: float, default: 4e-05)
  min_lr: 4.0e-05

# Evaluation-related arguments. See ``litgpt.args.EvalArgs`` for details
eval:

  # Number of optimizer steps between evaluation calls (type: int, default: 1000)
  interval: 1000

  # Number of tokens to generate (type: Optional[int], default: null)
  max_new_tokens:

  # Number of iterations (type: int, default: 100)
  max_iters: 100

  # Whether to evaluate on the validation set at the beginning of the training
  initial_validation: false

# Optimizer-related arguments
optimizer:

  class_path: torch.optim.AdamW
  
  init_args:
    
    #   (type: float, default: 0.001)
    lr: 4e-4
    
    #   (type: float, default: 0.01)
    weight_decay: 0.1
    
    #   (type: tuple, default: (0.9,0.999))
    betas:
      - 0.9
      - 0.95

# How many devices/GPUs to use. Uses all GPUs by default. (type: Union[int, str], default: auto)
devices: auto

# How many nodes to use. (type: int, default: 1)
num_nodes: 1

# Optional path to the tokenizer dir that was used for preprocessing the dataset. Only some data
# module require this. (type: Optional[Path], default: null)
tokenizer_dir: checkpoints/allenai/OLMo-1B-hf

# The name of the logger to send metrics to. (type: Literal['wandb', 'tensorboard', 'csv'], default: tensorboard)
logger_name: wandb

# The random seed to use for reproducibility. (type: int, default: 42)
seed: 42

I tried on a single GPU as well as 8xA100 machines and I get the same OOMs

aflah02 (Contributor, Author) commented Dec 16, 2024

I looked at the numbers from the Pythia paper: while training the 1B model, they were able to use a batch size of 16 on a 40 GB A100, but I can't use that for OLMo 1B despite having a 2x larger GPU.

aflah02 (Contributor, Author) commented Dec 16, 2024

Here's the WANDB GPU Usage Chart for Batch Size 16 -

[W&B GPU usage chart screenshot]

Andrei-Aksionov (Collaborator) commented

Just a guess: try doing some memory profiling.
I can imagine that you will find a spike in memory consumption caused by one of the examples.

 # Limits the length of samples. Off by default (type: Optional[int], default: null)
  max_seq_length: 2048

It might be that there are only a couple of samples in the training set that reach this length, and because of the spike they cause, you cannot enlarge the batch size.

But it's only a guess :)
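
If you want to check that quickly, something along these lines should do (only a sketch; adjust the glob path to your raw data location, and note that character counts are just a rough proxy for token counts):

```python
# Sketch: inspect raw document lengths in one SlimPajama chunk to see whether a few
# very long samples exist. The glob path is a placeholder.
import glob
import io
import json

import zstandard as zstd

lengths = []
for path in glob.glob("/path/to/slimpajama-raw/train/chunk1/*.jsonl.zst"):
    with open(path, "rb") as fh:
        reader = io.TextIOWrapper(zstd.ZstdDecompressor().stream_reader(fh), encoding="utf-8")
        for line in reader:
            lengths.append(len(json.loads(line)["text"]))

if lengths:
    lengths.sort()
    print(f"docs={len(lengths)}  p50={lengths[len(lengths) // 2]}  "
          f"p99={lengths[int(len(lengths) * 0.99)]}  max={lengths[-1]}")
```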

aflah02 (Contributor, Author) commented Dec 16, 2024

I do plan to, but I think even if the entire batch were this long it should still not OOM, as Pythia had the same sequence length and a GPU half the size, yet still worked with larger batch sizes.

Andrei-Aksionov (Collaborator) commented

> I looked at the numbers from the Pythia paper: while training the 1B model, they were able to use a batch size of 16 on a 40 GB A100, but I can't use that for OLMo 1B despite having a 2x larger GPU.

To better isolate the problem, could you try to repeat Pythia 1B with a batch size of 40?

aflah02 (Contributor, Author) commented Dec 16, 2024

Thanks, I'll do that.

Also, is there a simple way to use the profiler when pretraining, or do I need to modify pretrain.py and add the profiler manually?
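
If it has to be manual, I'd probably wrap a few iterations with torch.profiler, roughly like this (a self-contained sketch with a stand-in model, not the actual litgpt training loop):

```python
# Sketch: profile memory over a few forward/backward passes with torch.profiler.
# The Linear model is only a stand-in; the same wrapper would go around a handful
# of iterations of the real training loop in pretrain.py.
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(2048, 2048).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
) as prof:
    for _ in range(5):
        x = torch.randn(16, 2048, device="cuda")
        loss = model(x).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))
```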
