selective compilation - norm layers only #320

Closed
wants to merge 151 commits

Conversation

lessw2020 (Contributor) commented May 10, 2024

This PR adds the option to selectively compile just the norm layers, and is mainly targeted at RMSNorm.
By compiling just the norm layers when using RMSNorm, we get speedups nearly comparable to the fused RMSNorm Triton kernel.
Credit to @wconstab for this idea.

regular rmsnorm:
Screenshot 2024-05-09 at 5 09 17 PM

with the new compile_rmsnorm enabled:
Screenshot 2024-05-09 at 5 24 39 PM

Screenshot 2024-05-09 at 5 09 57 PM

2 - UX - I enabled compile_rmsnorm as its own option for now so users can quickly try whole-model or norm-only compile. If compile is true, then the rmsnorm layers will not be compiled separately (as they will be included in the generic full-model compile) and a minor note is issued in logging.

3 - Using other norms with this option enabled does not appear to add any speedup (but also no errors), so I did not add a check to only compile when the norm is rmsnorm (but I can add that). A minimal sketch of the approach is shown below.
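For reference, a minimal sketch of the idea, assuming a Llama-style model with a `layers` container of transformer blocks (each with `attention_norm` and `ffn_norm`) and a final `norm`, which matches the diff hunks further down:

```
import torch
import torch.nn as nn

def compile_norm_layers(model: nn.Module) -> nn.Module:
    """Compile only the norm modules, leaving the rest of the model eager."""
    for transformer_block in model.layers:
        # dynamic=False: the norms see static shapes in this setup
        transformer_block.attention_norm = torch.compile(
            transformer_block.attention_norm, dynamic=False
        )
        transformer_block.ffn_norm = torch.compile(
            transformer_block.ffn_norm, dynamic=False
        )
    # final norm before the output projection
    model.norm = torch.compile(model.norm, dynamic=False)
    return model
```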

wanchaol and others added 30 commits February 13, 2024 14:10
it's a small thing and can be downloaded from OSS, we can just check it in
This PR adds the following:
1 - via reset_parameters, a full layerwise init for the llama models
under /llama. This uses the total model depth as part of the init (a small
sketch follows this list) via:
self.weight_init_std = 0.02 / (2 * self.num_layers) ** 0.5

2 - The final output ffn (head) is init with sqrt of the dim of the
model itself and a slightly wider cutoff factor of 3.

3 - tangential change - updates run_llama_train.sh with updated MODEL
and MODEL_CONF params to allow direct model control via the sh
script. (There was a MODEL param already, but it was incorrectly being used
in place of MODEL_CONF... we should update this as it's not
intuitive.)

4 - made the debugmodel default to 2 layers as an improved debug check.

5 - added a 1B and 40B for additional testing configs. I can't currently
run 70B on my H100 due to OOM, but can run 40B.
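A hypothetical sketch of the depth-scaled init described in item 1 (the module and attribute names here are illustrative, not the exact torchtrain code):

```
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        self.wo = nn.Linear(dim, dim, bias=False)
        # depth-scaled std, as in item 1 above
        self.weight_init_std = 0.02 / (2 * num_layers) ** 0.5

    def reset_parameters(self):
        # truncated normal around 0 with the depth-scaled std
        # (the exact truncation cutoff is illustrative)
        nn.init.trunc_normal_(self.wo.weight, mean=0.0, std=self.weight_init_std)
```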

Testing:
Verified proper init and training with 7B, 13B and ~40B:

<img width="1085" alt="Screenshot 2024-02-11 at 10 39 12 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/049037ed-63a4-4ab0-bebc-f297857aab72">
This PR is the start of adding perf-related metrics.
1 - This PR adds a function for logging the total number of unique model
params, with an option for counting only trainable params as well (for
future peft/qlora type work).
2 - logs it with comma-formatted output and the model name, e.g.:
<img width="716" alt="Screenshot 2024-02-12 at 4 12 22 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/8eb48870-ab1e-4b70-9159-92864ff6c0e5">

this also helps demystify, for example, the size of our debug model:
<img width="716" alt="Screenshot 2024-02-12 at 4 10 17 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/77475306-54bc-48a6-bf28-9c9a542577fd">

**additional updates** - added in GPU memory tracking. We want to show the
user peak memory stats, as well as monitor and alert on any
CUDA caching-allocator retries, which are a perf hindrance.

Thus, added class GPUMemoryMonitor.
Usage (a rough sketch follows the screenshots below):
1 - instantiate
<img width="1329" alt="Screenshot 2024-02-13 at 9 32 11 AM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/95610386-6fde-47bb-bbdc-bb7c399c5895">

2 - start of training = start_monitoring()
3 - end of training = stop_monitoring()
4 - show results = get_peak_stats_str() and rank0_log it.
<img width="1074" alt="Screenshot 2024-02-13 at 9 12 45 AM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/b6c7c854-7d83-436a-bea9-a67109422381">
ghstack-source-id: d0828f16c06747a5af2586630e5205bf786de1c4
Pull Request resolved: pytorch#57
ghstack-source-id: da7e02b1c2f21a7471ce1dda8bd4d0ee888ad9ac
Pull Request resolved: pytorch#60
ghstack-source-id: e23d5e0b70abc427a13bc8bf195c876c007f4939
Pull Request resolved: pytorch#65
…ix (pytorch#63)

This PR
1 - adds multi-node training support via a multinode_trainer.slurm file.
Verified llama 7b on 20 nodes / 160 A100s.
2 - It also corrects a race condition that can occur in profiling during
larger-scale training, where the check for the trace dir's existence fails
for process 1, but in the interim another process 2 makes the directory,
and then when process 1 tries to make the dir it errors out because the dir
now exists.
This is a simple fix of adding exist_ok=True to both of the makedirs
calls (dump folder, trace folder); see the sketch after the screenshots below.

<img width="1047" alt="Screenshot 2024-02-15 at 10 53 18 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/20378637-4adb-425b-91d8-7fd36289d3b5">
<img width="545" alt="Screenshot 2024-02-15 at 10 55 02 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/28658614-cff6-42b5-ab57-bac578393d5c">
…orch#64)

Small PR:
1 - add configurable init style in model_args - 'use_unique_init' will
use the layer_id in the init stddev denominator, otherwise it uses the original
init style based on total layer count. (Verified both work on 7B llama... not
clear yet if one is better than the other.)

2 - clean up lr and loss display formatting - the lr display was spanning
out to 12+ digits, which isn't that informative, and was wrapped in list
format. This PR rounds it to a max of 8 digits of precision and removes the
[]'s that were around the lr display.
(Note this is purely UI... the full float precision is still used in
actual lr calcs.)

3 - clean up loss display - rounds the loss display to 4 digits
precision to make it more readable and informative.
previously:
<img width="1198" alt="Screenshot 2024-02-16 at 2 33 34 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/77733af0-42db-4fab-a047-fccc7d404278">

Now:
<img width="1063" alt="Screenshot 2024-02-16 at 2 51 53 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/4eb75b98-67f4-41ec-83d8-dd84a0e8b29e">
Summary:

PR implements a unified config manager.

- Command line args and toml file args are now unified.
- Defaults can be loaded from either.

Options like `training.batchsize` will be available as
`config.training.batchsize`, where `config` is a config manager object. A
minimal sketch of the idea follows.
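A minimal sketch of the unified-config idea (this is not the actual JobConfig implementation; the TOML fallback and option names beyond `training.batchsize` are illustrative):

```
import argparse
from types import SimpleNamespace

try:
    import tomllib  # stdlib on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # assumed fallback for older interpreters

def load_config(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--job.config_file", dest="config_file", default=None)
    parser.add_argument("--training.batchsize", dest="batchsize", type=int, default=8)
    args = parser.parse_args(argv)

    # command-line values act as defaults; the TOML file fills in / overrides sections
    sections = {"training": {"batchsize": args.batchsize}}
    if args.config_file:
        with open(args.config_file, "rb") as f:
            for section, values in tomllib.load(f).items():
                sections.setdefault(section, {}).update(values)

    # expose as attribute access: config.training.batchsize
    return SimpleNamespace(**{k: SimpleNamespace(**v) for k, v in sections.items()})

config = load_config([])          # e.g. load_config(["--training.batchsize", "16"])
print(config.training.batchsize)  # 8
```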

Test Plan:
============================= test session starts
============================== platform linux -- Python 3.10.13,
pytest-8.0.1, pluggy-1.4.0 --
/home/gnadathur/local/a/pytorch-env/bin/python cachedir: .pytest_cache
rootdir: /data/users/gnadathur/a/torchtrain
configfile: pyproject.toml
plugins: cov-4.1.0
collecting ... collected 5 items

test/test_job_config.py::TestJobConfig::test_command_line_args PASSED [
20%]
test/test_job_config.py::TestJobConfig::test_command_line_args_with_override
PASSED [ 40%]
test/test_job_config.py::TestJobConfig::test_job_config_file PASSED [
60%]
test/test_job_config.py::TestJobConfig::test_job_config_file_with_override
PASSED [ 80%]
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
PASSED [100%]

---------- coverage: platform linux, python 3.10.13-final-0 ----------
Coverage XML written to file coverage.xml

============================= slowest 20 durations
============================= 0.01s call
test/test_job_config.py::TestJobConfig::test_job_config_file_with_override
0.00s call test/test_job_config.py::TestJobConfig::test_job_config_file
0.00s call
test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s
call
test/test_job_config.py::TestJobConfig::test_command_line_args_with_override
0.00s call
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
0.00s setup
test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s
teardown test/test_job_config.py::TestJobConfig::test_command_line_args
0.00s setup
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
0.00s setup
test/test_job_config.py::TestJobConfig::test_command_line_args_with_override
0.00s teardown
test/test_job_config.py::TestJobConfig::test_command_line_args_with_override
0.00s setup
test/test_job_config.py::TestJobConfig::test_job_config_file_with_override
0.00s setup test/test_job_config.py::TestJobConfig::test_job_config_file
0.00s teardown
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
0.00s teardown
test/test_job_config.py::TestJobConfig::test_job_config_file 0.00s
teardown
test/test_job_config.py::TestJobConfig::test_job_config_file_with_override
============================== 5 passed in 0.10s
===============================


Co-authored-by: gnadathur <gnadathur@devgpu051.cln3.facebook.com>
Add the linter back using a different changed-files plugin which doesn't have permission issues on pytorch/ org.

Also change the linter job to use py 3.10 to match our unit test runner.
For now this literally just runs `NGPU=4 ./run_llama_train.sh` but I
verified at least it catches problems.

As a follow up, we should integrate mgpu test infra from pytorch and set
up actual unit tests to run in this job.

We should probably also keep testing the run_llama_train.sh script, and
add other combinations of 2D parallelism to ensure they all keep
working.

<img width="2120" alt="image"
src="https://github.com/pytorch/torchtrain/assets/4984825/2c235e9a-04ed-4f2d-9915-67de39d78e1c">
mostly testing if new repo works or not
as titled, move the config files to the root folder, which decouples them
from the torchtrain package build and allows easier navigation
…olumnar display to show both, show avg iter & data loading times at end of training (pytorch#87)

This PR adds basic perf timing and display for 'per iter' and 'final
iter average' display. (in part based on Andrew's comment about having
to open the trace to compare iter timing).

1. tracking list is housed in TrainState, but I do not save it as part
of the state dict as I view this as useful but not saveable info.
2. iter times are tracked after dataloading is done each iter and after
optimizer step. The idea is to make this timing expressly the model
training iter (not data loading or post iter other metrics calcs).

3. 'time' is now displayed at each iter along with the usual loss and
lr.

4. at the end of training, assuming more than 3 iters were run, the
average iter time is calculated by ignoring the first three iters
(consider these warmup, especially as the cudaCachingAllocator gets warmed up)
and displayed.
5. based on @tianyu-l feedback: I have added data loading times as well.
I used the same timeit.default_timer() from timeit to be consistent.
(cpu side, so no syncs needed.) A simplified timing sketch follows the
screenshots below.

6 - after fiddling with printf width formatting options, added a beautifully
aligned columnar display for the per-iter updates:
Now: 
<img width="1282" alt="Screenshot 2024-02-26 at 9 39 25 AM"
src="https://github.com/pytorch/torchtrain/assets/46302957/9ee2ea7b-5c28-4d41-ba91-d4176c64fc66">

before: 
<img width="1282" alt="Screenshot 2024-02-26 at 8 39 46 AM"
src="https://github.com/pytorch/torchtrain/assets/46302957/37cbfa20-7f1d-4d94-be94-3505ef4498c0">
Summary:
Follow-up on config unification: options not available in the config file
are picked up from the command-line defaults.

Test Plan:
============================= test session starts
============================== platform linux -- Python 3.10.13,
pytest-8.0.1, pluggy-1.4.0 --
/home/gnadathur/local/a/pytorch-env/bin/python cachedir: .pytest_cache
rootdir: /data/users/gnadathur/a/torchtrain
configfile: pyproject.toml
plugins: cov-4.1.0
collecting ... collected 3 items

test/test_job_config.py::TestJobConfig::test_command_line_args PASSED [
33%] test/test_job_config.py::TestJobConfig::test_job_config_file PASSED
[ 66%]
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
PASSED [100%]

---------- coverage: platform linux, python 3.10.13-final-0 ----------
Coverage XML written to file coverage.xml

============================= slowest 20 durations
============================= 0.00s call
test/test_job_config.py::TestJobConfig::test_job_config_file 0.00s call
test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s
call
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
0.00s setup
test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s
teardown test/test_job_config.py::TestJobConfig::test_command_line_args
0.00s setup test/test_job_config.py::TestJobConfig::test_job_config_file
0.00s setup
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
0.00s teardown
test/test_job_config.py::TestJobConfig::test_job_config_file 0.00s
teardown
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
============================== 3 passed in 0.06s
===============================


---------

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>
ghstack-source-id: 38cbc277e2a177bc0baf35450a661835b97a7f22
Pull Request resolved: pytorch#92
…g on slurm (pytorch#93)

This PR adds the ability to do colored console output in order to
highlight the training data output.
It also adds a check to not use this color formatting on slurm, where it
would otherwise emit fragments like 33= instead of the color (a small sketch
of the guard follows the screenshots below).

Note that I've just added some color to highlight the main training
data. Users that fork/clone can use it to enhance their outputs as
desired.

<img width="1372" alt="Screenshot 2024-02-26 at 10 20 15 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/44849821-1677-40bf-896c-39344cd661d6">


Note that on slurm it remains plain:
<img width="847" alt="Screenshot 2024-02-26 at 10 46 24 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/172eaa58-4f5c-48f5-8ec1-bc349e3e82f2">

if you don't check this, then it would otherwise look like this (this
does not happen with this PR; just showing what it looks like without the check,
and credit to Yifu for noting this would be an issue):
<img width="847" alt="Screenshot 2024-02-26 at 10 39 23 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/4a87fb9a-dd3a-417c-a29e-286ded069358">
this PR updates the GPU metrics to be labeled as GiB - we were
calculating GiB but calling it GB
(credit to @awgu for flagging this - issue
pytorch#94); the conversion is shown below.

function names and member vars in metrics.py have been updated to _gib
instead of _gb for clarity, and the logging output now labels it as GiB:
<img width="851" alt="Screenshot 2024-02-27 at 11 28 23 AM"
src="https://github.com/pytorch/torchtrain/assets/46302957/85eb260a-77e9-4c49-be8a-b1aaa10dc3e2">
ghstack-source-id: 7dc4a80cf9c32f4dca3d00bcef019d256bdf58f7
Pull Request resolved: pytorch#96
Enable libUV for torchtrain.

Test:
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] 
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] *****************************************
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] *****************************************
[rank0]:2024-02-28 09:12:04,581 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank1]:2024-02-28 09:12:04,708 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-28 09:12:05,647 - root - INFO - Building llama
[rank0]:2024-02-28 09:12:05,655 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-02-28 09:12:05,655 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-02-28 09:12:07,299 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-02-28 09:12:07,299 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-02-28 09:12:07,565 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-02-28 09:12:07,566 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-02-28 09:12:07,566 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-02-28 09:12:07,567 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-02-28 09:12:08,769 - root - INFO - Applied FSDP to the model...
[rank0]:2024-02-28 09:12:08,770 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-02-28 09:12:08,770 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240228-0912.
[rank0]:2024-02-28 09:12:08,977 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-02-28 09:12:10,956 - root - INFO - step:  1  loss: 10.9229  iter:  1.9386  data: 0.0368  lr: 0.00026667
[rank0]:2024-02-28 09:12:11,045 - root - INFO - step:  2  loss: 10.8673  iter:  0.0562  data: 0.0316  lr: 0.00053333
[rank0]:2024-02-28 09:12:11,130 - root - INFO - step:  3  loss: 10.7145  iter:  0.0523  data: 0.0322  lr: 0.0008
[rank0]:2024-02-28 09:12:11,219 - root - INFO - step:  4  loss: 10.5038  iter:  0.0559  data: 0.0319  lr: 0.0007
[rank0]:2024-02-28 09:12:11,304 - root - INFO - step:  5  loss: 10.2228  iter:  0.0537  data: 0.031  lr: 0.0006
[rank0]:2024-02-28 09:12:11,391 - root - INFO - step:  6  loss:  9.9677  iter:  0.0562  data: 0.0302  lr: 0.0005
[rank0]:2024-02-28 09:12:11,478 - root - INFO - step:  7  loss:  9.7762  iter:  0.0544  data: 0.0317  lr: 0.0004
[rank0]:2024-02-28 09:12:11,676 - root - INFO - step:  8  loss:  9.4359  iter:  0.0509  data: 0.0322  lr: 0.0003
[rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-02-28 09:12:11,813 - root - INFO - step:  9  loss:  9.2326  iter:  0.1007  data: 0.0321  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-02-28 09:12:12,195 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-02-28 09:12:12,207 - root - INFO - step: 10  loss:  9.1641  iter:  0.0971  data: 0.031  lr: 0.0001
[rank0]:2024-02-28 09:12:12,207 - root - INFO - Average iter time: 0.0670 seconds
[rank0]:2024-02-28 09:12:12,207 - root - INFO - Average data load time: 0.0314 seconds
[rank0]:2024-02-28 09:12:12,208 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

---------

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>
as titled, we don't want to allow the steps == -1 case as it would blow up
the lr scheduler
Add 7b config and adjust options to be more realistic

didn't add this to the train scripts as the default since it's expensive to
init; whoever uses it can adjust it accordingly
ghstack-source-id: f7ee3c867bfcdcae5dbb490982920606191e6f40
Pull Request resolved: pytorch#97
Summary:
Adding a description field, useful for integration tests to describe the
test.

Test Plan:
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] 
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
[rank1]:2024-02-29 17:05:04,269 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-29 17:05:04,510 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-29 17:05:05,327 - root - INFO - Starting job: debug training
[rank0]:2024-02-29 17:05:05,327 - root - INFO - Building llama
[rank0]:2024-02-29 17:05:05,335 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-02-29 17:05:05,335 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-02-29 17:05:06,782 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-02-29 17:05:06,782 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-02-29 17:05:07,347 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-02-29 17:05:07,349 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-02-29 17:05:07,349 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-02-29 17:05:07,349 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-02-29 17:05:08,375 - root - INFO - Applied FSDP to the model...
[rank0]:2024-02-29 17:05:08,376 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-02-29 17:05:08,376 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240229-1705.
[rank0]:2024-02-29 17:05:08,610 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-02-29 17:05:10,570 - root - INFO - step:  1  loss: 10.9183  iter:  1.9258  data: 0.0303  lr: 0.00026667
[rank0]:2024-02-29 17:05:10,653 - root - INFO - step:  2  loss: 10.8347  iter:  0.0487  data: 0.0336  lr: 0.00053333
[rank0]:2024-02-29 17:05:10,733 - root - INFO - step:  3  loss: 10.6861  iter:   0.045  data: 0.0334  lr: 0.0008
[rank0]:2024-02-29 17:05:10,812 - root - INFO - step:  4  loss: 10.4672  iter:  0.0453  data: 0.0336  lr: 0.0007
[rank0]:2024-02-29 17:05:10,893 - root - INFO - step:  5  loss: 10.2154  iter:  0.0466  data: 0.033  lr: 0.0006
[rank0]:2024-02-29 17:05:10,975 - root - INFO - step:  6  loss:  9.9573  iter:  0.0496  data: 0.0314  lr: 0.0005
[rank0]:2024-02-29 17:05:11,056 - root - INFO - step:  7  loss:  9.7627  iter:  0.0486  data: 0.0321  lr: 0.0004
[rank0]:2024-02-29 17:05:11,201 - root - INFO - step:  8  loss:   9.437  iter:  0.0457  data: 0.0333  lr: 0.0003
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-02-29 17:05:11,317 - root - INFO - step:  9  loss:  9.2446  iter:  0.0794  data: 0.0324  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-02-29 17:05:11,748 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-02-29 17:05:11,762 - root - INFO - step: 10  loss:  9.1772  iter:  0.0893  data: 0.0324  lr: 0.0001
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Average iter time: 0.0578 seconds
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Average data load time: 0.0326 seconds
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```


Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>
ghstack-source-id: 1c5bf790d7473f6a24124051fcfa1fd2585a56f9
Pull Request resolved: pytorch#105
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] 
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
[rank0]:2024-03-04 17:01:28,834 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank1]:2024-03-04 17:01:28,857 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-03-04 17:01:29,712 - root - INFO - Starting job: debug training
[rank0]:2024-03-04 17:01:29,712 - root - INFO - Building llama
[rank0]:2024-03-04 17:01:29,719 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-04 17:01:29,719 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-03-04 17:01:31,187 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-03-04 17:01:31,188 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-03-04 17:01:31,346 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-03-04 17:01:31,346 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-04 17:01:31,347 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-04 17:01:31,347 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-03-04 17:01:32,502 - root - INFO - Applied FSDP to the model...
[rank0]:2024-03-04 17:01:32,503 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-03-04 17:01:32,504 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240304-1701.
[rank0]:2024-03-04 17:01:32,901 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-03-04 17:01:34,806 - root - INFO - step:  1  loss: 10.8424  iter:  1.8688  data: 0.0316  lr: 0.00026667
[rank0]:2024-03-04 17:01:34,891 - root - INFO - step:  2  loss: 10.7581  iter:  0.0476  data: 0.0357  lr: 0.00053333
[rank0]:2024-03-04 17:01:34,970 - root - INFO - step:  3  loss: 10.6239  iter:   0.045  data: 0.0333  lr: 0.0008
[rank0]:2024-03-04 17:01:35,048 - root - INFO - step:  4  loss: 10.4163  iter:  0.0455  data: 0.0323  lr: 0.0007
[rank0]:2024-03-04 17:01:35,127 - root - INFO - step:  5  loss: 10.1529  iter:  0.0459  data: 0.032  lr: 0.0006
[rank0]:2024-03-04 17:01:35,206 - root - INFO - step:  6  loss:  9.8899  iter:  0.0468  data: 0.0311  lr: 0.0005
[rank0]:2024-03-04 17:01:35,284 - root - INFO - step:  7  loss:  9.7204  iter:  0.0461  data: 0.0312  lr: 0.0004
[rank0]:2024-03-04 17:01:35,425 - root - INFO - step:  8  loss:  9.3757  iter:  0.0457  data: 0.0319  lr: 0.0003
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-03-04 17:01:35,537 - root - INFO - step:  9  loss:  9.1883  iter:  0.0762  data: 0.0318  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-03-04 17:01:35,958 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-03-04 17:01:35,971 - root - INFO - step: 10  loss:  9.1212  iter:  0.0808  data: 0.0319  lr: 0.0001
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Average iter time: 0.0553 seconds
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Average data load time: 0.0317 seconds
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>
This PR enables meta_init functionality to avoid OOM'ing on CPU for
larger models.
The core functionality is in meta_init.py, with a few changes in
parallelization and train.py.
Key items:
1 - this is largely the same as the earlier PR I had for meta_init, but
I made a new one b/c that was faster than reworking it with all the interim
changes.
2 - to address feedback in previous PR:
a - why do we need meta_init.py, can't we just do:
~~~
with torch.device("meta"):
    model = Model.from_args(...)
~~~
Unfortunately this does not work b/c the rope embeddings are treated
differently (as a buffer), and thus the simple lambda call from param_init_fn
in FSDP (lambda module: module.to_device('cuda')) will not invoke on or
move the rope embeddings, and the model will fail on the first forward.
This issue relates to the nn.embeddings not being moved, and to the
device being referenced in the forward pass of the current rope class.
Have opened pytorch#110 to track
this and investigate, while not holding up landing the meta init that is
working.

b - per earlier feedback - meta init is now 'not optional' but simply
the default. This should ensure all models leverage it and ensure we
aren't missing things for future meta_init aspects.

3 - misc change - I switched the model_params logging to just do the normal
all-params count instead of 'unique params', b/c the latter does not mesh with
what people perceive model size as.

Testing:
tested both debugmodel and 26B model with and without meta init to
confirm same loss curves.
Note for future reference - if you get a bad init (meta init failure),
you will simply not train (the loss is the same every iter).
If you fail to call reset params after FSDP, then you will train (b/c we
default to torch.randn_like), but your starting loss will be 5x+ higher
(telling you that you have not properly init'ed the model). A sketch of
the meta-init flow follows below.
Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>
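A hedged sketch of the meta-init flow described above (the model class here is a toy stand-in; `reset_parameters` mirrors the PR description):

```
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self, dim=256, vocab=32000):
        super().__init__()
        self.tok_embeddings = nn.Embedding(vocab, dim)
        self.output = nn.Linear(dim, vocab, bias=False)

    def reset_parameters(self):
        nn.init.normal_(self.tok_embeddings.weight, std=0.02)
        nn.init.normal_(self.output.weight, std=0.02)

# 1) build on the meta device: no real memory is allocated, so large models don't OOM the CPU
with torch.device("meta"):
    model = TinyModel()

# 2) materialize on the target device (buffers included), then re-init for real;
#    skipping the re-init is what produces the "flat loss / much higher loss" symptoms
model = model.to_empty(device="cuda" if torch.cuda.is_available() else "cpu")
model.reset_parameters()
```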
wanchaol and others added 13 commits April 30, 2024 15:22
ghstack-source-id: 932e7cce828a15c788b34f07c264e119068777fe
Pull Request resolved: pytorch#287
Runs the integration test hourly and updates signal badge. Tested on
existing integration test. I will update the badge with periodic test
signal once workflow has landed in this PR.
<img width="516" alt="Screenshot 2024-04-30 at 6 12 00 PM"
src="https://github.com/pytorch/torchtitan/assets/1779702/8adaab3d-df18-483d-a39f-5af316b7edbc">
ghstack-source-id: 9daa99020c76fdfe429b6a9ee6d44fd1dd319fc3
Pull Request resolved: pytorch#280
Adds a new command, ./create_seed_checkpoint.sh, which largely
reuses code inside train.py to create the model and then save its
initial state as a step-0 checkpoint for use with the meta-initialization
loading flow.

ghstack-source-id: 3e1aa9eab847c1f1341f22772ca8ae3688883454
Pull Request resolved: pytorch#172
ghstack-source-id: fa9aaf337b5489d88945f15b65a8ba8cc544ded6
Pull Request resolved: pytorch#295
This appears to be a holdover from a previous way the initialization
worked.

freqs_cis should already be on the GPU device after initialization.

ghstack-source-id: 7159320d4ecfb436bd2193277a88c04d136e9ad0
Pull Request resolved: pytorch#298
…int (pytorch#293)

Summary:
The profiler currently maintains a counter locally and that counter is
not synchronized with the checkpointed train step. This PR fixes the
issue.
as titled. This could make 1-D and 2-D work with the latest main
build. Thanks @bdhirsh for all the fixes!

We should figure out, as a follow-up, why dynamic shapes get turned on
ghstack-source-id: bbedad3819ab9ef90b233209c34dd1dbc846b06a
Pull Request resolved: pytorch#299
Summary:
This PR implements 2 different async checkpoint approaches. The first uses
DCP.async_save; the other uses pinned memory + a separate process
to avoid GIL issues. A rough sketch of the DCP path follows below.

ghstack-source-id: 87fb6c28d7bc3e514c0bee7646be5188f1f66bbd
Pull Request resolved: pytorch#313
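A rough sketch of the DCP.async_save path (assuming a recent PyTorch where `torch.distributed.checkpoint.async_save` is available and a process group is already initialized; names are illustrative):

```
import torch.distributed.checkpoint as dcp

def async_checkpoint(model, optimizer, step, folder="checkpoints"):
    state_dict = {"model": model.state_dict(), "optim": optimizer.state_dict()}
    # returns a future; training can continue while the checkpoint is written
    future = dcp.async_save(state_dict, checkpoint_id=f"{folder}/step-{step}")
    return future  # wait on future.result() before issuing the next save
```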
as titled, we can directly specify the rowwise-parallel embedding output
layouts to be sharded on the sequence dim, so that we don't need the
first layer's prepare-input.

Switching to output_layouts = Shard(1) also triggers reduce_scatter
instead of allreduce for the embedding layer, which could give some small
perf wins. A sketch of the plan follows.
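A sketch of the resulting TP plan for the embedding (assuming the DTensor tensor-parallel APIs; the module name `tok_embeddings` matches the Llama model):

```
from torch.distributed._tensor import Replicate, Shard  # torch.distributed.tensor in newer releases
from torch.distributed.tensor.parallel import RowwiseParallel, parallelize_module

def shard_embedding_on_sequence(model, tp_mesh):
    """Rowwise-parallel token embedding whose output is sharded on the sequence dim."""
    return parallelize_module(
        model,
        tp_mesh,
        {
            "tok_embeddings": RowwiseParallel(
                input_layouts=Replicate(),
                # Shard(1): shard activations on the sequence dim, so the TP collective
                # becomes a reduce_scatter rather than an all_reduce
                output_layouts=Shard(1),
            ),
        },
    )
```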
@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) May 10, 2024
drisspg (Contributor) commented May 10, 2024

Is 2 saying that in order to have "full" compile you need to set both compile=true and compile_rmsnorm = true

@drisspg drisspg closed this May 10, 2024
@drisspg drisspg reopened this May 10, 2024
@@ -229,6 +249,9 @@ def parallelize_llama(model, world_mesh, parallel_dims, job_config: JobConfig):
reshard_after_forward=reshard_after_forward,
)
model.layers[layer_id] = transformer_block
if enable_compile_rmsnorm:
model.norm = torch.compile(model.norm, dynamic=False)
Contributor:
fullgraph?

@tianyu-l (Contributor) left a comment:

  1. I think we should not couple compile-related code with the parallelize_llama code.
    Maybe we should create a file called compile.py and put everything there. This was not viable when we needed to do per-TransformerBlock compile due to some restriction (I vaguely remember we needed to compile first and then FSDP-wrap?), but now there's no such restriction since we are compiling the whole model.

  2. Would you please also show the comparison between the triton kernel and compiled rmsnorm? That would give us a better idea of how big the improvement is.

  3. Let's follow up on whether it works with TP, FSDP+TP, PP, 3D, SAC, etc.

@@ -214,12 +214,32 @@ def parallelize_llama(model, world_mesh, parallel_dims, job_config: JobConfig):
param_dtype=torch.bfloat16, reduce_dtype=torch.float32
)
ac_mode = job_config.activation_checkpoint.mode
# specifically compile just the RMSNorm layers
enable_compile_rmsnorm = job_config.training.compile_rmsnorm
Contributor:
just calling it compile_rmsnorm is good enough I think

Contributor:
I think it's better to write compile_rmsnorm = job_config.model.norm_type == "rmsnorm" and job_config.training.compile_rmsnorm and not job_config.training.compile

@@ -214,12 +214,32 @@ def parallelize_llama(model, world_mesh, parallel_dims, job_config: JobConfig):
param_dtype=torch.bfloat16, reduce_dtype=torch.float32
)
ac_mode = job_config.activation_checkpoint.mode
# specifically compile just the RMSNorm layers
enable_compile_rmsnorm = job_config.training.compile_rmsnorm
if job_config.training.compile and enable_compile_rmsnorm:
Contributor:
IMO there's no need to print this message. It's still compiled, just as part of the whole model.

logger.info(
"Entire model is compiled with torch.compile, disabling RMSNorm compilation"
)
enable_compile_rmsnorm = False
Contributor:
ditto no need to set it here

self.parser.add_argument(
"--training.compile_rmsnorm",
action="store_true",
help="Whether to compile the norm layers",
Contributor:
If it's called compile_rmsnorm, we should only compile if the norm_type = "rmsnorm". The help message needs to reflect this.

@@ -39,6 +39,7 @@ tensor_parallel_degree = 1
pipeline_parallel_degree = 1
fp8_linear = ""
compile = false
compile_rmsnorm = true # compile setting above should be false to use this
Contributor:
I think we shouldn't enable it everywhere before we understand when it works and when it doesn't, e.g. 2D, selective AC, etc.

@lessw2020 (Contributor, Author) replied:
> Is 2 saying that in order to have "full" compile you need to set both compile=true and compile_rmsnorm = true

I updated the text to be more specific, but no - if compile = true in the config, then you get full compile including the rmsnorm layers.

@@ -212,6 +212,11 @@ def __init__(self):
action="store_true",
help="Whether to compile the model",
)
self.parser.add_argument(
Contributor:
IMO we should not add another option. We could possibly just reuse the fused_rmsnorm field in norm_type, instead of adding a separate option here.

@@ -229,6 +249,9 @@ def parallelize_llama(model, world_mesh, parallel_dims, job_config: JobConfig):
reshard_after_forward=reshard_after_forward,
)
model.layers[layer_id] = transformer_block
if enable_compile_rmsnorm:
model.norm = torch.compile(model.norm, dynamic=False)
Contributor:
I think we should also check the fqn; iirc torch.compile by default prepends to the fqn of the compiled module. We should try to get rid of that in some way, otherwise we would not be able to load Llama3 pretrained weights.

@lessw2020 (Contributor, Author) commented May 10, 2024:
you are right - it inserts into the fqn (_orig_mod):

[rank0]:Named parameters of the module:
[rank0]:Parameter name: attention.wq.weight, Shape: torch.Size([256, 256])
[rank0]:Parameter name: attention.wk.weight, Shape: torch.Size([256, 256])
[rank0]:Parameter name: attention.wv.weight, Shape: torch.Size([256, 256])
[rank0]:Parameter name: attention.wo.weight, Shape: torch.Size([256, 256])
[rank0]:Parameter name: feed_forward.w1.weight, Shape: torch.Size([768, 256])
[rank0]:Parameter name: feed_forward.w2.weight, Shape: torch.Size([256, 768])
[rank0]:Parameter name: feed_forward.w3.weight, Shape: torch.Size([768, 256])
[rank0]:Parameter name: attention_norm._orig_mod.weight, Shape: torch.Size([256])
[rank0]:Parameter name: ffn_norm._orig_mod.weight, Shape: torch.Size([256])

@lessw2020 (Author):
will make a cleanup function

Contributor:
IMO we should probably not make a state_dict hook, if that's what you are going to do.

Instead we should compile the norm function directly

Contributor:
distributed_state_dict already has this covered. It will remove the torch.compile-added prefix from the FQNs. Let me know if you do see some issues.

Contributor:
An alternative, if you are still seeing this error, is to call module.compile instead of torch.compile
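A hedged illustration of the difference (using `nn.LayerNorm` as a stand-in for the norm module): `torch.compile(module)` returns an `OptimizedModule` whose parameters are nested under `_orig_mod.`, while `module.compile()` compiles in place and keeps the original FQNs.

```
import torch
import torch.nn as nn

norm = nn.LayerNorm(256)                    # stand-in for the RMSNorm module
wrapped = torch.compile(norm, dynamic=False)
print([name for name, _ in wrapped.named_parameters()])
# ['_orig_mod.weight', '_orig_mod.bias']

norm2 = nn.LayerNorm(256)
norm2.compile(dynamic=False)                # in-place: forward is compiled, FQNs unchanged
print([name for name, _ in norm2.named_parameters()])
# ['weight', 'bias']
```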

Comment on lines +236 to +241
transformer_block.attention_norm = torch.compile(
transformer_block.attention_norm, dynamic=False
)
transformer_block.ffn_norm = torch.compile(
transformer_block.ffn_norm, dynamic=False
)
@fegin (Contributor) commented May 11, 2024:
This is actually not correct. If AC is enabled, you will see 2 extra submodules in the AC-wrapped module, attention_norm and ffn_norm, and these 2 submodules will never be used in forward but only in the state_dict().

@tianyu-l (Contributor):

Closing as we removed the feature in #535.

@tianyu-l tianyu-l closed this Aug 21, 2024
Labels: CLA Signed (managed by the Meta Open Source bot)