Add balanced option for auto device map creation #534
Conversation
Looks great! The new docs have a separate section for the big modeling API so I'll rebase after this and make sure to include it. Does the main big modeling tutorial need to be updated with this new inclusion?
Yes, I will update it once the other PR is merged :-)
max_memory: Optional[Dict[Union[int, str], Union[int, str]]] = None,
no_split_module_classes: Optional[List[str]] = None,
dtype: Optional[Union[str, torch.dtype]] = None,
low_zero: bool = False,
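For context on the new low_zero argument in this signature, here is a minimal sketch of the intended semantics. The helper name and the split logic are illustrative assumptions, not the actual code of this PR:

# Hypothetical sketch only: how a low_zero flag could shape per-GPU weight
# budgets when computing a balanced max_memory dict.
import torch

def sketch_balanced_max_memory(model_size_bytes: int, low_zero: bool = False):
    n_gpus = torch.cuda.device_count()
    if low_zero and n_gpus > 1:
        # Keep GPU 0 as empty as possible and spread the weights over the others,
        # so GPU 0 has room left for generation buffers and activations.
        per_gpu = model_size_bytes // (n_gpus - 1)
        return {0: 0, **{i: per_gpu for i in range(1, n_gpus)}}
    # Otherwise balance the weights evenly across all GPUs.
    per_gpu = model_size_bytes // n_gpus
    return {i: per_gpu for i in range(n_gpus)}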
If it resonates, I would rename it to something more self-documenting, e.g. minimize_gpu0_memory or minimize_first_gpu_memory.
Thank you for working on automating this.
FWIW, I had to completely free GPU 0 of any weights to fit a large batch size with BLOOM.
My current logic for figuring out the most optimal memory map is this:
import torch
from transformers import AutoConfig

def get_max_memory_per_gpu_dict(dtype, model_name):
    """Try to generate the memory map based on what we know about the model and the available hardware."""
    # figure out the memory map - the minimum per gpu required to load the model
    n_gpus = torch.cuda.device_count()

    if model_name == "bigscience/bloom" and n_gpus == 8 and torch.cuda.get_device_properties(0).total_memory > 79 * 2**30:
        # hand-crafted optimized memory map for an 8x80GB setup with BLOOM
        # this works with bs=48
        return {0: "0GIB", 1: "51GIB", 2: "51GIB", 3: "51GIB", 4: "51GIB", 5: "51GIB", 6: "51GIB", 7: "51GIB"}

    try:
        # model_params calculation, as we don't have a model yet to do:
        # model_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())
        config = AutoConfig.from_pretrained(model_name)
        h = config.n_embed
        l = config.n_layer
        v = config.vocab_size
        # from https://github.com/bigscience-workshop/bigscience/tree/a3e451498ee8189d2a9dd47be19aa89b0e16cd89/math#model-sizing
        model_params = l * (12 * h**2 + 13 * h) + v * h + 4 * h
    except Exception:
        print(f"The model {model_name} has a broken config file. Please notify the owner")
        raise

    bytes_per_param = torch.finfo(dtype).bits / 8
    param_memory_total_in_bytes = model_params * bytes_per_param
    # add 5% since weight sizes aren't the same and some GPUs may need more memory
    param_memory_per_gpu_in_bytes = int(param_memory_total_in_bytes / n_gpus * 1.05)
    print(f"Estimating {param_memory_per_gpu_in_bytes / 2**30:0.2f}GB per gpu for weights")

    # check the real available memory
    # load cuda kernels first and only measure the real free memory after loading (shorter by ~2GB)
    torch.ones(1).cuda()
    max_memory_per_gpu_in_bytes = torch.cuda.mem_get_info(0)[0]
    if max_memory_per_gpu_in_bytes < param_memory_per_gpu_in_bytes:
        raise ValueError(
            f"Unable to generate the memory map automatically as the needed estimated memory per gpu "
            f"({param_memory_per_gpu_in_bytes / 2**30:0.2f}GB) is bigger than the available per gpu memory "
            f"({max_memory_per_gpu_in_bytes / 2**30:0.2f}GB)"
        )

    return {i: param_memory_per_gpu_in_bytes for i in range(n_gpus)}
This leads to equal allocation across all GPUs. Ideally it should be reworked to leave GPU 0 as close to empty as possible: first spread the weights across all but the first GPU, while leaving enough memory for activation calculation and temporaries, and only then assign any remaining weights to the first GPU.
The problem is that we want the weights to follow the model's sequence, and the first weight is typically the word embedding, which in the case of BLOOM is 7.2GB in bf16. So that is not a good allocation for GPU 0.
As you can see, for now I'm just special-casing bloom + 8x80 with a manually crafted map.
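A rough sketch of the allocation strategy described above (fill GPUs 1..N-1 first, leaving headroom for activations, and give GPU 0 only the spillover); the helper name and the activation reserve argument are placeholders, not measured values or actual library code:

import torch

def sketch_low_gpu0_memory_map(param_bytes_total: int, activation_reserve_bytes: int):
    """Illustrative only: fill GPUs 1..N-1 first, then put any remaining weights on GPU 0."""
    n_gpus = torch.cuda.device_count()
    free_per_gpu = [torch.cuda.mem_get_info(i)[0] for i in range(n_gpus)]
    # Budget on GPUs 1..N-1: free memory minus room for activations and temporaries.
    budgets = {i: max(free_per_gpu[i] - activation_reserve_bytes, 0) for i in range(1, n_gpus)}
    # Only the weights that do not fit elsewhere land on GPU 0.
    budgets[0] = max(param_bytes_total - sum(budgets.values()), 0)
    return budgets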
That is exactly what the new balanced_low_0 option does.
Thanks! Looks great!
I only left one doc nit.
Co-authored-by: Zachary Mueller <muellerzr@gmail.com>
This PR adds a new option to device_map creation: a mode that balances the GPUs when several are available and their combined memory is bigger than the model size. This permits users to handle a batch size greater than 1. Since there is no downside, this balanced way becomes the new "auto" behavior. The user can still get the old behavior with the "sequential" option, and can also explicitly request "balanced" (in case "auto" becomes something different in the future). There is also "balanced_low_0" for when we want to minimize the weights on GPU 0, e.g. if it is used for generation (cc @stas00).
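A usage sketch of how the new options can be consumed; the helper name get_balanced_memory and the exact import paths are assumptions inferred from the signature shown earlier, so treat this as illustrative rather than the PR's final API:

import torch
from accelerate import infer_auto_device_map, init_empty_weights
from accelerate.utils import get_balanced_memory  # assumed name of the new helper
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model skeleton without allocating real weights.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained("bigscience/bloom"))

# low_zero=True corresponds to "balanced_low_0": keep GPU 0 as light as possible
# so it has room for the tensors produced during generation.
max_memory = get_balanced_memory(model, dtype=torch.bfloat16, low_zero=True)
device_map = infer_auto_device_map(model, max_memory=max_memory, dtype=torch.bfloat16)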
@younesbelkada This was something you requested so cc-ing you here.
TODO: