
Add balanced option for auto device map creation #534

Merged · 7 commits · Jul 20, 2022

Conversation

@sgugger (Collaborator) commented Jul 19, 2022

This PR adds a new option to device_map creation that balances the model across GPUs when several are available and their combined memory is bigger than the model size. This in turn lets users handle a batch size greater than 1.

Since there is no downside, this balanced allocation becomes the new "auto" behavior. Users can still get the old behavior with the "sequential" option, and can request "balanced" explicitly (in case "auto" comes to mean something different in the future). There is also "balanced_low_0", which minimizes the weights placed on GPU 0, useful when that GPU is also used for generation (cc @stas00).
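For illustration, a minimal sketch of how these options are meant to be used (BLOOM and the checkpoint path are placeholders, and this assumes load_checkpoint_and_dispatch forwards the device_map strings to the new logic):

from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

# instantiate the model on the meta device so no weights are materialized yet
config = AutoConfig.from_pretrained("bigscience/bloom")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# "auto" / "balanced" spread the weights evenly over all available GPUs,
# "balanced_low_0" keeps GPU 0 as empty as possible (useful for generation),
# "sequential" restores the previous fill-GPU-0-first behavior.
model = load_checkpoint_and_dispatch(
    model, "/path/to/bloom-checkpoint", device_map="balanced"
)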

@younesbelkada This was something you requested so cc-ing you here.

TODO:

@sgugger requested a review from @muellerzr on July 19, 2022 at 15:06
@HuggingFaceDocBuilderDev commented Jul 19, 2022

The documentation is not available anymore as the PR was closed or merged.

@muellerzr (Collaborator) left a comment

Looks great! The new docs have a separate section for the big modeling API, so I'll rebase after this and make sure to include it. Does the main big modeling tutorial need to be updated with this new option?

src/accelerate/utils/modeling.py
@sgugger (Collaborator, Author) commented Jul 19, 2022

Yes, I will update it once the other PR is merged :-)

max_memory: Optional[Dict[Union[int, str], Union[int, str]]] = None,
no_split_module_classes: Optional[List[str]] = None,
dtype: Optional[Union[str, torch.dtype]] = None,
low_zero: bool = False,
@stas00 (Contributor) commented Jul 19, 2022
If it resonates, I would rename it to something more self-documenting, e.g. minimize_gpu0_memory or minimize_first_gpu_memory.

@stas00 (Contributor) left a comment

Thank you for working on automating this!

FWIW, I had to completely free GPU 0 of any weights to fit a large batch size with BLOOM.

My current logic for figuring out the optimal memory map is this:

import torch
from transformers import AutoConfig


def get_max_memory_per_gpu_dict(dtype, model_name):
    """ try to generate the memory map based on what we know about the model and the available hardware """

    # figure out the memory map - the minimum per gpu required to load the model
    n_gpus = torch.cuda.device_count()

    if model_name == "bigscience/bloom" and n_gpus == 8 and torch.cuda.get_device_properties(0).total_memory > 79*2**30:
        # hand-crafted optimized memory map for the 8x80GB setup over BLOOM
        # this works with bs=48
        return {0: '0GIB', 1: '51GIB', 2: '51GIB', 3: '51GIB', 4: '51GIB', 5: '51GIB', 6: '51GIB', 7: '51GIB'}

    try:
        # model_params calculation, as we don't have a model yet to do:
        # model_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())

        config = AutoConfig.from_pretrained(model_name)
        h = config.n_embed
        l = config.n_layer
        v = config.vocab_size
        # from https://github.com/bigscience-workshop/bigscience/tree/a3e451498ee8189d2a9dd47be19aa89b0e16cd89/math#model-sizing
        model_params = l*(12*h**2 + 13*h) + v*h + 4*h
    except Exception:
        print(f"The model {model_name} has a broken config file. Please notify the owner")
        raise

    bytes_per_param = torch.finfo(dtype).bits / 8
    param_memory_total_in_bytes = model_params * bytes_per_param
    # add 5% since weight sizes aren't all the same and some GPU may need more memory
    param_memory_per_gpu_in_bytes = int(param_memory_total_in_bytes / n_gpus * 1.05)
    print(f"Estimating {param_memory_per_gpu_in_bytes/2**30:0.2f}GB per gpu for weights")

    # check the real available memory
    # load cuda kernels first and only measure the real free memory after loading (shorter by ~2GB)
    torch.ones(1).cuda()
    max_memory_per_gpu_in_bytes = torch.cuda.mem_get_info(0)[0]
    if max_memory_per_gpu_in_bytes < param_memory_per_gpu_in_bytes:
        raise ValueError(f"Unable to generate the memory map automatically as the needed estimated memory per gpu ({param_memory_per_gpu_in_bytes/2**30:0.2f}GB) is bigger than the available per gpu memory ({max_memory_per_gpu_in_bytes/2**30:0.2f}GB)")

    return {i: param_memory_per_gpu_in_bytes for i in range(n_gpus)}

This leads to an equal allocation across all GPUs. Ideally it should be reworked to leave GPU 0 as close to empty as possible: first spread the weights across all but the first GPU, while leaving enough memory for activation calculations and temporaries, and only then assign any remaining weights to the first GPU.

But the problem is that the weights have to follow the model's sequence, and the first weight is typically the word embedding, which in the case of BLOOM is 7.2GB in bf16. So it's not a good allocation for GPU 0.

As you can see, for now I'm just special-casing BLOOM on 8x80GB with a manually crafted map.
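For completeness, a minimal sketch of plugging such a hand-crafted map into Accelerate (the checkpoint path is a placeholder, and BloomBlock / bfloat16 are assumptions matching the BLOOM setup above):

import torch
from accelerate import infer_auto_device_map, init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

# the hand-crafted budget from above: GPU 0 empty, 51GiB on each of the other 7 GPUs
max_memory = {0: "0GIB", **{i: "51GIB" for i in range(1, 8)}}

with init_empty_weights():
    model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained("bigscience/bloom"))

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["BloomBlock"],  # keep each transformer block on a single device
    dtype=torch.bfloat16,
)
model = load_checkpoint_and_dispatch(
    model, "/path/to/bloom-checkpoint", device_map=device_map, dtype=torch.bfloat16
)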

@sgugger
Copy link
Collaborator Author

sgugger commented Jul 20, 2022

This leads to an equal allocation across all GPUs. Ideally it should be reworked to leave GPU 0 as close to empty as possible: first spread the weights across all but the first GPU, while leaving enough memory for activation calculations and temporaries, and only then assign any remaining weights to the first GPU.

That is exactly what the option "balanced_low_0" will do.
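For reference, roughly what that option does under the hood, sketched against the helper this PR adds (module path taken from the review thread above; BLOOM is just an illustrative model):

from accelerate import infer_auto_device_map, init_empty_weights
from accelerate.utils.modeling import get_balanced_memory
from transformers import AutoConfig, AutoModelForCausalLM

with init_empty_weights():
    model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained("bigscience/bloom"))

# per-GPU budgets that keep GPU 0 as close to empty as possible,
# then a device map computed under those budgets
max_memory = get_balanced_memory(model, low_zero=True)
device_map = infer_auto_device_map(model, max_memory=max_memory)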

@sgugger force-pushed the balanced_device_map branch from 6a7444b to ae1cf35 on July 20, 2022 at 08:24
@muellerzr (Collaborator) left a comment

Thanks! Looks great!

Only left one doc nit

docs/source/big_modeling.mdx
Co-authored-by: Zachary Mueller <muellerzr@gmail.com>