Add balanced option for auto device map creation #534
Conversation
Looks great! The new docs have a separate section for the big modeling API so I'll rebase after this and make sure to include it. Does the main big modeling tutorial need to be updated with this new inclusion?
Yes, I will update it once the other PR is merged :-)
max_memory: Optional[Dict[Union[int, str], Union[int, str]]] = None,
no_split_module_classes: Optional[List[str]] = None,
dtype: Optional[Union[str, torch.dtype]] = None,
low_zero: bool = False,
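For context on the new low_zero argument in this signature, here is a minimal sketch of the intended semantics. The helper name and the split logic are illustrative assumptions, not the actual code of this PR:

# Hypothetical sketch only: how a low_zero flag could shape per-GPU weight
# budgets when computing a balanced max_memory dict.
import torch

def sketch_balanced_max_memory(model_size_bytes: int, low_zero: bool = False):
    n_gpus = torch.cuda.device_count()
    if low_zero and n_gpus > 1:
        # Keep GPU 0 as empty as possible and spread the weights over the others,
        # so GPU 0 has room left for generation buffers and activations.
        per_gpu = model_size_bytes // (n_gpus - 1)
        return {0: 0, **{i: per_gpu for i in range(1, n_gpus)}}
    # Otherwise balance the weights evenly across all GPUs.
    per_gpu = model_size_bytes // n_gpus
    return {i: per_gpu for i in range(n_gpus)}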
If it resonates, I would rename it to something more self-documenting, e.g. minimize_gpu0_memory or minimize_first_gpu_memory.
Thank you for working on automating this.
FWIW, I had to completely free GPU 0 of any weights to fit a large batch size with BLOOM.
My current logic for figuring out the most optimal memory map is this:
import torch
from transformers import AutoConfig

def get_max_memory_per_gpu_dict(dtype, model_name):
    """Try to generate the memory map based on what we know about the model and the available hardware."""
    # figure out the memory map - the minimum per gpu required to load the model
    n_gpus = torch.cuda.device_count()

    if model_name == "bigscience/bloom" and n_gpus == 8 and torch.cuda.get_device_properties(0).total_memory > 79 * 2**30:
        # hand-crafted optimized memory map for an 8x80GB setup with BLOOM
        # this works with bs=48
        return {0: "0GIB", 1: "51GIB", 2: "51GIB", 3: "51GIB", 4: "51GIB", 5: "51GIB", 6: "51GIB", 7: "51GIB"}

    try:
        # model_params calculation, as we don't have a model yet to do:
        # model_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())
        config = AutoConfig.from_pretrained(model_name)
        h = config.n_embed
        l = config.n_layer
        v = config.vocab_size
        # from https://github.com/bigscience-workshop/bigscience/tree/a3e451498ee8189d2a9dd47be19aa89b0e16cd89/math#model-sizing
        model_params = l * (12 * h**2 + 13 * h) + v * h + 4 * h
    except Exception:
        print(f"The model {model_name} has a broken config file. Please notify the owner")
        raise

    bytes_per_param = torch.finfo(dtype).bits / 8
    param_memory_total_in_bytes = model_params * bytes_per_param
    # add 5% since weight sizes aren't the same and some GPUs may need more memory
    param_memory_per_gpu_in_bytes = int(param_memory_total_in_bytes / n_gpus * 1.05)
    print(f"Estimating {param_memory_per_gpu_in_bytes / 2**30:0.2f}GB per gpu for weights")

    # check the real available memory
    # load cuda kernels first and only measure the real free memory after loading (shorter by ~2GB)
    torch.ones(1).cuda()
    max_memory_per_gpu_in_bytes = torch.cuda.mem_get_info(0)[0]
    if max_memory_per_gpu_in_bytes < param_memory_per_gpu_in_bytes:
        raise ValueError(
            f"Unable to generate the memory map automatically as the needed estimated memory per gpu "
            f"({param_memory_per_gpu_in_bytes / 2**30:0.2f}GB) is bigger than the available per gpu memory "
            f"({max_memory_per_gpu_in_bytes / 2**30:0.2f}GB)"
        )

    return {i: param_memory_per_gpu_in_bytes for i in range(n_gpus)}
This leads to equal allocation across all GPUs. Ideally it should be reworked to leave GPU 0 as close to empty as possible: first spread the weights across all but the first GPU, while leaving enough memory for activation calculation and temporaries, and only then assign any remaining weights to the first GPU.
The problem is that we want the weights to follow the model's sequence, and the first weight is typically the word embedding, which in the case of BLOOM is 7.2GB in bf16. So that is not a good allocation for GPU 0.
As you can see, for now I'm just special-casing bloom + 8x80 with a manually crafted map.
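A rough sketch of the allocation strategy described above (fill GPUs 1..N-1 first, leaving headroom for activations, and give GPU 0 only the spillover); the helper name and the activation reserve argument are placeholders, not measured values or actual library code:

import torch

def sketch_low_gpu0_memory_map(param_bytes_total: int, activation_reserve_bytes: int):
    """Illustrative only: fill GPUs 1..N-1 first, then put any remaining weights on GPU 0."""
    n_gpus = torch.cuda.device_count()
    free_per_gpu = [torch.cuda.mem_get_info(i)[0] for i in range(n_gpus)]
    # Budget on GPUs 1..N-1: free memory minus room for activations and temporaries.
    budgets = {i: max(free_per_gpu[i] - activation_reserve_bytes, 0) for i in range(1, n_gpus)}
    # Only the weights that do not fit elsewhere land on GPU 0.
    budgets[0] = max(param_bytes_total - sum(budgets.values()), 0)
    return budgets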
That is exactly what the new balanced_low_0 option does.
Thanks! Looks great!
I only left one doc nit.
Co-authored-by: Zachary Mueller <muellerzr@gmail.com>
This PR adds a new option to device_map creation: a mode that balances the GPUs when several are available and their combined memory is bigger than the model size. This permits users to handle a batch size greater than 1. Since there is no downside, this balanced way becomes the new "auto" behavior. The user can still get the old behavior with the "sequential" option, and can also explicitly request "balanced" (in case "auto" becomes something different in the future). There is also "balanced_low_0" for when we want to minimize the weights on GPU 0, e.g. if it is used for generation (cc @stas00).
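A usage sketch of how the new options can be consumed; the helper name get_balanced_memory and the exact import paths are assumptions inferred from the signature shown earlier, so treat this as illustrative rather than the PR's final API:

import torch
from accelerate import infer_auto_device_map, init_empty_weights
from accelerate.utils import get_balanced_memory  # assumed name of the new helper
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model skeleton without allocating real weights.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained("bigscience/bloom"))

# low_zero=True corresponds to "balanced_low_0": keep GPU 0 as light as possible
# so it has room for the tensors produced during generation.
max_memory = get_balanced_memory(model, dtype=torch.bfloat16, low_zero=True)
device_map = infer_auto_device_map(model, max_memory=max_memory, dtype=torch.bfloat16)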
@younesbelkada This was something you requested so cc-ing you here.
TODO: