[BUG] IndexError / Runtime Error with torch.nn.TransformerEncoder #1795

Comments
I was able to get your model running. You need to make your dataset half-precision (note the `.half()` in `BoringDataset` below), so that the inputs match the fp16 weights:

```python
import os

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

import deepspeed


class BoringDataset(Dataset):
    def __init__(self) -> None:
        super().__init__()
        # fp16 training requires half-precision inputs
        self.data = torch.randn(200, 128, 1024).half()

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index, :, :], self.data[index, :, :]


train_set = BoringDataset()
train_loader = DataLoader(train_set, batch_size=32)

ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 1,
    "steps_per_print": 1,
    "zero_optimization": {
        "stage": 3,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_param_persistence_threshold": 1e5,
        "stage3_prefetch_bucket_size": 5e7,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "reduce_bucket_size": 90000000,
        "sub_group_size": 1e8,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "optimizer": {"type": "Adam", "params": {"lr": 0.001, "betas": [0.9, 0.95]}},
    "fp16": {
        "enabled": True,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": True,
    "zero_allow_untested_optimizer": False,
    "aio": {
        "block_size": 1048576,
        "queue_depth": 16,
        "single_submit": False,
        "overlap_events": True,
        "thread_count": 2,
    },
}

encoder_layer = nn.TransformerEncoderLayer(d_model=1024, nhead=8)
transformer = nn.TransformerEncoder(encoder_layer=encoder_layer, num_layers=10)

model, _, _, _ = deepspeed.initialize(
    config=ds_config, model=transformer, model_parameters=transformer.parameters()
)

loss_fn = nn.MSELoss()
model.train()
rank = int(os.getenv("RANK", "0"))

for step, (batch, label) in enumerate(train_loader):
    if rank == 0:
        print("step:", step)
    batch = batch.cuda(rank)
    label = label.cuda(rank)
    output = model(batch)
    model.backward(loss_fn(output, label))
    model.step()
```
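The key change is constructing the dataset in half precision so that batches already match the fp16 module weights created when `"fp16": {"enabled": true}` is set. The dtype rule behind this can be seen with plain PyTorch, no DeepSpeed or GPU required; the layer sizes below are arbitrary and only for illustration:

```python
import torch
import torch.nn as nn

# Under fp16 training, module weights are stored as float16,
# so every input tensor must be float16 as well.
layer = nn.Linear(1024, 1024).half()   # fp16 weights, as under the fp16 engine
good = torch.randn(2, 1024).half()     # float16 input -> dtypes match
bad = torch.randn(2, 1024)             # default float32 -> dtype mismatch

print(layer.weight.dtype, good.dtype, bad.dtype)
assert layer.weight.dtype == good.dtype == torch.float16
assert bad.dtype == torch.float32
```

Casting in `__init__` (rather than per-batch in the loop) pays the conversion cost once, which is why the updated script does it in the dataset.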
Closing due to inactivity, please reopen if you are unable to run the model with the updated script.
Describe the bug
To Reproduce
I'm fairly new to DeepSpeed, and I tried ZeRO stage 3 with offload on a PyTorch transformer. I wrote a simple test script, but it failed with this traceback:
RuntimeError: expected scalar type Float but found Half
I looked through previous issues but found nothing that helped. I guessed I might need to half the input tensor manually, but then I got another traceback:
IndexError: list index out of range
Sorry, I could hardly find any hints in the DeepSpeed docs or elsewhere, so I'm posting an issue here.
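For reference, the first traceback can be reproduced without DeepSpeed at all: it appears whenever a half-precision module receives a float32 tensor. A minimal sketch (the layer size is arbitrary, and the exact error message varies across PyTorch versions):

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 8).half()   # fp16 weights, like a model under fp16 training
x = torch.randn(1, 8)            # float32 input -> dtype mismatch

try:
    layer(x)                     # raises RuntimeError on the dtype mismatch
    raised = False
except RuntimeError as e:
    raised = True
    print(type(e).__name__, e)

assert raised
```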
DeepSpeed JSON config
ds_report output
Please run ds_report to give us details about your setup.

Screenshots
If applicable, add screenshots to help explain your problem.
System info (please complete the following information):
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
The deepspeed launcher; here's my script.
Docker context
Are you using a specific docker image that you can share?
Additional context