[Question]: aquila-7B OOM #334
Comments
Same issue here.
My system information: |
Our engineers are looking into this issue. |
Where did you download the model files from? |
I used the code from here. |
Version 1.7.1 still hits this problem: the process gets killed after 32 GB of RAM is exhausted. |
I ran it on a 24 GB A5000 and it also exits unexpectedly, without even reporting an OOM error. |
Could you share the script you ran? |
Is that also on version 1.7.1? |
I didn't check the version; I just downloaded the whole FlagAI zip from GitHub the day before yesterday. |
The code was copied from here: https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila#3-%E6%8E%A8%E7%90%86inference

import os
import torch
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor
from flagai.data.tokenizer import Tokenizer
import bminf
state_dict = "./checkpoints_in/"
model_name = 'aquila-7b' # 'aquila-33b'
loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True)
model = loader.get_model()
tokenizer = loader.get_tokenizer()
model.eval()
model.half()
model.cuda()
predictor = Predictor(model, tokenizer)
text = "北京在哪儿?"
text = f'{text}'
print(f"text is {text}")
with torch.no_grad():
    out = predictor.predict_generate_randomsample(text, out_max_length=200, temperature=0)
print(f"pred is {out}")

Versions:
torch 2.0.1+cu118
flagai 1.7.1
bminf 2.0.1

Also: what gets exhausted is CPU RAM, not GPU RAM. |
Huh?! How much CPU memory does it need, then? |
Running the step-3 inference example from here still hits the OOM problem, with 40 GB of RAM and a V100 GPU. https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila#3-%E6%8E%A8%E7%90%86inference |
WSL2 is given 50 GB of RAM and 64 GB of swap. |
AquilaChat seems to have the same problem. Reproduction environment:

python3 -m venv .env
source .env/bin/activate
pip install -i https://mirrors.cloud.tencent.com/pypi/simple flagai
pip install -i https://mirrors.cloud.tencent.com/pypi/simple bminf
# Fix the missing torch._six module: replace 'from torch._six import inf' with 'from torch import inf'.
vim /home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/mpu/grads.py

Could this be a dependency version problem? Could the maintainers provide a requirements.txt?

$ pip freeze
absl-py==1.4.0
aiohttp==3.8.4
aiosignal==1.3.1
antlr4-python3-runtime==4.9.3
async-timeout==4.0.2
attrs==23.1.0
bminf==2.0.1
boto3==1.21.42
botocore==1.24.46
cachetools==5.3.1
certifi==2023.5.7
charset-normalizer==3.1.0
click==8.1.3
cmake==3.26.4
colorama==0.4.6
cpm-kernels==1.0.11
datasets==2.0.0
diffusers==0.7.2
dill==0.3.6
einops==0.3.0
filelock==3.12.1
flagai==1.7.1
frozenlist==1.3.3
fsspec==2023.6.0
ftfy==6.1.1
google-auth==2.19.1
google-auth-oauthlib==0.4.6
grpcio==1.54.2
huggingface-hub==0.15.1
idna==3.4
importlib-metadata==6.6.0
jieba==0.42.1
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.2.0
lit==16.0.5.post0
lxml==4.9.2
Markdown==3.4.3
MarkupSafe==2.1.3
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
networkx==3.1
nltk==3.6.7
numpy==1.24.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
oauthlib==3.2.2
omegaconf==2.3.0
packaging==23.1
pandas==1.3.5
Pillow==9.5.0
portalocker==2.7.0
protobuf==3.19.6
pyarrow==12.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pyDeprecate==0.3.2
python-dateutil==2.8.2
pytorch-lightning==1.6.5
pytz==2023.3
PyYAML==6.0
regex==2023.6.3
requests==2.31.0
requests-oauthlib==1.3.1
responses==0.18.0
rouge-score==0.1.2
rsa==4.9
s3transfer==0.5.2
sacrebleu==2.3.1
scikit-learn==1.0.2
scipy==1.10.1
sentencepiece==0.1.96
six==1.16.0
sympy==1.12
tabulate==0.9.0
taming-transformers-rom1504==0.0.6
tensorboard==2.9.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
threadpoolctl==3.1.0
tokenizers==0.12.1
torch==2.0.1
torchmetrics==0.11.4
torchvision==0.15.2
tqdm==4.65.0
transformers==4.20.1
triton==2.0.0
typing-extensions==4.6.3
urllib3==1.26.16
wcwidth==0.2.6
Werkzeug==2.3.6
xxhash==3.2.0
yarl==1.9.2
zipp==3.15.0 |
I got it working: in the inference code, add a device="cuda" argument so the model is loaded directly onto the GPU (previously it was loaded onto the CPU first, and I have no idea why). After loading, GPU memory usage is 28 GB; after clearing the cache, 16 GB. This is the 7B model. |
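For reference, a minimal sketch of this workaround applied to the example earlier in the thread, assuming the installed FlagAI AutoLoader accepts the device argument (later comments in this thread pass it the same way):

from flagai.auto_model.auto_loader import AutoLoader

state_dict = "./checkpoints_in/"
model_name = 'aquila-7b'

# device="cuda" asks AutoLoader to place the weights on the GPU while loading,
# instead of materializing the full model in CPU RAM first.
loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True,
    device="cuda")
model = loader.get_model()
tokenizer = loader.get_tokenizer()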
Thanks.

OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.68 GiB total capacity; 22.89 GiB already allocated; 21.31 MiB free; 22.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The LLaMA-family 7B models run without this problem. |
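As an aside, the max_split_size_mb hint in that message refers to PyTorch's caching-allocator configuration; a hedged sketch of setting it (the value 128 is only an illustration, not a recommendation from this thread):

import os

# Must be set before the first CUDA allocation (i.e. before the model is loaded).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"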
You can try clearing the CUDA cache first. |
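For reference, clearing the cache is a one-liner; it only returns cached, unused blocks to the driver and does not free tensors that are still referenced:

import torch

torch.cuda.empty_cache()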
I deployed the test script as a service; each call increases GPU memory usage, and after a few calls it hits OOM. |
Which version of flagai are you using? |
@ftgreat
Service code (excerpt):

import asyncio
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
state_dict = "./checkpoints_in"
loader = AutoLoader(
model.eval()
predictor = Predictor(model, tokenizer)
def default_dump(obj):
async def main_logic(websocket, path):
async def start_server():
if __name__ == "__main__":

The return in the called method was changed to yield. GPU memory grows by about 1 GB per call. |
I think the predict part needs to be wrapped in no_grad, otherwise GPU memory keeps growing. |
OK, thanks! I added the @torch.no_grad() decorator to the method and the memory no longer grows. |
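A minimal sketch of that fix, using a hypothetical handler name (the real service code is only partially quoted above):

import torch

@torch.no_grad()  # inference only: prevents autograd from keeping activations alive between calls
def handle_request(predictor, text):
    return predictor.predict_generate_randomsample(text, out_max_length=200, temperature=0)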
You can try flagai 1.7.2; it needs 32 GB of RAM and 16 GB of GPU memory (for the model plus one 2048-token sequence). |
Thanks for the reply! After upgrading to 1.7.2, the RTX 3090 still reports a GPU OOM error.

[2023-06-14 00:46:31,934] [INFO] [logger.py:85:log_dist] [Rank -1] Unsupported bmtrain
******************** lm aquilachat-7b
Traceback (most recent call last):
File "chat.py", line 10, in <module>
loader = AutoLoader(
File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/auto_model/auto_loader.py", line 216, in __init__
self.model = getattr(LazyImport(self.model_name[0]),
File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/base_model.py", line 184, in from_pretrain
return load_local(checkpoint_path, only_download_config=only_download_config)
File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/base_model.py", line 116, in load_local
model.to(device)
File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1145, in to
return self._apply(convert)
File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
param_applied = fn(param)
File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1143, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.68 GiB total capacity; 23.22 GiB already allocated; 169.31 MiB free; 23.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Code used:

import os
import torch
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor
from flagai.model.predictor.aquila import aquila_generate
state_dict = "./checkpoints_in"
model_name = 'aquilachat-7b'
loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True,
    device='cuda')
model = loader.get_model()
tokenizer = loader.get_tokenizer()
cache_dir = os.path.join(state_dict, model_name)
model.eval()
model.half()
model.cuda()
predictor = Predictor(model, tokenizer)
text = "北京为什么是中国的首都?"
def pack_obj(text):
    obj = dict()
    obj['id'] = 'demo'
    obj['conversations'] = []
    human = dict()
    human['from'] = 'human'
    human['value'] = text
    obj['conversations'].append(human)
    # dummy bot
    bot = dict()
    bot['from'] = 'gpt'
    bot['value'] = ''
    obj['conversations'].append(bot)
    obj['instruction'] = ''
    return obj
def delete_last_bot_end_singal(convo_obj):
    conversations = convo_obj['conversations']
    assert len(conversations) > 0 and len(conversations) % 2 == 0
    assert conversations[0]['from'] == 'human'
    last_bot = conversations[len(conversations)-1]
    assert last_bot['from'] == 'gpt'
    ## from _add_speaker_and_signal
    END_SIGNAL = "\n"
    len_end_singal = len(END_SIGNAL)
    len_last_bot_value = len(last_bot['value'])
    last_bot['value'] = last_bot['value'][:len_last_bot_value-len_end_singal]
    return
def convo_tokenize(convo_obj, tokenizer):
    chat_desc = convo_obj['chat_desc']
    instruction = convo_obj['instruction']
    conversations = convo_obj['conversations']
    # chat_desc
    example = tokenizer.encode_plus(f"{chat_desc}", None, max_length=None)['input_ids']
    EOS_TOKEN = example[-1]
    example = example[:-1] # remove eos
    # instruction
    instruction = tokenizer.encode_plus(f"{instruction}", None, max_length=None)['input_ids']
    instruction = instruction[1:-1] # remove bos & eos
    example += instruction
    for conversation in conversations:
        role = conversation['from']
        content = conversation['value']
        print(f"role {role}, raw content {content}")
        content = tokenizer.encode_plus(f"{content}", None, max_length=None)['input_ids']
        content = content[1:-1] # remove bos & eos
        print(f"role {role}, content {content}")
        example += content
    return example
print('-'*80)
print(f"text is {text}")
from cyg_conversation import default_conversation
conv = default_conversation.copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)
tokens = tokenizer.encode_plus(f"{conv.get_prompt()}", None, max_length=None)['input_ids']
tokens = tokens[1:-1]
with torch.no_grad():
    out = aquila_generate(tokenizer, model, [text], max_gen_len:=200, top_p=0.95, prompts_tokens=[tokens])
print(f"pred is {out}")

Also, the 1.7.2 package uploaded to PyPI does not match the 1.7.2 on GitHub. The PyPI package throws this error:

Traceback (most recent call last):
File "chat.py", line 4, in <module>
from flagai.model.predictor.predictor import Predictor
File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/predictor/predictor.py", line 22, in <module>
from .aquila import aquila_generate
File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/predictor/aquila.py", line 6
def aquila_generate(
^
SyntaxError: duplicate argument 'top_k' in function definition

The file contains:

def aquila_generate(
    tokenizer,
    model,
    prompts: List[str],
    max_gen_len: int,
    temperature: float = 0.8,
    top_k: int = 30,
    top_p: float = 0.95,
    top_k: int = 30,  # duplicated parameter
    prompts_tokens: List[List[int]] = None,
) -> List[str]:
    ... |
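For comparison, a sketch of the obvious local fix, simply dropping the repeated keyword (the actual patched release may differ):

from typing import List

def aquila_generate(
    tokenizer,
    model,
    prompts: List[str],
    max_gen_len: int,
    temperature: float = 0.8,
    top_k: int = 30,
    top_p: float = 0.95,
    prompts_tokens: List[List[int]] = None,
) -> List[str]:
    ...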
A release with the fix will go out today. |
After updating to 1.7.3 and enabling FP16 precision, it runs successfully on the RTX 3090.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 Off | N/A |
| 0% 34C P8 32W / 350W| 15283MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1955360 C python3 15280MiB |
+---------------------------------------------------------------------------------------+

Using FP16 precision:

loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True,
    fp16=True) |
Closing the issue for now; please reopen if the problem persists. Thanks. |
Description
Running the aquila-7B inference example code on a 32 GB GPU reports out of memory. How much GPU memory is required?
Other 7B large models run fine; does the Aquila model consume noticeably more GPU memory?
Alternatives
No response