
[Question]: aquila-7B OOM #334

Closed
calla212 opened this issue Jun 10, 2023 · 28 comments
Labels
question Further information is requested

Comments

@calla212

Description

Running the Aquila-7B inference example code on a 32G GPU reports out of memory. How much GPU memory does it need?
Other 7B models run fine, so does the Aquila model consume noticeably more GPU memory?

Alternatives

No response

@calla212 calla212 added the question Further information is requested label Jun 10, 2023
@huntzhan

huntzhan commented Jun 10, 2023

Same issue here:

  1. Loading the aquila-7b / aquilachat-7b model takes up to ~107G of host memory.
  2. After moving the model to CUDA, the program still uses ~65G of host memory (rough numbers; see the snippet below).
  3. Inference on a 3090 (24G) always triggers a CUDA OOM error.
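
A minimal sketch of how these numbers can be read off (my own snippet, assuming psutil is installed; the loading calls mirror the example script later in this thread):

import os
import psutil
import torch

proc = psutil.Process(os.getpid())

def log_mem(tag):
    # Resident host memory of this process plus the GPU memory PyTorch has allocated.
    host_gb = proc.memory_info().rss / 1024**3
    gpu_gb = torch.cuda.memory_allocated() / 1024**3 if torch.cuda.is_available() else 0.0
    print(f"[{tag}] host RSS: {host_gb:.1f} GiB, CUDA allocated: {gpu_gb:.1f} GiB")

log_mem("before load")
# ... build AutoLoader / get_model() here, as in the example script below ...
log_mem("after load")
# ... model.half(); model.cuda() ...
log_mem("after moving to CUDA")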

My system information:

minerva@worker
--------------
OS: Ubuntu 20.04.3 LTS x86_64
Host: Super Server 0123456789
Kernel: 5.4.0-125-generic
Uptime: 145 days, 29 mins
Packages: 756 (dpkg), 5 (snap)
Shell: bash 5.0.17
Resolution: 1024x768
Terminal: /dev/pts/22
CPU: Intel Xeon E5-2690 v4 (56) @ 3.500GHz
GPU: NVIDIA 83:00.0 NVIDIA Corporation Device 2204
GPU: NVIDIA 82:00.0 NVIDIA Corporation Device 2204
GPU: NVIDIA 02:00.0 NVIDIA Corporation Device 2204
GPU: NVIDIA 03:00.0 NVIDIA Corporation Device 2204
Memory: 1940MiB / 257821MiB

@ftgreat
Collaborator

ftgreat commented Jun 10, 2023

Our engineers are looking into this issue.

@ftgreat
Collaborator

ftgreat commented Jun 10, 2023

Fixed.
[image]
We'll publish a fix release later; please update when it's out.

@hanswang73

(Quoting the issue description above.)

Where did you download the model files from?

@calla212
Author

I used the code from here.

@yinguobing

Version 1.7.1 still hits this problem. The process gets killed after exhausting 32G of RAM.

@hanswang73

Running on a 24GB A5000, the process also exits unexpectedly, without even reporting an OOM error.

@ftgreat
Collaborator

ftgreat commented Jun 12, 2023

Version 1.7.1 still hits this problem. The process gets killed after exhausting 32G of RAM.

Could you share the script you ran?

@ftgreat
Collaborator

ftgreat commented Jun 12, 2023

Running on a 24GB A5000, the process also exits unexpectedly, without even reporting an OOM error.

Is that also on version 1.7.1?

@hanswang73

Running on a 24GB A5000, the process also exits unexpectedly, without even reporting an OOM error.

Is that also on version 1.7.1?

I didn't check the version; I just downloaded the whole FlagAI zip from GitHub the day before yesterday.

@yinguobing

Version 1.7.1 still hits this problem. The process gets killed after exhausting 32G of RAM.

Could you share the script you ran?

The code was copied from here: https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila#3-%E6%8E%A8%E7%90%86inference

import os
import torch
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor
from flagai.data.tokenizer import Tokenizer
import bminf

state_dict = "./checkpoints_in/"
model_name = 'aquila-7b' # 'aquila-33b'

loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True)
model = loader.get_model()
tokenizer = loader.get_tokenizer()

model.eval()
model.half()
model.cuda()

predictor = Predictor(model, tokenizer)

text = "北京在哪儿?"
text = f'{text}' 
print(f"text is {text}")
with torch.no_grad():
    out = predictor.predict_generate_randomsample(text, out_max_length=200, temperature=0)
    print(f"pred is {out}")

Versions:

torch                       2.0.1+cu118          
flagai                      1.7.1                
bminf                       2.0.1                

Also, I replaced from torch._six import inf with from torch import inf.
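
As an alternative to editing the installed file, a small compatibility shim placed before the flagai import also works (my own sketch, not an official fix; torch.inf exists on the PyTorch 2.x builds listed above):

import sys
import types
import torch

try:
    from torch._six import inf  # present on older PyTorch releases
except ImportError:
    # Newer PyTorch dropped torch._six, but flagai.mpu.grads still imports inf from it.
    # Registering a stub module lets that import succeed without patching site-packages.
    six_stub = types.ModuleType("torch._six")
    six_stub.inf = torch.inf
    sys.modules["torch._six"] = six_stub

from flagai.auto_model.auto_loader import AutoLoader  # import flagai only after the stub exists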

It's the CPU RAM that gets exhausted, not the GPU memory.

@hanswang73

(Quoting the script and version list above.)

What?! How much CPU RAM does it need, then?

@hazy217

hazy217 commented Jun 12, 2023

Fixed. [image] We'll publish a fix release later; please update when it's out.

Running the step-3 inference example from here still runs into OOM, with 40GB of RAM and a V100 GPU.

https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila#3-%E6%8E%A8%E7%90%86inference

@ruolunhui

WSL2 was given 50G of RAM and 64G of swap.
With 24G of VRAM it still reports insufficient GPU memory.

@yinguobing

It looks like AquilaChat has the same problem.
Code used: https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila/Aquila-chat#1-%E6%8E%A8%E7%90%86inference

Environment to reproduce:

python3 -m venv .env
source .env/bin/activate
pip install -i https://mirrors.cloud.tencent.com/pypi/simple flagai
pip install -i https://mirrors.cloud.tencent.com/pypi/simple bminf
# Fix the missing torch._six: replace from torch._six import inf with from torch import inf.
vim /home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/mpu/grads.py

Could this be a dependency version issue? Could the maintainers provide a requirements.txt?

$ pip freeze
absl-py==1.4.0
aiohttp==3.8.4
aiosignal==1.3.1
antlr4-python3-runtime==4.9.3
async-timeout==4.0.2
attrs==23.1.0
bminf==2.0.1
boto3==1.21.42
botocore==1.24.46
cachetools==5.3.1
certifi==2023.5.7
charset-normalizer==3.1.0
click==8.1.3
cmake==3.26.4
colorama==0.4.6
cpm-kernels==1.0.11
datasets==2.0.0
diffusers==0.7.2
dill==0.3.6
einops==0.3.0
filelock==3.12.1
flagai==1.7.1
frozenlist==1.3.3
fsspec==2023.6.0
ftfy==6.1.1
google-auth==2.19.1
google-auth-oauthlib==0.4.6
grpcio==1.54.2
huggingface-hub==0.15.1
idna==3.4
importlib-metadata==6.6.0
jieba==0.42.1
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.2.0
lit==16.0.5.post0
lxml==4.9.2
Markdown==3.4.3
MarkupSafe==2.1.3
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
networkx==3.1
nltk==3.6.7
numpy==1.24.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
oauthlib==3.2.2
omegaconf==2.3.0
packaging==23.1
pandas==1.3.5
Pillow==9.5.0
portalocker==2.7.0
protobuf==3.19.6
pyarrow==12.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pyDeprecate==0.3.2
python-dateutil==2.8.2
pytorch-lightning==1.6.5
pytz==2023.3
PyYAML==6.0
regex==2023.6.3
requests==2.31.0
requests-oauthlib==1.3.1
responses==0.18.0
rouge-score==0.1.2
rsa==4.9
s3transfer==0.5.2
sacrebleu==2.3.1
scikit-learn==1.0.2
scipy==1.10.1
sentencepiece==0.1.96
six==1.16.0
sympy==1.12
tabulate==0.9.0
taming-transformers-rom1504==0.0.6
tensorboard==2.9.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
threadpoolctl==3.1.0
tokenizers==0.12.1
torch==2.0.1
torchmetrics==0.11.4
torchvision==0.15.2
tqdm==4.65.0
transformers==4.20.1
triton==2.0.0
typing-extensions==4.6.3
urllib3==1.26.16
wcwidth==0.2.6
Werkzeug==2.3.6
xxhash==3.2.0
yarl==1.9.2
zipp==3.15.0

@hanswang73

I got it working: in the inference code, add a device="cuda" argument and the model is loaded directly onto the GPU (previously it was loaded onto the CPU first; I don't know why). After loading, GPU memory usage is 28GB, and 16GB after clearing the cache. This is the 7B model.
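
In code that just means passing the extra argument to AutoLoader (a sketch based on the example script earlier in this thread; the same keyword appears in the working snippet further down):

from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor

state_dict = "./checkpoints_in"
model_name = 'aquila-7b'

# device="cuda" loads the checkpoint straight onto the GPU instead of staging it in host RAM first.
loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True,
    device="cuda")

model = loader.get_model()
tokenizer = loader.get_tokenizer()
model.eval()
model.half()

predictor = Predictor(model, tokenizer)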

@yinguobing

Thanks. After adding device="cuda" to AutoLoader, I now get an error that 24G of VRAM is not enough.

OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.68 GiB total capacity; 22.89 GiB already 
allocated; 21.31 MiB free; 22.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting 
max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The LLaMA-family 7B models run without problems.

@ftgreat
Collaborator

ftgreat commented Jun 13, 2023

You can try clearing the CUDA cache first.
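
For example (a minimal sketch; gc.collect() is optional but releases Python-side references first):

import gc
import torch

gc.collect()              # drop unreferenced Python objects that may still hold GPU tensors
torch.cuda.empty_cache()  # return cached but unused memory blocks to the driver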

@safehumeng

I deployed the test script as a service; GPU memory grows with every call, and after a few calls it hits OOM.

@ftgreat
Collaborator

ftgreat commented Jun 13, 2023

I deployed the test script as a service; GPU memory grows with every call, and after a few calls it hits OOM.

Which version of flagai are you using?
Could you share the service code?

@safehumeng

safehumeng commented Jun 13, 2023

I deployed the test script as a service; GPU memory grows with every call, and after a few calls it hits OOM.

Which version of flagai are you using? Could you share the service code?

@ftgreat
I ran it directly from the repo root, and the branch used is this one.

Service code:

import asyncio
import websockets
import json
import numpy as np
import os
import torch
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor_web import Predictor
from flagai.data.tokenizer import Tokenizer
import bminf

os.environ['CUDA_VISIBLE_DEVICES'] = '1'

state_dict = "./checkpoints_in"
model_name = 'aquila-7b' # 'aquila-33b'

loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True)
model = loader.get_model()
tokenizer = loader.get_tokenizer()

model.eval()
model.half()
model.cuda()

predictor = Predictor(model, tokenizer)

def default_dump(obj):
    """Convert numpy classes to JSON serializable objects."""
    if isinstance(obj, (np.integer, np.floating, np.bool_)):
        return obj.item()
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return obj

async def main_logic(websocket, path):
    data = await websocket.recv()
    request_json = json.loads(data)
    print(request_json)
    query = request_json["prompt"]
    use_stream = request_json["stream"] if "stream" in request_json else False
    max_length = request_json["maxTokens"] if "maxTokens" in request_json else 320
    top_k = request_json["topK"] if "topK" in request_json else 50
    temperature = request_json["temperature"] if "temperature" in request_json else 0.95
    top_p = request_json["topP"] if "topP" in request_json else 0.7
    do_sample = request_json["useRandom"] if "useRandom" in request_json else False
    logprobs = request_json["logprobs"] if "logprobs" in request_json else 0
    with torch.autocast("cuda"):
        g_index = 0
        for re_data in predictor.predict_generate_randomsample(query,
                                                               total_max_length=max_length,
                                                               top_k=top_k,
                                                               top_p=top_p,
                                                               temperature=temperature,
                                                               prompts_tokens=None):
            print(re_data)
            # await websocket.send(json.dumps(re_data, ensure_ascii=False, default=default_dump))
            if "result" in re_data:
                re_data["result"]["index"] = g_index
            # await websocket.send(re_data.lstrip("").rstrip(""))
            if re_data["finish"]:
                await websocket.send(json.dumps(re_data, ensure_ascii=False, default=default_dump))
                break
            else:
                if use_stream and re_data["usage"]["totalTokens"] % 5 == 0 and re_data["usage"]["totalTokens"] >= 20:
                    await websocket.send(json.dumps(re_data, ensure_ascii=False, default=default_dump))
            g_index += 1
    await websocket.send("close")

async def start_server():
    server = await websockets.serve(main_logic, '0.0.0.0', 17862)
    await server.wait_closed()

if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(start_server())
    asyncio.get_event_loop().run_forever()

The return in the referenced predictor method was changed to yield (for streaming).

GPU memory grows by about 1G per call.

@ftgreat
Collaborator

ftgreat commented Jun 13, 2023

no_grad

I think the predict part needs to be wrapped in no_grad; otherwise GPU memory will keep growing.

@safehumeng

no_grad

I think the predict part needs to be wrapped in no_grad; otherwise GPU memory will keep growing.

OK, thanks! I added the @torch.no_grad() decorator to the method and memory no longer grows.
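
For reference, a sketch of what that looks like on a thin wrapper around the streaming call (the wrapper name is my own, not a flagai API):

import torch

@torch.no_grad()  # no autograd graph is built, so activations are freed between requests
def generate_no_grad(predictor, query, **gen_kwargs):
    # Illustrative helper: forwards to the modified streaming predictor
    # while keeping gradient tracking disabled for every yielded chunk.
    yield from predictor.predict_generate_randomsample(query, **gen_kwargs)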

@ftgreat
Collaborator

ftgreat commented Jun 13, 2023

(Quoting the 24G OOM report after adding device="cuda" above.)

You can try flagai 1.7.2: it needs 32G of RAM and 16G of VRAM (covering the model plus one 2048-token sequence).

@yinguobing

(Quoting the 24G OOM report and the flagai 1.7.2 suggestion above.)

Thanks for the reply!

After upgrading to 1.7.2, the RTX 3090 still reports a GPU OOM error.

[2023-06-14 00:46:31,934] [INFO] [logger.py:85:log_dist] [Rank -1] Unsupported bmtrain
******************** lm aquilachat-7b
Traceback (most recent call last):
  File "chat.py", line 10, in <module>
    loader = AutoLoader(
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/auto_model/auto_loader.py", line 216, in __init__
    self.model = getattr(LazyImport(self.model_name[0]),
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/base_model.py", line 184, in from_pretrain
    return load_local(checkpoint_path, only_download_config=only_download_config)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/base_model.py", line 116, in load_local
    model.to(device)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.68 GiB total capacity; 23.22 GiB already allocated; 169.31 MiB free; 23.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Code used:

import os
import torch
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor
from flagai.model.predictor.aquila import aquila_generate

state_dict = "./checkpoints_in"
model_name = 'aquilachat-7b'

loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True,
    device='cuda')

model = loader.get_model()
tokenizer = loader.get_tokenizer()
cache_dir = os.path.join(state_dict, model_name)
model.eval()
model.half()
model.cuda()

predictor = Predictor(model, tokenizer)

text = "北京为什么是中国的首都?"

def pack_obj(text):
    obj = dict()
    obj['id'] = 'demo'

    obj['conversations'] = []
    human = dict()
    human['from'] = 'human'
    human['value'] = text
    obj['conversations'].append(human)
    # dummy bot
    bot = dict()
    bot['from'] = 'gpt'
    bot['value'] = ''
    obj['conversations'].append(bot)

    obj['instruction'] = ''

    return obj

def delete_last_bot_end_singal(convo_obj):
    conversations = convo_obj['conversations']
    assert len(conversations) > 0 and len(conversations) % 2 == 0
    assert conversations[0]['from'] == 'human'

    last_bot = conversations[len(conversations)-1]
    assert last_bot['from'] == 'gpt'

    ## from _add_speaker_and_signal
    END_SIGNAL = "\n"
    len_end_singal = len(END_SIGNAL)
    len_last_bot_value = len(last_bot['value'])
    last_bot['value'] = last_bot['value'][:len_last_bot_value-len_end_singal]
    return

def convo_tokenize(convo_obj, tokenizer):
    chat_desc = convo_obj['chat_desc']
    instruction = convo_obj['instruction']
    conversations = convo_obj['conversations']
            
    # chat_desc
    example = tokenizer.encode_plus(f"{chat_desc}", None, max_length=None)['input_ids']
    EOS_TOKEN = example[-1]
    example = example[:-1] # remove eos
    # instruction
    instruction = tokenizer.encode_plus(f"{instruction}", None, max_length=None)['input_ids']
    instruction = instruction[1:-1] # remove bos & eos
    example += instruction

    for conversation in conversations:
        role = conversation['from']
        content = conversation['value']
        print(f"role {role}, raw content {content}")
        content = tokenizer.encode_plus(f"{content}", None, max_length=None)['input_ids']
        content = content[1:-1] # remove bos & eos
        print(f"role {role}, content {content}")
        example += content
    return example

print('-'*80)
print(f"text is {text}")

from cyg_conversation import default_conversation

conv = default_conversation.copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)

tokens = tokenizer.encode_plus(f"{conv.get_prompt()}", None, max_length=None)['input_ids']
tokens = tokens[1:-1]

with torch.no_grad():
    out = aquila_generate(tokenizer, model, [text], max_gen_len:=200, top_p=0.95, prompts_tokens=[tokens])
    print(f"pred is {out}")

Also, the 1.7.2 version uploaded to PyPI differs from the 1.7.2 on GitHub. The PyPI package raises an error:

Traceback (most recent call last):
  File "chat.py", line 4, in <module>
    from flagai.model.predictor.predictor import Predictor
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/predictor/predictor.py", line 22, in <module>
    from .aquila import aquila_generate
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/predictor/aquila.py", line 6
    def aquila_generate(
    ^
SyntaxError: duplicate argument 'top_k' in function definition

Line 14 of flagai/model/predictor/aquila.py has a duplicated parameter:

def aquila_generate(
        tokenizer,
        model,
        prompts: List[str],
        max_gen_len: int,
        temperature: float = 0.8,
        top_k: int = 30,
        top_p: float = 0.95,
        top_k: int = 30, # duplicated parameter
        prompts_tokens: List[List[int]] = None,
    ) -> List[str]:
    ...
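
Presumably the intended signature simply drops the duplicated parameter, i.e.:

from typing import List

def aquila_generate(
        tokenizer,
        model,
        prompts: List[str],
        max_gen_len: int,
        temperature: float = 0.8,
        top_k: int = 30,
        top_p: float = 0.95,
        prompts_tokens: List[List[int]] = None,
    ) -> List[str]:
    ...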

@ftgreat
Collaborator

ftgreat commented Jun 14, 2023

(Quoting the full 1.7.2 report above: the OOM traceback, the script, and the duplicated top_k parameter in the PyPI package.)

We will publish a fix release today.

@yinguobing

After updating to 1.7.3 and using FP16 precision, it runs successfully on the RTX 3090.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:01:00.0 Off |                  N/A |
|  0%   34C    P8               32W / 350W|  15283MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1955360      C   python3                                   15280MiB |
+---------------------------------------------------------------------------------------+

Using FP16 precision:

loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True,
    fp16=True)

@ftgreat
Collaborator

ftgreat commented Jun 19, 2023

Closing this issue for now; please reopen if the problem persists. Thanks.

@ftgreat ftgreat closed this as completed Jun 19, 2023