
[Question]: aquila-7B OOM #334

Closed
calla212 opened this issue Jun 10, 2023 · 28 comments
Labels
question Further information is requested

Comments

@calla212

Description

Running the Aquila-7B inference example code on a 32G GPU reports out of memory. How much GPU memory does it need?
Other 7B models run fine, so does the Aquila model consume noticeably more GPU memory?

Alternatives

No response

@calla212 calla212 added the question Further information is requested label Jun 10, 2023
@huntzhan

huntzhan commented Jun 10, 2023

Same issue here:

  1. Loading the aquila-7b / aquilachat-7b model takes up to ~107G of host memory.
  2. After moving the model to CUDA, the program still uses ~65G of host memory (rough numbers; see the snippet below).
  3. Inference on a 3090 (24G) always triggers a CUDA OOM error.
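
A minimal sketch of how these numbers can be read off (my own snippet, assuming psutil is installed; the loading calls mirror the example script later in this thread):

import os
import psutil
import torch

proc = psutil.Process(os.getpid())

def log_mem(tag):
    # Resident host memory of this process plus the GPU memory PyTorch has allocated.
    host_gb = proc.memory_info().rss / 1024**3
    gpu_gb = torch.cuda.memory_allocated() / 1024**3 if torch.cuda.is_available() else 0.0
    print(f"[{tag}] host RSS: {host_gb:.1f} GiB, CUDA allocated: {gpu_gb:.1f} GiB")

log_mem("before load")
# ... build AutoLoader / get_model() here, as in the example script below ...
log_mem("after load")
# ... model.half(); model.cuda() ...
log_mem("after moving to CUDA")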

My system information:

minerva@worker
--------------
OS: Ubuntu 20.04.3 LTS x86_64
Host: Super Server 0123456789
Kernel: 5.4.0-125-generic
Uptime: 145 days, 29 mins
Packages: 756 (dpkg), 5 (snap)
Shell: bash 5.0.17
Resolution: 1024x768
Terminal: /dev/pts/22
CPU: Intel Xeon E5-2690 v4 (56) @ 3.500GHz
GPU: NVIDIA 83:00.0 NVIDIA Corporation Device 2204
GPU: NVIDIA 82:00.0 NVIDIA Corporation Device 2204
GPU: NVIDIA 02:00.0 NVIDIA Corporation Device 2204
GPU: NVIDIA 03:00.0 NVIDIA Corporation Device 2204
Memory: 1940MiB / 257821MiB

@ftgreat
Collaborator

ftgreat commented Jun 10, 2023

Our engineers are looking into this issue.

@ftgreat
Collaborator

ftgreat commented Jun 10, 2023

Fixed.
[image]
We'll publish a fix release later; please update when it's out.

@hanswang73

(Quoting the issue description above.)

Where did you download the model files from?

@calla212
Author

I used the code from here.

@yinguobing

Version 1.7.1 still hits this problem. The process gets killed after exhausting 32G of RAM.

@hanswang73

Running on a 24GB A5000, the process also exits unexpectedly, without even reporting an OOM error.

@ftgreat
Collaborator

ftgreat commented Jun 12, 2023

Version 1.7.1 still hits this problem. The process gets killed after exhausting 32G of RAM.

Could you share the script you ran?

@ftgreat
Collaborator

ftgreat commented Jun 12, 2023

Running on a 24GB A5000, the process also exits unexpectedly, without even reporting an OOM error.

Is that also on version 1.7.1?

@hanswang73

Running on a 24GB A5000, the process also exits unexpectedly, without even reporting an OOM error.

Is that also on version 1.7.1?

I didn't check the version; I just downloaded the whole FlagAI zip from GitHub the day before yesterday.

@yinguobing

Version 1.7.1 still hits this problem. The process gets killed after exhausting 32G of RAM.

Could you share the script you ran?

The code was copied from here: https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila#3-%E6%8E%A8%E7%90%86inference

import os
import torch
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor
from flagai.data.tokenizer import Tokenizer
import bminf

state_dict = "./checkpoints_in/"
model_name = 'aquila-7b' # 'aquila-33b'

loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True)
model = loader.get_model()
tokenizer = loader.get_tokenizer()

model.eval()
model.half()
model.cuda()

predictor = Predictor(model, tokenizer)

text = "北京在哪儿?"
text = f'{text}' 
print(f"text is {text}")
with torch.no_grad():
    out = predictor.predict_generate_randomsample(text, out_max_length=200, temperature=0)
    print(f"pred is {out}")

Versions:

torch                       2.0.1+cu118          
flagai                      1.7.1                
bminf                       2.0.1                

Also, I replaced from torch._six import inf with from torch import inf.
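
As an alternative to editing the installed file, a small compatibility shim placed before the flagai import also works (my own sketch, not an official fix; torch.inf exists on the PyTorch 2.x builds listed above):

import sys
import types
import torch

try:
    from torch._six import inf  # present on older PyTorch releases
except ImportError:
    # Newer PyTorch dropped torch._six, but flagai.mpu.grads still imports inf from it.
    # Registering a stub module lets that import succeed without patching site-packages.
    six_stub = types.ModuleType("torch._six")
    six_stub.inf = torch.inf
    sys.modules["torch._six"] = six_stub

from flagai.auto_model.auto_loader import AutoLoader  # import flagai only after the stub exists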

It's the CPU RAM that gets exhausted, not the GPU memory.

@hanswang73

(Quoting the script and version list above.)

What?! How much CPU RAM does it need, then?

@hazy217

hazy217 commented Jun 12, 2023

Fixed. [image] We'll publish a fix release later; please update when it's out.

Running the step-3 inference example from here still runs into OOM, with 40GB of RAM and a V100 GPU.

https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila#3-%E6%8E%A8%E7%90%86inference

@ruolunhui

WSL2 was given 50G of RAM and 64G of swap.
With 24G of VRAM it still reports insufficient GPU memory.

@yinguobing

It looks like AquilaChat has the same problem.
Code used: https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila/Aquila-chat#1-%E6%8E%A8%E7%90%86inference

Environment to reproduce:

python3 -m venv .env
source .env/bin/activate
pip install -i https://mirrors.cloud.tencent.com/pypi/simple flagai
pip install -i https://mirrors.cloud.tencent.com/pypi/simple bminf
# Fix the missing torch._six: replace from torch._six import inf with from torch import inf.
vim /home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/mpu/grads.py

Could this be a dependency version issue? Could the maintainers provide a requirements.txt?

$ pip freeze
absl-py==1.4.0
aiohttp==3.8.4
aiosignal==1.3.1
antlr4-python3-runtime==4.9.3
async-timeout==4.0.2
attrs==23.1.0
bminf==2.0.1
boto3==1.21.42
botocore==1.24.46
cachetools==5.3.1
certifi==2023.5.7
charset-normalizer==3.1.0
click==8.1.3
cmake==3.26.4
colorama==0.4.6
cpm-kernels==1.0.11
datasets==2.0.0
diffusers==0.7.2
dill==0.3.6
einops==0.3.0
filelock==3.12.1
flagai==1.7.1
frozenlist==1.3.3
fsspec==2023.6.0
ftfy==6.1.1
google-auth==2.19.1
google-auth-oauthlib==0.4.6
grpcio==1.54.2
huggingface-hub==0.15.1
idna==3.4
importlib-metadata==6.6.0
jieba==0.42.1
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.2.0
lit==16.0.5.post0
lxml==4.9.2
Markdown==3.4.3
MarkupSafe==2.1.3
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
networkx==3.1
nltk==3.6.7
numpy==1.24.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
oauthlib==3.2.2
omegaconf==2.3.0
packaging==23.1
pandas==1.3.5
Pillow==9.5.0
portalocker==2.7.0
protobuf==3.19.6
pyarrow==12.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pyDeprecate==0.3.2
python-dateutil==2.8.2
pytorch-lightning==1.6.5
pytz==2023.3
PyYAML==6.0
regex==2023.6.3
requests==2.31.0
requests-oauthlib==1.3.1
responses==0.18.0
rouge-score==0.1.2
rsa==4.9
s3transfer==0.5.2
sacrebleu==2.3.1
scikit-learn==1.0.2
scipy==1.10.1
sentencepiece==0.1.96
six==1.16.0
sympy==1.12
tabulate==0.9.0
taming-transformers-rom1504==0.0.6
tensorboard==2.9.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
threadpoolctl==3.1.0
tokenizers==0.12.1
torch==2.0.1
torchmetrics==0.11.4
torchvision==0.15.2
tqdm==4.65.0
transformers==4.20.1
triton==2.0.0
typing-extensions==4.6.3
urllib3==1.26.16
wcwidth==0.2.6
Werkzeug==2.3.6
xxhash==3.2.0
yarl==1.9.2
zipp==3.15.0

@hanswang73

I got it working: in the inference code, add a device="cuda" argument and the model is loaded directly onto the GPU (previously it was loaded onto the CPU first; I don't know why). After loading, GPU memory usage is 28GB, and 16GB after clearing the cache. This is the 7B model.
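
In code that just means passing the extra argument to AutoLoader (a sketch based on the example script earlier in this thread; the same keyword appears in the working snippet further down):

from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor

state_dict = "./checkpoints_in"
model_name = 'aquila-7b'

# device="cuda" loads the checkpoint straight onto the GPU instead of staging it in host RAM first.
loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True,
    device="cuda")

model = loader.get_model()
tokenizer = loader.get_tokenizer()
model.eval()
model.half()

predictor = Predictor(model, tokenizer)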

@yinguobing

Thanks. After adding device="cuda" to AutoLoader, I now get an error that 24G of VRAM is not enough.

OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.68 GiB total capacity; 22.89 GiB already 
allocated; 21.31 MiB free; 22.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting 
max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The LLaMA-family 7B models run without problems.

@ftgreat
Collaborator

ftgreat commented Jun 13, 2023

You can try clearing the CUDA cache first.
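
For example (a minimal sketch; gc.collect() is optional but releases Python-side references first):

import gc
import torch

gc.collect()              # drop unreferenced Python objects that may still hold GPU tensors
torch.cuda.empty_cache()  # return cached but unused memory blocks to the driver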

@safehumeng

I deployed the test script as a service; GPU memory grows with every call, and after a few calls it hits OOM.

@ftgreat
Collaborator

ftgreat commented Jun 13, 2023

I deployed the test script as a service; GPU memory grows with every call, and after a few calls it hits OOM.

Which version of flagai are you using?
Could you share the service code?

@safehumeng

safehumeng commented Jun 13, 2023

I deployed the test script as a service; GPU memory grows with every call, and after a few calls it hits OOM.

Which version of flagai are you using? Could you share the service code?

@ftgreat
I ran it directly from the repo root, and the branch used is this one.

Service code:

import asyncio
import websockets
import json
import numpy as np
import os
import torch
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor_web import Predictor
from flagai.data.tokenizer import Tokenizer
import bminf

os.environ['CUDA_VISIBLE_DEVICES'] = '1'

state_dict = "./checkpoints_in"
model_name = 'aquila-7b' # 'aquila-33b'

loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True)
model = loader.get_model()
tokenizer = loader.get_tokenizer()

model.eval()
model.half()
model.cuda()

predictor = Predictor(model, tokenizer)

def default_dump(obj):
    """Convert numpy classes to JSON serializable objects."""
    if isinstance(obj, (np.integer, np.floating, np.bool_)):
        return obj.item()
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return obj

async def main_logic(websocket, path):
    data = await websocket.recv()
    request_json = json.loads(data)
    print(request_json)
    query = request_json["prompt"]
    use_stream = request_json["stream"] if "stream" in request_json else False
    max_length = request_json["maxTokens"] if "maxTokens" in request_json else 320
    top_k = request_json["topK"] if "topK" in request_json else 50
    temperature = request_json["temperature"] if "temperature" in request_json else 0.95
    top_p = request_json["topP"] if "topP" in request_json else 0.7
    do_sample = request_json["useRandom"] if "useRandom" in request_json else False
    logprobs = request_json["logprobs"] if "logprobs" in request_json else 0
    with torch.autocast("cuda"):
        g_index = 0
        for re_data in predictor.predict_generate_randomsample(query,
                                                               total_max_length=max_length,
                                                               top_k=top_k,
                                                               top_p=top_p,
                                                               temperature=temperature,
                                                               prompts_tokens=None):
            print(re_data)
            # await websocket.send(json.dumps(re_data, ensure_ascii=False, default=default_dump))
            if "result" in re_data:
                re_data["result"]["index"] = g_index
            # await websocket.send(re_data.lstrip("").rstrip(""))
            if re_data["finish"]:
                await websocket.send(json.dumps(re_data, ensure_ascii=False, default=default_dump))
                break
            else:
                if use_stream and re_data["usage"]["totalTokens"] % 5 == 0 and re_data["usage"]["totalTokens"] >= 20:
                    await websocket.send(json.dumps(re_data, ensure_ascii=False, default=default_dump))
            g_index += 1
    await websocket.send("close")

async def start_server():
    server = await websockets.serve(main_logic, '0.0.0.0', 17862)
    await server.wait_closed()

if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(start_server())
    asyncio.get_event_loop().run_forever()

The return in the referenced predictor method was changed to yield (for streaming).

GPU memory grows by about 1G per call.

@ftgreat
Collaborator

ftgreat commented Jun 13, 2023

no_grad

I think the predict part needs to be wrapped in no_grad; otherwise GPU memory will keep growing.

@safehumeng

no_grad

I think the predict part needs to be wrapped in no_grad; otherwise GPU memory will keep growing.

OK, thanks! I added the @torch.no_grad() decorator to the method and memory no longer grows.
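
For reference, a sketch of what that looks like on a thin wrapper around the streaming call (the wrapper name is my own, not a flagai API):

import torch

@torch.no_grad()  # no autograd graph is built, so activations are freed between requests
def generate_no_grad(predictor, query, **gen_kwargs):
    # Illustrative helper: forwards to the modified streaming predictor
    # while keeping gradient tracking disabled for every yielded chunk.
    yield from predictor.predict_generate_randomsample(query, **gen_kwargs)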

@ftgreat
Collaborator

ftgreat commented Jun 13, 2023

(Quoting the 24G OOM report after adding device="cuda" above.)

You can try flagai 1.7.2: it needs 32G of RAM and 16G of VRAM (covering the model plus one 2048-token sequence).

@yinguobing

(Quoting the 24G OOM report and the flagai 1.7.2 suggestion above.)

Thanks for the reply!

After upgrading to 1.7.2, the RTX 3090 still reports a GPU OOM error.

[2023-06-14 00:46:31,934] [INFO] [logger.py:85:log_dist] [Rank -1] Unsupported bmtrain
******************** lm aquilachat-7b
Traceback (most recent call last):
  File "chat.py", line 10, in <module>
    loader = AutoLoader(
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/auto_model/auto_loader.py", line 216, in __init__
    self.model = getattr(LazyImport(self.model_name[0]),
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/base_model.py", line 184, in from_pretrain
    return load_local(checkpoint_path, only_download_config=only_download_config)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/base_model.py", line 116, in load_local
    model.to(device)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.68 GiB total capacity; 23.22 GiB already allocated; 169.31 MiB free; 23.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Code used:

import os
import torch
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor
from flagai.model.predictor.aquila import aquila_generate

state_dict = "./checkpoints_in"
model_name = 'aquilachat-7b'

loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True,
    device='cuda')

model = loader.get_model()
tokenizer = loader.get_tokenizer()
cache_dir = os.path.join(state_dict, model_name)
model.eval()
model.half()
model.cuda()

predictor = Predictor(model, tokenizer)

text = "北京为什么是中国的首都?"

def pack_obj(text):
    obj = dict()
    obj['id'] = 'demo'

    obj['conversations'] = []
    human = dict()
    human['from'] = 'human'
    human['value'] = text
    obj['conversations'].append(human)
    # dummy bot
    bot = dict()
    bot['from'] = 'gpt'
    bot['value'] = ''
    obj['conversations'].append(bot)

    obj['instruction'] = ''

    return obj

def delete_last_bot_end_singal(convo_obj):
    conversations = convo_obj['conversations']
    assert len(conversations) > 0 and len(conversations) % 2 == 0
    assert conversations[0]['from'] == 'human'

    last_bot = conversations[len(conversations)-1]
    assert last_bot['from'] == 'gpt'

    ## from _add_speaker_and_signal
    END_SIGNAL = "\n"
    len_end_singal = len(END_SIGNAL)
    len_last_bot_value = len(last_bot['value'])
    last_bot['value'] = last_bot['value'][:len_last_bot_value-len_end_singal]
    return

def convo_tokenize(convo_obj, tokenizer):
    chat_desc = convo_obj['chat_desc']
    instruction = convo_obj['instruction']
    conversations = convo_obj['conversations']
            
    # chat_desc
    example = tokenizer.encode_plus(f"{chat_desc}", None, max_length=None)['input_ids']
    EOS_TOKEN = example[-1]
    example = example[:-1] # remove eos
    # instruction
    instruction = tokenizer.encode_plus(f"{instruction}", None, max_length=None)['input_ids']
    instruction = instruction[1:-1] # remove bos & eos
    example += instruction

    for conversation in conversations:
        role = conversation['from']
        content = conversation['value']
        print(f"role {role}, raw content {content}")
        content = tokenizer.encode_plus(f"{content}", None, max_length=None)['input_ids']
        content = content[1:-1] # remove bos & eos
        print(f"role {role}, content {content}")
        example += content
    return example

print('-'*80)
print(f"text is {text}")

from cyg_conversation import default_conversation

conv = default_conversation.copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)

tokens = tokenizer.encode_plus(f"{conv.get_prompt()}", None, max_length=None)['input_ids']
tokens = tokens[1:-1]

with torch.no_grad():
    out = aquila_generate(tokenizer, model, [text], max_gen_len:=200, top_p=0.95, prompts_tokens=[tokens])
    print(f"pred is {out}")

Also, the 1.7.2 version uploaded to PyPI differs from the 1.7.2 on GitHub. The PyPI package raises an error:

Traceback (most recent call last):
  File "chat.py", line 4, in <module>
    from flagai.model.predictor.predictor import Predictor
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/predictor/predictor.py", line 22, in <module>
    from .aquila import aquila_generate
  File "/home/robin/aquila-7b/.env/lib/python3.8/site-packages/flagai/model/predictor/aquila.py", line 6
    def aquila_generate(
    ^
SyntaxError: duplicate argument 'top_k' in function definition

Line 14 of flagai/model/predictor/aquila.py has a duplicated parameter:

def aquila_generate(
        tokenizer,
        model,
        prompts: List[str],
        max_gen_len: int,
        temperature: float = 0.8,
        top_k: int = 30,
        top_p: float = 0.95,
        top_k: int = 30, # duplicated parameter
        prompts_tokens: List[List[int]] = None,
    ) -> List[str]:
    ...
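
Presumably the intended signature simply drops the duplicated parameter, i.e.:

from typing import List

def aquila_generate(
        tokenizer,
        model,
        prompts: List[str],
        max_gen_len: int,
        temperature: float = 0.8,
        top_k: int = 30,
        top_p: float = 0.95,
        prompts_tokens: List[List[int]] = None,
    ) -> List[str]:
    ...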

@ftgreat
Collaborator

ftgreat commented Jun 14, 2023

(Quoting the full 1.7.2 report above: the OOM traceback, the script, and the duplicated top_k parameter in the PyPI package.)

We will publish a fix release today.

@yinguobing

After updating to 1.7.3 and using FP16 precision, it runs successfully on the RTX 3090.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:01:00.0 Off |                  N/A |
|  0%   34C    P8               32W / 350W|  15283MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1955360      C   python3                                   15280MiB |
+---------------------------------------------------------------------------------------+

Using FP16 precision:

loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True,
    fp16=True)

@ftgreat
Collaborator

ftgreat commented Jun 19, 2023

Closing this issue for now; please reopen if the problem persists. Thanks.

@ftgreat ftgreat closed this as completed Jun 19, 2023