too slow? #65
Ubuntu 22, using the 30B model. While a significant amount of this issue appears to be CPU processing, I am watching resource usage and noticing that every time it finishes an answer, it unloads the model from RAM, so for every question I ask it has to read the entire thing from disk into RAM again. On my system this wastes about 15 seconds per question. This portion also seems to scale with the threads you give it: the same attempt with a quarter of the threads took ~25 seconds to move from disk to RAM. Seems to me this could be sped up by keeping the model in RAM? Is that possible, or is there something more complex going on that can't be improved upon? |
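(As a rough sanity check on those numbers: a 4-bit quantized 30B weights file is on the order of 20 GB, and re-reading it in full every question lines up with a pause of roughly 15 seconds on a SATA SSD. Both figures in the sketch below are assumptions; substitute your own file size and disk throughput.)

# Back-of-the-envelope check that the per-question pause matches a full re-read
# of the weights. Both figures are assumptions; adjust them to your setup.
model_size_gb = 20      # approximate size of a 4-bit quantized 30B .bin file
read_speed_gb_s = 1.3   # roughly a SATA SSD's sequential read throughput

print(f"expected reload time: {model_size_gb / read_speed_gb_s:.0f} s")  # ~15 s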
Try this docs URL. Serge - Swagger UI http://localhost:8008/api/docs |
Very easy to use, it works. But yes, very slow for me, too (30B model). edit: Looking at RAM consumption it seems the model is indeed unloaded after every response. |
On alpaca 7B model, it takes about a minute for the answer to start to appear. Win 11, Ryzen 5600G/16GB RAM |
@nsarrazin Hi Nathan, do you think it's possible to load the model into RAM and keep it there as long as user queries are being submitted? |
Yes it's on the list! |
Hey, it's really slow! AMD EPYC, 16 GB RAM, and a lot of delay: more than a minute to load the model into RAM (SSD). |
The problem is not Serge, it's in llama.cpp. Something doesn't work well with Docker: I saw 'npx dalai serve' run and the model responded in 3-5 seconds, whereas with Docker here on my server it took between 18 and 60 minutes to initialize and load into RAM, and 8 minutes for the model to finish its relatively small response. |
I am playing with the 30B model and seeing the same thing. I am running this in Docker on a pretty beefy box but getting pretty slow response times. It might be worth having a variable to keep the model alive for, say, 10 minutes and then shut it down on inactivity (a sketch of that idea follows below). Honestly my server never passes about 14 GB of memory usage, so an option to keep the model loaded as long as the Docker container is running might be cool too. I know you said you are working on it, just wanted to give my feedback :) |
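A minimal sketch of what that inactivity timeout could look like. This is not Serge's actual implementation: the KEEP_ALIVE_SECONDS value is the hypothetical "10 minutes" setting, and the spawn callable is a placeholder for however the llama subprocess gets started.

import asyncio
import time

KEEP_ALIVE_SECONDS = 600  # hypothetical "keep the model alive for 10 minutes" setting


class ModelKeeper:
    """Keep one llama subprocess alive between requests; stop it after a period of inactivity."""

    def __init__(self, spawn):
        self._spawn = spawn      # async callable that starts the llama subprocess
        self._proc = None
        self._last_used = time.monotonic()

    async def acquire(self) -> asyncio.subprocess.Process:
        # Reuse the running process if there is one, so the weights stay in RAM;
        # otherwise pay the load-from-disk cost once.
        if self._proc is None or self._proc.returncode is not None:
            self._proc = await self._spawn()
        self._last_used = time.monotonic()
        return self._proc

    async def reaper(self):
        # Background task: terminate the subprocess once it has been idle too long,
        # freeing the RAM until the next question arrives.
        while True:
            await asyncio.sleep(30)
            idle = time.monotonic() - self._last_used
            if self._proc is not None and self._proc.returncode is None and idle > KEEP_ALIVE_SECONDS:
                self._proc.terminate()
                await self._proc.wait()
                self._proc = None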
Quite unusable to constantly need to load/unload 4 GB in RAM. Not everyone is on an SSD either, so you also have to contend with waiting on disks to load into RAM for every chat submission.
Similar experience. I have an RTX 260 GPU that I can't use for this project, which is a shame; I would much rather use a GPU for this type of task, since even modern-day CPUs struggle with it. Given there's no multi-server support/workers, I can't make use of the Kubernetes deployment beyond single-node compute. |
I installed it outside of Docker and did not get different results from those already mentioned above. What is it? I've seen some people running alpaca 7B where it loads and responds in seconds, yet on my machine, which is a powerful server, even the 7B is extremely slow, even using 6 cores. (I upgraded from an Intel Xeon to an AMD EPYC, which reduced the answer time to 8 minutes and the load time to 18 minutes.) |
I discovered that the problem is in how the new version of llama.cpp is compiled: the flags passed to the compiler are making the software slower, and an older version, like the one used by https://github.com/nomic-ai/gpt4all, is faster. |
@voarsh2 you mentioned the default threads are 4. Where is this located? Is there a way to change it? |
On the homepage there's "model settings" where you can change the number of threads |
I see, thanks @voarsh2. Let me know if you figure out a way to make it more performant on your CPUs since I have a few servers w/ similar CPUs as the ones you mentioned (Xeon E5-2600 v2 series). |
Haha, sure, hopefully the maintainer can work it out from this issue. I'm genuinely curious how the maintainer got the response speed shown in his gif/demo. It's not like anyone in this issue is trying to run it on a Raspberry Pi, lol; people are using EPYC and Ryzen CPUs. I am going to try dalai next, maybe with better luck on a different codebase.
You might help the maintainer by providing some specifics on these optimisations and your proof? |
you can try to read the thread |
Maybe this helps? I asked ChatGPT to refactor the code so the subprocess remains open rather than being started on each request.
import subprocess, os
import asyncio
import logging

from serge.models.chat import Chat, ChatParameters

logger = logging.getLogger(__name__)


async def generate(
    prompt: str,
    params: ChatParameters,
    procLlama: asyncio.subprocess.Process,
    CHUNK_SIZE: int,
):
    await params.fetch_all_links()

    args = (
        "llama",
        "--model",
        "/usr/src/app/weights/" + params.model + ".bin",
        "--prompt",
        prompt,
        "--n_predict",
        str(params.max_length),
        "--temp",
        str(params.temperature),
        "--top_k",
        str(params.top_k),
        "--top_p",
        str(params.top_p),
        "--repeat_last_n",
        str(params.repeat_last_n),
        "--repeat_penalty",
        str(params.repeat_penalty),
        "--ctx_size",
        str(params.context_window),
        "--threads",
        str(params.n_threads),
        "--n_parts",
        "1",
    )

    logger.debug("Calling LLaMa with arguments %s", args)

    # Write the arguments to the long-lived subprocess instead of spawning a new one
    procLlama.stdin.write("\n".join(args).encode() + b"\n")
    await procLlama.stdin.drain()

    while True:
        chunk = await procLlama.stdout.read(CHUNK_SIZE)

        if not chunk:
            return_code = await procLlama.wait()
            if return_code != 0:
                error_output = await procLlama.stderr.read()
                logger.error(error_output.decode("utf-8"))
                raise ValueError(f"RETURN CODE {return_code}\n\n" + error_output.decode("utf-8"))
            # No more output and a clean exit: stop streaming
            return

        try:
            chunk = chunk.decode("utf-8")
        except UnicodeDecodeError:
            continue

        yield chunk


async def get_full_prompt_from_chat(chat: Chat, simple_prompt: str, procLlama: asyncio.subprocess.Process):
    await chat.fetch_all_links()
    await chat.parameters.fetch_link(ChatParameters.init_prompt)

    prompt = chat.parameters.init_prompt + "\n\n"

    if chat.questions is not None:
        for question in chat.questions:
            if question.error is not None:  # skip errored-out prompts
                continue
            prompt += "### Instruction:\n" + question.question + "\n"
            prompt += "### Response:\n" + question.answer + "\n"

    prompt += "### Instruction:\n" + simple_prompt + "\n"
    prompt += "### Response:\n"

    procLlama.stdin.write(prompt.encode() + b"\n")
    await procLlama.stdin.drain()

    return prompt


async def main():
    CHUNK_SIZE = 4

    # Start the llama subprocess once and reuse it for every prompt
    procLlama = await asyncio.create_subprocess_exec(
        "llama",
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )

    prompt = "hello"
    params = ChatParameters()
    async for chunk in generate(prompt, params, procLlama, CHUNK_SIZE):
        print(chunk)

    prompt = "world"
    async for chunk in generate(prompt, params, procLlama, CHUNK_SIZE):
        print(chunk)

    procLlama.stdin.write(b"quit\n")
    await procLlama.stdin.drain()
    await procLlama.wait()


if __name__ == "__main__":
    asyncio.run(main())

I don't even know if this is the issue, because I know nothing about Python and I just surfed the code. |
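One caveat with the refactor above: as far as I can tell, the stock llama.cpp binary parses its options from the command line only, so writing a newline-separated argument list to stdin (as generate() does) would just be treated as prompt text. An approach closer to what llama.cpp actually supports would be to start it once in interactive mode and push each new question through stdin. A minimal sketch of that idea follows; the binary name, model path, and flags are assumptions about your particular build (check its --help output).

import asyncio


async def start_llama_interactive(model_path: str, n_threads: int = 4) -> asyncio.subprocess.Process:
    # Launch llama.cpp once in interactive mode so the weights are loaded a single
    # time and stay in RAM between questions. Flags are assumptions; verify them
    # against your build of llama.cpp.
    return await asyncio.create_subprocess_exec(
        "llama",
        "--model", model_path,
        "--interactive-first",
        "--reverse-prompt", "### Instruction:",
        "--threads", str(n_threads),
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )


async def ask(proc: asyncio.subprocess.Process, question: str, stop: str = "### Instruction:") -> str:
    # Send one question and read output until the reverse prompt shows up again,
    # which is when llama.cpp pauses and waits for the next input.
    proc.stdin.write((question + "\n").encode())
    await proc.stdin.drain()
    answer = b""
    while stop.encode() not in answer:
        chunk = await proc.stdout.read(64)
        if not chunk:
            break  # the process exited
        answer += chunk
    return answer.decode("utf-8", errors="ignore").split(stop)[0]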
There's a new PR that was just merged in: ggerganov/llama.cpp#613. I was able to compile this on my servers (with Xeon E5-2600 v2 series CPUs) and have it work quite well. Is there any way we can get the latest version of llama.cpp into Serge? It might solve all the performance issues. Just have to make sure you use the script. More context here: ggerganov/llama.cpp#638 (comment) |
I'm pleased to report that as of the latest commit (cf84d0c) the performance is much better, at least on my CPUs, which were impossibly slow before. cc @voarsh2, one thing to note is that by default it uses 4 threads. I've increased that to the max number on my machines (32 in my particular case), and I started with By the way, is there a way to get the default threads to be set according to the number of threads available? |
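On that last question, defaulting the thread count to whatever the machine exposes is a one-liner; where exactly this default would be wired up in Serge is an assumption here, the point is just os.cpu_count().

import os

# Default to however many CPU threads the machine reports, falling back to the
# current hard-coded value of 4 if the count cannot be determined.
DEFAULT_N_THREADS = os.cpu_count() or 4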
Hmmm. GPT4ALL is MUCH faster 👁️ Still rough around the edges. 🗡️ But it does seem like cf84d0c and other upstream changes have helped a bit. Once it's done the RAM bit it's somewhat bearable (more performance tuning can't hurt); the major thing for me is still the loading/unloading in memory for every submission. |
What is your hardware? How long does it take to answer you? I'm running on an AMD EPYC VPS with 6 cores and 16 GB of RAM and it seems to respond after 3 minutes. |
"...giving it 32 cores of CPU (32 x Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz (2 Sockets)) and on a 48 x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (2 Sockets) - I have a Ryzen 5x (12 x AMD Ryzen 5 3600X 6-Core Processor (1 Socket)) system ..." "I have an RTX 260 GPU that I can't use for this project... a shame." |
From what I understand, your machine is not virtualized, right? Well, how long does the model take to answer this question: "could you explain to me in detail how bitcoin works? I would like a technical article in a language for laymen." Could you test it for me? |
4 threads, using the 13B model, took about 6 minutes to show any text, excluding the initial read from disk since I had sent a chat before (it still loads/unloads lots of memory, but without reading from disk). About 2 minutes to print out this incomplete text. 7B-Native took about a minute to start printing and finished after two minutes or so. Ran again and it took about 3 minutes to start printing and another 2 minutes to finish. |
it took 16 minutes here with 6 threads to generate this text: "Surely, Bitcoins work through cryptography and blockchain technology that enables them to be transferred from one user's wallet to another without any middleman or central authority involved. It is an open-source software which can run on anyone's computer hardware with a high degree of security as it uses peer-to-peer networking, making the transactions transparent and verifiable for all users in real time through distributed ledger technology (DLT)." |
Just noticed the "VPS" bit - I don't know if it's dedicated, that they dedicate cores or not... check the fine print, they may be throttling you in some way. This type of compute requires no trickery from the provider to get full speed needed. Also, IOPS. Naturally I am expecting you to have faster RAM than me, and I am on HDD, and even loading the model is faster than you with 2013 RAM (DDR3) and HDD...... (even worse, I am on Ceph, which is only realistically giving me 80 MB/s..... at best..... (distributed storage, 1Gbe limited, not native SAS/SATA speed, and a distributed storage system not known for... performance, but reliability and data safety). I wouldn't run this on Cloud compute unless you're paying a pretty buck to ensure you're getting the resources with no visible or invisible "tuning" from the provider (as they often overprovision, naturally)
I am running machine -> Debian/Proxmox -> VM -> Kubernetes/Docker (so the host will have other deployments running as well on K8) with Cephfs (I wouldn't normally use Cephfs, but the template used RWX) backed storage for the workload) |
We can take this discussion to Discord |
I'm using it under Windows 11 with alpaca 7B.
OK, it's great overall, but I have a native C++ version (chat.exe) and it runs twice as fast as your Docker version.
Also, how do I use the API? I saw in the Docker logs something like 127.0.0.1:35272 - "GET /chat/5fe89704-c7ca-4a67-9ec2-f267689b0ffe/question?prompt=No%2C+it%27s+actually+14 HTTP/1.1" 200 OK
But where should I look for proper API documentation?
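Based on the request visible in that Docker log and the Swagger UI mentioned earlier (http://localhost:8008/api/docs), a rough sketch of driving it from Python might look like the following. The chat id is the one from the log line above, and whether the public route carries an /api prefix is an assumption; confirm the exact paths and parameters in the Swagger UI rather than trusting this sketch.

import requests

BASE = "http://localhost:8008/api"  # assumes the default Serge port and an /api prefix
chat_id = "5fe89704-c7ca-4a67-9ec2-f267689b0ffe"  # an existing chat id, as seen in the log line above

# Mirror the logged request: GET /chat/{id}/question?prompt=...
resp = requests.get(f"{BASE}/chat/{chat_id}/question", params={"prompt": "No, it's actually 14"})
resp.raise_for_status()
print(resp.text)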