
too slow? #65

Closed
manageseverin opened this issue Mar 25, 2023 · 31 comments


@manageseverin

I'm using it under Windows 11 with alpaca 7B.
OK, it's great overall, but I have a native cpp version (chat.exe) and it runs twice as fast as your Docker version.
Also, how do I use the API? I saw in the Docker logs something like 127.0.0.1:35272 - "GET /chat/5fe89704-c7ca-4a67-9ec2-f267689b0ffe/question?prompt=No%2C+it%27s+actually+14 HTTP/1.1" 200 OK
But where can I find proper API documentation?

@mokahless

Ubuntu 22, using the 30B model.

While a significant amount of this issue appears to be CPU processing time, I am watching resource usage and noticing that every time it finishes an answer, it unloads the model from RAM, so for every question I ask it has to read the entire model from disk into RAM again. On my system this wastes about 15 seconds per question. This portion also seems to scale with the threads you give it: the same attempt with a quarter of the threads took ~25 seconds to move from disk to RAM.

It seems to me this could be sped up by keeping the model in RAM? Is that possible, or is there something more complex going on that can't be improved upon?
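
One hedged stopgap, not a Serge feature: keep the weights file warm in the OS page cache so each reload at least comes from memory instead of disk. The filename below is an assumption (the Docker image keeps weights under /usr/src/app/weights/ per the code posted later in this thread), and this only avoids the disk I/O; the process still re-parses the weights every time.

# keep_warm.py - re-read the model file periodically so the OS page cache keeps it resident
import time

WEIGHTS = "/usr/src/app/weights/ggml-alpaca-7B-q4_0.bin"  # hypothetical filename, adjust to your setup

def touch_cache(path: str, chunk_size: int = 64 * 1024 * 1024) -> None:
    """Read the whole file in large chunks, pulling it into the page cache."""
    with open(path, "rb") as f:
        while f.read(chunk_size):
            pass

if __name__ == "__main__":
    while True:
        touch_cache(WEIGHTS)
        time.sleep(300)  # refresh every 5 minutes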

@y12studio

y12studio commented Mar 26, 2023

@manageseverin

Try this docs URL - Serge Swagger UI: http://localhost:8008/api/docs
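
For a quick sanity check against the endpoint visible in the Docker log above, something like the following may work. The route is copied from that log line and the port from the Swagger URL, so treat both (and whether an /api prefix is needed) as assumptions and confirm the exact schema in the Swagger UI.

import requests

BASE = "http://localhost:8008"  # port taken from the Swagger UI link above
chat_id = "5fe89704-c7ca-4a67-9ec2-f267689b0ffe"  # example id from the logged request

# mirrors the logged request: GET /chat/{chat_id}/question?prompt=...
resp = requests.get(
    f"{BASE}/chat/{chat_id}/question",
    params={"prompt": "No, it's actually 14"},
    timeout=600,  # generation can take minutes on CPU
)
resp.raise_for_status()
print(resp.text)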


@cyberius0

cyberius0 commented Mar 27, 2023

Very easy to use, and it works. But yes, it's very slow for me too (30B model).
It takes several minutes before words begin to appear after I enter a prompt, but then the words come out one after another fairly quickly.

edit: Looking at RAM consumption it seems the model is indeed unloaded after every response.
System: Windows 10 64-bit, Intel Core i7-8700K @ 3.70GHz, 32.0GB Dual-Channel DDR4 @ 1600MHz

@manageseverin
Author

With the alpaca 7B model, it takes about a minute for the answer to start appearing. Win 11, Ryzen 5600G/16GB RAM.
I thought it depended on the size of the history it needs to feed back, but no - even the first time (when there is no history) it takes the same amount of time.

@magicmars35

@nsarrazin Hi Nathan, do you think it's possible to load the model into RAM and keep it there for as long as user queries are being submitted?

@nsarrazin
Member

Yes it's on the list!

@alph4b3th

Hey, it's really slow! AMD EPYC, 16GB RAM, and a lot of delay - more than a minute to load the model into RAM (SSD).

@alph4b3th

wtf? I have a powerful server! how heavy is this?

@futurepr0n

I also have fairly beefy specs - I find it slow to run, but I'm guessing it's due to the issue pointed out about offloading the model. Keeping it persistent would work best, but I could see the reason for not wanting that being that it's tricky to manage multiple sessions if you implement it right now? I don't know - maybe we could have a superuser option to keep it loaded full time if we have the RAM available?

@alph4b3th

The problem is not Serge, it's in llama.cpp. Something doesn't work well with Docker: I saw 'npx dalai serve' run and the model responded in 3-5 seconds. With Docker here on my server, it took between 18-60 minutes to initialize and load into RAM, and 8 minutes for the model to finish its relatively small response.

@Mattssn

Mattssn commented Mar 29, 2023


I am playing with the 30B model and seeing the same thing. I am running this in Docker on a pretty beefy box but getting pretty slow response times. Maybe there could be a variable to keep the model alive for 10 minutes and then shut it down after inactivity; tbh, my server never really passes about 14GB of memory usage. An option to keep it loaded as long as the Docker container is running might be cool too. I know you said you are working on it, just wanted to give my feedback :)
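
A rough sketch of that keep-alive idea (not Serge code; the class, timeout, and reset logic are invented for illustration): hold one llama subprocess open and only terminate it after a stretch of inactivity.

import asyncio
from typing import Optional

IDLE_TIMEOUT = 600  # seconds of inactivity before unloading the model

class ModelKeeper:
    def __init__(self) -> None:
        self.proc: Optional[asyncio.subprocess.Process] = None
        self._idle_task: Optional[asyncio.Task] = None

    async def acquire(self) -> asyncio.subprocess.Process:
        # (Re)start the llama process only if it is not already running
        if self.proc is None or self.proc.returncode is not None:
            self.proc = await asyncio.create_subprocess_exec(
                "llama",
                stdin=asyncio.subprocess.PIPE,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
        self._reset_idle_timer()
        return self.proc

    def _reset_idle_timer(self) -> None:
        # every request pushes the shutdown deadline back by IDLE_TIMEOUT
        if self._idle_task is not None:
            self._idle_task.cancel()
        self._idle_task = asyncio.create_task(self._unload_after_idle())

    async def _unload_after_idle(self) -> None:
        await asyncio.sleep(IDLE_TIMEOUT)
        if self.proc is not None and self.proc.returncode is None:
            self.proc.terminate()  # free the RAM once nobody has asked anything for a while
            await self.proc.wait()
        self.proc = None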

@voarsh2

voarsh2 commented Mar 29, 2023

It's quite unusable to constantly need to load/unload 4GB in RAM. Not everyone is on an SSD either, so you also have to contend with waiting on disks to load into RAM for every chat submission.

I installed it outside of Docker and did not get different results from those already mentioned above. What is going on? I've seen some people running alpaca 7B where it loads and responds in seconds, yet on my machine, which is a powerful server, even the 7B is extremely slow while consuming 6 cores (I upgraded from Intel Xeon to AMD EPYC, which reduced the answer time to 8 minutes and the load time to 18 minutes).

Similar experience.
I've tried 7B and 30B. I've fed it 50GB of RAM and 32 CPU cores (32 x Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 2 sockets), and also tried a 48 x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (2 sockets) machine - it always takes several minutes to load the model back into memory after each response. Realistically, it takes a good 5-10 minutes between each simple response, and even longer for multi-line responses. I have a Ryzen 5 system (12 x AMD Ryzen 5 3600X 6-Core Processor, 1 socket) that doesn't meet the RAM requirement (it's a Kubernetes node with existing workloads taking up the available RAM), so I won't bother trying - and the dual E5-2697 v2 system is on par with the Ryzen system anyway. I am not sure why the default is 4 threads; increasing the threads for the model doesn't seem to make responses any faster, but it grinds the CPU to starvation...

I have an RTX 260 GPU that I can't use for this project... a shame. I would much rather use a GPU for this type of task; even modern-day CPUs struggle with this project.

Given there's no multi-server support/worker logic, I can't make use of the Kubernetes deployment beyond single-node compute.

--- edit:
I did test on the AMD Ryzen 5 3600X 6-core - not much faster by any means (though faster at printing the text out word by word 🤷), even though it's a 2019 CPU vs 2013 CPUs (they are comparable).

@alph4b3th

I installed it outside of Docker and did not get different results from those already mentioned above. What is going on? I've seen some people running alpaca 7B where it loads and responds in seconds, yet on my machine, which is a powerful server, even the 7B is extremely slow while consuming 6 cores (I upgraded from Intel Xeon to AMD EPYC, which reduced the answer time to 8 minutes and the load time to 18 minutes).

@alph4b3th

I discovered that the problem is in how the new version of llama.cpp is compiled: the flags passed to the compiler are making the software slower. An older version like https://github.com/nomic-ai/gpt4all works faster because it doesn't have these optimizations.

@johncadengo
Contributor

@voarsh2 you mentioned the default threads are 4. Where is this located? Is there a way to change it?

@voarsh2

voarsh2 commented Mar 30, 2023

@voarsh2 you mentioned the default threads are 4. Where is this located? Is there a way to change it?

On the homepage there's "model settings" where you can change the number of threads

@johncadengo
Contributor

I see, thanks @voarsh2. Let me know if you figure out a way to make it more performant on your CPUs since I have a few servers w/ similar CPUs as the ones you mentioned (Xeon E5-2600 v2 series).

@voarsh2

voarsh2 commented Mar 30, 2023

I see, thanks @voarsh2. Let me know if you figure out a way to make it more performant on your CPUs since I have a few servers w/ similar CPUs as the ones you mentioned (Xeon E5-2600 v2 series).

Haha, sure, hopefully the maintainer can work it out from this issue. I'm genuinely curious how the maintainer got the response speed shown in his gif/demo... It's not like anyone in this issue is trying to run it on a Raspberry Pi, lol. People are using EPYC and Ryzen CPUs...

I am going to try dalai next - maybe I'll have better luck with a different codebase.

I discovered that the problem is in how the new version of llama.cpp is compiled: the flags passed to the compiler are making the software slower. An older version like https://github.com/nomic-ai/gpt4all works faster because it doesn't have these optimizations.

You might help the maintainer by providing some specifics on these optimisations, along with your evidence?

@alph4b3th

You can try reading the thread.

@psociety

Maybe this helps? I asked ChatGPT to refactor the code so the subprocess remains open rather than being started on each request.

api/src/serge/utils/generate.py:

import subprocess, os
from serge.models.chat import Chat, ChatParameters
import asyncio
import logging

logger = logging.getLogger(__name__)

async def generate(
    prompt: str,
    params: ChatParameters,
    procLlama: asyncio.subprocess.Process,
    CHUNK_SIZE: int
):
    await params.fetch_all_links()

    args = (
        "llama",
        "--model",
        "/usr/src/app/weights/" + params.model + ".bin",
        "--prompt",
        prompt,
        "--n_predict",
        str(params.max_length),
        "--temp",
        str(params.temperature),
        "--top_k",
        str(params.top_k),
        "--top_p",
        str(params.top_p),
        "--repeat_last_n",
        str(params.repeat_last_n),
        "--repeat_penalty",
        str(params.repeat_penalty),
        "--ctx_size",
        str(params.context_window),
        "--threads",
        str(params.n_threads),
        "--n_parts",
        "1",
    )

    logger.debug("Calling LLaMa with arguments: %s", args)
    
    procLlama.stdin.write('\n'.join(args).encode() + b'\n')
    await procLlama.stdin.drain()
    
    while True:
        chunk = await procLlama.stdout.read(CHUNK_SIZE)

        if not chunk:
            return_code = await procLlama.wait()

            if return_code != 0:
                error_output = await procLlama.stderr.read()
                logger.error(error_output.decode("utf-8"))
                raise ValueError(f"RETURN CODE {return_code}\n\n" + error_output.decode("utf-8"))

            # process exited cleanly: stop the generator instead of looping on empty reads
            return

        try:
            chunk = chunk.decode("utf-8")
        except UnicodeDecodeError:
            continue

        yield chunk


async def get_full_prompt_from_chat(chat: Chat, simple_prompt: str, procLlama: asyncio.subprocess.Process):
    await chat.fetch_all_links()
    
    await chat.parameters.fetch_link(ChatParameters.init_prompt)

    prompt = chat.parameters.init_prompt + "\n\n"
    
    if chat.questions is not None:
        for question in chat.questions:
            if question.error is not None:  # skip errored-out prompts
                continue
            prompt += "### Instruction:\n" + question.question + "\n"
            prompt += "### Response:\n" + question.answer + "\n"

    prompt += "### Instruction:\n" + simple_prompt + "\n"
    prompt += "### Response:\n"

    procLlama.stdin.write(prompt.encode() + b'\n')
    await procLlama.stdin.drain()

    return prompt


async def main():
    CHUNK_SIZE = 4
    procLlama = await asyncio.create_subprocess_exec(
        "llama",
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )

    prompt = "hello"
    params = ChatParameters()

    async for chunk in generate(prompt, params, procLlama, CHUNK_SIZE):
        print(chunk)

    prompt = "world"
    async for chunk in generate(prompt, params, procLlama, CHUNK_SIZE):
        print(chunk)

    procLlama.stdin.write(b"quit\n")
    await procLlama.stdin.drain()
    await procLlama.wait()

if __name__ == "__main__":
    asyncio.run(main())

I don't even know if this is the issue, because I know nothing about Python and just skimmed the code.

@johncadengo
Contributor

johncadengo commented Apr 1, 2023

There's a new PR that was just merged in: ggerganov/llama.cpp#613

I was able to compile this on my servers (with Xeon E5-2600 v2 series CPUs) and have it work quite well. Is there any way we can get the latest version of llama.cpp in Serge? Might solve all the performance issues. Just have to make sure you use the script migrate-ggml-2023-03-30-pr613.py to migrate the models to work with the new file format.

More context here: ggerganov/llama.cpp#638 (comment)
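
If it helps anyone upgrading, here is a hedged sketch of batch-converting existing weights with that script before moving to the newer llama.cpp; the two-argument (input path, output path) usage and the output naming are assumptions, so check the script's --help first.

import glob
import subprocess

for old in glob.glob("/usr/src/app/weights/*.bin"):
    new = old.replace(".bin", "-migrated.bin")  # hypothetical naming for the converted file
    subprocess.run(
        ["python3", "migrate-ggml-2023-03-30-pr613.py", old, new],
        check=True,  # stop on the first file that fails to convert
    )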

@johncadengo
Contributor

Looks like the latest PR incorporated updates for the new change to llama.cpp: #118

I'll try it out today and let you know if it helps @voarsh2

@johncadengo
Contributor

I'm pleased to report that as of the latest commit (cf84d0c) the performance is much better, at least on my CPUs, which were impossibly slow before.

cc @voarsh2, one thing to note is that by default it uses 4 threads. I've increased that to the max number on my machines (32 in my particular case), and I started with GPT4All as the model, since it's a much smaller and more performant model. I'm getting great results with this test.

By the way, is there a way to get the default threads to be set according to the number of threads available?
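
On deriving the thread default from the host, a minimal sketch (not Serge's current behaviour); inside Docker the CPUs actually assigned to the container are usually the more honest number than the raw core count.

import os

def default_threads() -> int:
    try:
        # CPUs actually available to this process (respects taskset/cgroup cpusets on Linux)
        return max(1, len(os.sched_getaffinity(0)))
    except AttributeError:
        # sched_getaffinity is not available on macOS/Windows
        return max(1, os.cpu_count() or 1)

print(default_threads())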

@voarsh2

voarsh2 commented Apr 4, 2023

cc @voarsh2, one thing to note is that by default it uses 4 threads. I've increased that to the max number on my machines (32 in my particular case), and I started with GPT4All as the model, since it's a much smaller and more performant model. I'm getting great results with this test.

Hmmm. GPT4All is MUCH faster... 👁️
It's the same size as 7B, though...

Still rough around the edges. 🗡️
Hoping it gets faster or gains GPU support. Wish consumer GPUs had more VRAM, lol.

But it does seem like cf84d0c and other upstream changes have helped a bit. Once the model is loaded into RAM it's somewhat bearable (more performance tuning can't hurt) - the major thing for me is still the loading/unloading of the model in memory for every submission.

@alph4b3th

I'm pleased to report that as of the latest commit (cf84d0c) the performance is much better, at least on my CPUs, which were impossibly slow before.

cc @voarsh2, one thing to note is that by default it uses 4 threads. I've increased that to the max number on my machines (32 in my particular case), and I started with GPT4All as the model, since it's a much smaller and more performant model. I'm getting great results with this test.

By the way, is there a way to get the default threads to be set according to the number of threads available?

What is your hardware? How long does it take to answer you? I'm running on an AMD EPYC VPS with 6 cores and 16GB of RAM, and it seems to respond after 3 minutes.

@voarsh2

voarsh2 commented Apr 4, 2023

What is your hardware? How long does it take to answer you? I'm running on an AMD EPYC VPS with 6 cores and 16GB of RAM, and it seems to respond after 3 minutes.

"...giving it 32 cores of CPU (32 x Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz (2 Sockets)) and on a 48 x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (2 Sockets) - I have a Ryzen 5x (12 x AMD Ryzen 5 3600X 6-Core Processor (1 Socket)) system ..."

"I have an RTX 260 GPU that I can't use for this project... a shame."

@alph4b3th

What is your hardware? How long does it take to answer you? I'm running on an AMD EPYC VPS with 6 cores and 16GB of RAM, and it seems to respond after 3 minutes.

"...giving it 32 cores of CPU (32 x Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz (2 Sockets)) and on a 48 x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (2 Sockets) - I have a Ryzen 5x (12 x AMD Ryzen 5 3600X 6-Core Processor (1 Socket)) system ..."

"I have an RTX 260 GPU that I can't use for this project... a shame."

From what I understand, your machine is not virtualized, right? Well, how long does the model take to answer this question: "could you explain to me in detail how bitcoin works? I would like a technical article in a language for laymen." Could you test it for me?

@voarsh2

voarsh2 commented Apr 4, 2023

could you explain to me in detail how bitcoin works? I would like a technical article in a language for laymen.

With 4 threads, the 13B model took about 6 minutes to show any text. That excludes the initial read from disk, since I had sent a chat before (it still loads/unloads a lot of memory, just without reading from disk...).

About 2 minutes to print out this incomplete text:
"Bitcoins are digital currency that can be used as payment online or offline, just like cash and credit cards today. Bitcoins use peer-to-peer technology to operate with no central authority; managing transactions and the issuing of bitcoins is carried out collectively by the network.Bit"

7B-Native took about a minute to start printing and finished after two minutes or so.
"Bitcoin is an innovative digital currency that uses cryptography to secure and verify transactions, creating what is known as a blockchain distributed ledger system. It operates through decentralized networks of computers which are constantly verifying the chain of past transactions in order to maintain accuracy and security. The network also continuously creates new blocks or “coins” when users send funds from one address to another. This allows for digital payments without any middleman, making it a truly peer-to-peer system with no central authority controlling its operations."

Ran again and it took about 3 minutes to start printing. Another 2 mins to finish.

@alph4b3th

could you explain to me in detail how bitcoin works? I would like a technical article in a language for laymen.

With 4 threads, the 13B model took about 6 minutes to show any text. That excludes the initial read from disk, since I had sent a chat before (it still loads/unloads a lot of memory, just without reading from disk...).

About 2 minutes to print out this incomplete text: "Bitcoins are digital currency that can be used as payment online or offline, just like cash and credit cards today. Bitcoins use peer-to-peer technology to operate with no central authority; managing transactions and the issuing of bitcoins is carried out collectively by the network.Bit"

it took 16 minutes here with 6 threads to generate this text: "Surely, Bitcoins work through cryptography and blockchain technology that enables them to be transferred from one user's wallet to another without any middleman or central authority involved. It is an open-source software which can run on anyone's computer hardware with a high degree of security as it uses peer-to-peer networking, making the transactions transparent and verifiable for all users in real time through distributed ledger technology (DLT)."

@voarsh2

voarsh2 commented Apr 4, 2023

What is your hardware? How long does it take to answer you? I'm running on an AMD EPYC VPS with 6 cores and 16GB of RAM, and it seems to respond after 3 minutes.

Just noticed the "VPS" bit - I don't know whether they dedicate cores to you or not... check the fine print, as they may be throttling you in some way. This type of compute needs the provider to apply no trickery for you to get the full speed required. Also consider IOPS. Naturally I'd expect you to have faster RAM than me: I'm on 2013-era DDR3 and HDD, and even loading the model is faster for me. Worse still, I'm on Ceph, which realistically gives me about 80 MB/s at best (distributed storage, limited by 1GbE rather than native SAS/SATA speed, and a system known for reliability and data safety rather than performance). I wouldn't run this on cloud compute unless you're paying a pretty penny to ensure you're getting the resources with no visible or invisible "tuning" from the provider (they often overprovision, naturally).

from what I understand, your machine is not virtualized. Right?

I am running: physical machine -> Debian/Proxmox -> VM -> Kubernetes/Docker (so the host has other deployments running on K8s as well), with CephFS-backed storage for the workload (I wouldn't normally use CephFS, but the template used RWX).

@gaby
Member

gaby commented May 14, 2023

We can take this discussion to Discord

@gaby gaby closed this as completed May 14, 2023