Update to latest llama.cpp #118
Conversation
LGTM, just two comments
@@ -0,0 +1,305 @@
+# Migrate ggml file(s) with ggmf magic to ggml file with ggjt magic
Can't we just `cp` this file in the Dockerfile after the `git clone`?
The file is different: I modified the original script so it can be called as a function, and it shuffles files around differently from the original so it works better with serge. I hope the upstream content doesn't change too often 😅
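For context, a minimal sketch of the kind of check such a migration script performs; the helper names are illustrative, not the actual serge code (the magic values themselves are the real llama.cpp constants):

```python
import struct

GGMF_MAGIC = 0x67676D66  # "ggmf": old format with a separate version field
GGJT_MAGIC = 0x67676A74  # "ggjt": new format with mmap-friendly tensor layout

def read_magic(path: str) -> int:
    """Return the little-endian uint32 magic at the start of a ggml model file."""
    with open(path, "rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))
    return magic

def needs_migration(path: str) -> bool:
    """True when the file still carries the old ggmf magic."""
    return read_magic(path) == GGMF_MAGIC
```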
@@ -124,7 +124,7 @@ async def event_generator():
         prompt=full_prompt,
         params=chat.parameters,
     ):
-        await asyncio.sleep(0.1)
+        await asyncio.sleep(0.01)
What's the purpose of these sleeps?
At best we generate a token once every 100 ms on fast machines, so there's no point in polling the program's output buffer more often than that. The sleep was there to keep the infinite loop from hogging resources by running constantly.

What I realized was that we read the buffer in 4-byte chunks every 0.1 s, so we were fetching at most (1/0.1) * 4 = 40 bytes per second. Usually that's enough, but the initial prompt is echoed back much faster than that, and we were slowing it down for no reason. It was bad design on my side :/

The symptom was that you would see CPU activity decrease, yet the answer would still take a while to appear in the chat: it was fully generated and just being read slowly from the output buffer. 🤦 Now, with a 0.01 s sleep and a 64-byte chunk size (a ceiling of 64 / 0.01 = 6400 bytes per second), I don't expect we'll have a problem haha
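To make the arithmetic concrete, here is a rough sketch of the kind of polling loop being described; the structure and names are illustrative, not the actual event generator code:

```python
import asyncio

CHUNK_SIZE = 64       # bytes read per iteration (was 4)
POLL_INTERVAL = 0.01  # seconds slept between reads (was 0.1)

async def stream_output(stdout: asyncio.StreamReader):
    """Drain a subprocess output buffer in small chunks.

    The throughput ceiling is CHUNK_SIZE / POLL_INTERVAL bytes per second:
    4 / 0.1 = 40 B/s before this change, 64 / 0.01 = 6400 B/s after.
    """
    while True:
        chunk = await stdout.read(CHUNK_SIZE)
        if not chunk:
            break  # process closed its stdout
        yield chunk.decode(errors="ignore")
        # Sleep so the loop doesn't spin constantly; tokens arrive at
        # roughly 10/s at best, so finer-grained polling buys nothing.
        await asyncio.sleep(POLL_INTERVAL)
```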
The new llama.cpp version has breaking changes that require a conversion script. This PR adds the conversion script and updates the version used in the Dockerfile.