
Update to latest llama.cpp #118

Merged — 3 commits merged into main from feat/update_to_latest_llama_cpp on Mar 31, 2023
Conversation

nsarrazin (Member):

The new version has breaking changes that require a conversion script. This PR adds the conversion script and updates the llama.cpp version used in the Dockerfile.

@gaby (Member) left a comment:
LGTM, just two comments

@@ -0,0 +1,305 @@
# Migrate ggml file(s) with ggmf magic to ggml file with ggjt magic
gaby (Member):

Can't we just CP this file in the Dockerfile after the git clone?

nsarrazin (Member, Author):

The file is different: I modified the original script so it can be called as a function, and it also shuffles files around differently from the original so it works better with serge. I hope the upstream content doesn't change too often 😅
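
For illustration only, a minimal sketch of what a conversion script that is "callable as a function" might look like; the `convert_ggmf_to_ggjt` name and the body are hypothetical placeholders, and the actual script in this PR rewrites the model file's magic and layout rather than copying bytes.

```python
# Hypothetical sketch, not the script added in this PR: a conversion script
# structured so it can be imported and called as a function, or run as a CLI.
import sys
from pathlib import Path


def convert_ggmf_to_ggjt(input_path: Path, output_path: Path) -> Path:
    """Stand-in for the real migration (ggmf magic -> ggjt magic)."""
    # The real script rewrites the model header/magic and tensor layout;
    # here we only copy the bytes to illustrate the function-call structure.
    output_path.write_bytes(input_path.read_bytes())
    return output_path


if __name__ == "__main__":
    convert_ggmf_to_ggjt(Path(sys.argv[1]), Path(sys.argv[2]))
```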

@@ -124,7 +124,7 @@ async def event_generator():
prompt=full_prompt,
params=chat.parameters,
):
await asyncio.sleep(0.1)
await asyncio.sleep(0.01)
gaby (Member):

What's the purpose of these sleeps?

nsarrazin (Member, Author):

We generate a token about once every 100 ms at best on good machines, so there's no point in checking the program's output buffer more often than that. The sleep was there to prevent the infinite polling loop from hogging resources by running constantly.

What I realized is that we had a chunk size of 4 bytes and checked the buffer every 0.1 s, so we were fetching at most (1 / 0.1) × 4 = 40 bytes per second. Usually that's enough, but when the initial prompt is loaded the output arrives much faster than that, and we were slowing things down there for no reason. That was bad design on my side :/

The symptom was that you would see CPU activity drop, but it would still take a while for the answer to appear in the chat: the answer was fully generated and was just being read slowly from the output buffer. 🤦 With a sleep of 0.01 s and a chunk size of 64, I don't expect we'll have a problem haha
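
For illustration, a minimal sketch of the kind of polling read loop being discussed, using the chunk size and sleep numbers from the comment above; the names (`read_output`, `proc`) and the `echo` subprocess are placeholders, not the actual serge code.

```python
import asyncio

# Numbers from the discussion: old vs. new polling settings.
OLD_CHUNK_SIZE, OLD_SLEEP = 4, 0.1     # at most (1 / 0.1) * 4   = 40 bytes/s
NEW_CHUNK_SIZE, NEW_SLEEP = 64, 0.01   # at most (1 / 0.01) * 64 = 6400 bytes/s


async def read_output(proc: asyncio.subprocess.Process, chunk_size: int, sleep: float):
    """Poll the subprocess stdout in fixed-size chunks, yielding each chunk."""
    while True:
        chunk = await proc.stdout.read(chunk_size)
        if not chunk:  # EOF: the process closed its stdout
            break
        yield chunk
        # Sleep so the loop doesn't spin constantly; this also caps throughput
        # at roughly chunk_size / sleep bytes per second.
        await asyncio.sleep(sleep)


async def main():
    proc = await asyncio.create_subprocess_exec(
        "echo", "hello from the output buffer",
        stdout=asyncio.subprocess.PIPE,
    )
    async for chunk in read_output(proc, NEW_CHUNK_SIZE, NEW_SLEEP):
        print(chunk.decode(), end="")
    await proc.wait()


asyncio.run(main())
```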

nsarrazin merged commit b806c5a into main on Mar 31, 2023
nsarrazin deleted the feat/update_to_latest_llama_cpp branch on Mar 31, 2023 at 18:42
johncadengo mentioned this pull request on Apr 3, 2023