Ability for ./main to keep the model in memory and pass it more text
#23
Comments
I made a fork (https://github.com/j-f1/forked-llama.cpp/tree/swift) that’s focused around working as a library rather than a standalone program. I don’t know how hard it would be to bridge that to Python but you might find some of the changes useful for writing a C++ command line program you can talk to via the command line. |
agreed, a chat mode would be a lot better. The prompts this model generates are very bizarre lmao |
Modifying Interfacing this with the outside world will take some more effort. |
One thing I've done previously is to drop in a single-file HTTP server, like this one, and then make an HTTP API. (Optionally also a single-file JSON parser/serializer, like this one, so that you can make the API JSON-based.) It's a little silly compared with building as a library or adding Python bindings, but it's cross-platform and very easy to get working (~60 lines, or a little more if you want to stream the results rather than sending them all when generation is done).

example server:

#include <cstdio>
#include <exception>
#include <string>
#include "httplib.h"
using namespace httplib;

#define PORT 8080

// Placeholder for the model call: takes the request body, returns the reply.
std::string get_response(const std::string &message) {
    // call actual API here
    return "got message: " + message;
}

int main(void) {
    Server svr;
    if (!svr.is_valid()) {
        printf("server setup failed\n");
        return -1;
    }

    // GET / just tells the caller where the API lives.
    svr.Get("/", [=](const Request & /*req*/, Response &res) {
        res.set_content("POST api is listening on /api\n", "text/plain");
    });

    // POST /api reads the whole request body and passes it to get_response().
    svr.Post("/api",
        [&](const Request &req, Response &res, const ContentReader &content_reader) {
            if (req.is_multipart_form_data()) {
                res.set_content("Server does not support multipart form data", "text/html");
                res.status = 500;
                return;
            }
            std::string body;
            content_reader([&](const char *data, size_t data_length) {
                body.append(data, data_length);
                return true;
            });
            // if it's JSON, change the content type to application/json
            res.set_content(get_response(body), "text/plain");
        });

    // Turn any exception thrown by a handler into an HTTP 500 response.
    svr.set_exception_handler([](const Request & /*req*/, Response &res, std::exception_ptr ep) {
        auto fmt = "<h1>Error 500</h1><p>%s</p>";
        char buf[BUFSIZ];
        try {
            std::rethrow_exception(ep);
        } catch (const std::exception &e) {
            snprintf(buf, sizeof(buf), fmt, e.what());
        } catch (...) {
            snprintf(buf, sizeof(buf), fmt, "Unknown Exception");
        }
        res.set_content(buf, "text/html");
        res.status = 500;
    });

    printf("starting server on port %d\n", PORT);
    svr.listen("localhost", PORT);
    return 0;
} |
Would be awesome, because this would allow pre-prompting and spawning interactive sessions like tinygrad's LLaMa personalities (demo video). |
well he doesn't want any deps so that's why the interfacing is the hard part, otherwise it's pretty ez part.
will be most of the work, at least from there it's easy enough to ghetto hack in anyones own test bed to input stuff into stdin for the program |
Assuming nobody cares about windows it would be possible to allocate the giant buffer required for the model with shm_open to retain the loaded model in memory between executions. That way you could still faff about with the executable/parameters as long as they don't impact how you load the llama model. |
I'm working on adding a sort of interactive mode over in a fork, where (if a flag, It currently looks like this: Would you be interested in a PR once I'm done with some further cleanup and testing? I'm still planning to put the colouring behind another flag, and find a solution for some papercuts (among others, it seems that spaces tend to be part of a single token with words that follow them, so you have to use |
@blackhole89 Edit: if you cannot make it run on Windows, you can Edit2: regarding spaces - the tokenizer is currently broken. This is probably causing a lot of trouble in such cases, and also when Unicode characters are present in the input. |
I unfortunately don't have access to a Windows machine to test it on right now. Is there a problem with the availability of signal.h/sigaction there? Either way, at least the "reverse prompt" triggered interaction should work even without the signal handler. Unfortunate about the tokenizer. I guess I will leave the problem untouched for now, hoping for an eventual resolution :) I think I got it to work to approximately the standard I was hoping for (few lingering issues: probably better to communicate limitations such as that the reverse prompt must be token-exact; subsequent user inputs are not counted towards the context size limit), so I'll go ahead and make a PR. |
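To illustrate why the reverse prompt has to be token-exact, here is a minimal sketch of how such a check could work. It assumes the generated text is tracked as a list of integer token ids; the helper name `ends_with_reverse_prompt` is hypothetical and is not taken from the fork or from llama.cpp.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from llama.cpp): returns true when the most
// recently generated token ids end with exactly the reverse-prompt token ids.
// The comparison is on token ids, not on text, so a reverse prompt that
// tokenizes differently (for example with a leading space merged into the
// first word) will never match.
bool ends_with_reverse_prompt(const std::vector<int> &generated,
                              const std::vector<int> &reverse_prompt) {
    if (reverse_prompt.empty() || generated.size() < reverse_prompt.size()) {
        return false;
    }
    const std::size_t offset = generated.size() - reverse_prompt.size();
    for (std::size_t i = 0; i < reverse_prompt.size(); ++i) {
        if (generated[offset + i] != reverse_prompt[i]) {
            return false;
        }
    }
    return true;
}
```

A generation loop would call this after each new token and hand control back to the user when it returns true.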
It would be useful to include a short section in the README with instructions on how to use interactive mode, plus a screenshot |
I added a section (and made the PR). Not sure in hindsight if it's the best possible example, since it doesn't show the usage of \ to submit multiple lines... |
I realised that the slight imprecision in calling London the largest city in (all of) Europe actually biased the entire generation in a less factual direction, so here's another screenshot that doesn't have that issue and also shows off '\' 🙄. Might be good to replace it... I also found that having the high repeat_penalty tended to make it abort the conversation early (rather than repeat User:/Bob:), so the corresponding invocation also had edit:
No idea why the color resets after "Moscow", needs some investigation... |
@blackhole89 Will update the README with the new example |
Great, thanks! I also figured out what was going wrong with the color getting cancelled early; I can make a quick PR for that shortly, though it might be necessary to sit down and think a little bit about code quality (as I'm picking locations in the code to emit the colorcodes that happen to work but aren't particularly related to the lifecycle of tokens as they migrate from input -> embd_inp -> embd -> output). I found that few-shotting it carefully makes a big difference. I would actually recommend adding more than one example interaction into the prompt, but have been lazy about it because my machine isn't doing too well with 13B (probably at around 500ms per token, and spending a considerable amount of time even to do the processing for the user-provided prompt/input tokens - can this be optimised somehow?). |
Yes - this is a very cool task actually. |
@simonw I am working on a fork with some python bindings here: https://github.com/thomasantony/llama.cpp/tree/feature/pybind |
This has been working well for me, using a response from ChatGPT that I shortened by a couple sentences to save more of the context space (oddly, while this works well on my M1 iMac, it gives very wrong looking results on Ubuntu 22.10 using Amd64): ./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 512 --repeat_penalty 1.0 --color -i -r "User:" HAL: Hello. I am HAL. How may I help you today? User: What’s the history of bullfighting in Spain? HAL: Bullfighting, also known as "tauromachia," has a long and storied history in Spain, with roots that can be traced back to ancient civilizations. The sport is believed to have originated in 7th-century BCE Iberian Peninsula as a form of animal worship, and it evolved over time to become a sport and form of entertainment. Bullfighting as it is known today became popular in Spain in the 17th and 18th centuries. During this time, the sport was heavily influenced by the traditions of medieval jousts and was performed by nobles and other members of the upper classes. Over time, bullfighting became more democratized and was performed by people from all walks of life. Bullfighting reached the height of its popularity in the 19th and early 20th centuries and was considered a national symbol of Spain. User:" |
The way to solve this problem is to use mmap(2) which on Windows is called CreateFileMapping() + MapViewOfFileEx() which are available since Vista. I believe using mmap will reduce startup latency to effectively zero. In order to do that, we need to refactor the codebase so that data structures can be directly mappable, without needing to be loaded and constructed. The issue where everyone is encouraged to help us design that is #91. |
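As a rough illustration of the mmap idea, the sketch below maps a file read-only so its bytes can be used in place instead of being read and copied at startup. It is POSIX-only (the Windows equivalent mentioned above is not shown), and `MappedFile`, `map_file_readonly` and `unmap_file` are hypothetical names, not part of llama.cpp.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical read-only view of a weights file; not a llama.cpp type.
struct MappedFile {
    const void *data = nullptr;  // start of the mapping, nullptr on failure
    std::size_t size = 0;        // length of the mapping in bytes
};

// Map the whole file read-only. The kernel pages data in lazily, and pages
// already in the page cache from an earlier run do not need to be read again.
MappedFile map_file_readonly(const char *path) {
    MappedFile out;
    const int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return out;
    }
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return out;
    }
    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void *addr = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) {
        return out;
    }
    out.data = addr;
    out.size = size;
    return out;
}

// Release the mapping and reset the view to its empty state.
void unmap_file(MappedFile &m) {
    if (m.data != nullptr) {
        munmap(const_cast<void *>(m.data), m.size);
    }
    m = MappedFile{};
}
```

This only pays off if the on-disk layout can be used directly, for example by casting `data` to `const float *`, which is why the comment above calls for making the data structures directly mappable.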
Removing the help wanted tag because it's here. That doesn't mean you can't still participate in helping me do it! |
#278 implements a way to do this, I've also added example shell scripts that you can use to spawn llama.cpp in server mode and open multiple clients to it (limitation is how many CPU threads you have to process generation in parallel). |
I wouldn't go as far as to assume: Maybe the project should split into non-portable and portable forks, since there are already a lot of PRs which decrease portability. There are already the tcp_server and mmap branches, neither of which is exactly portable. I made the suggestion before that, instead of stabbing the main code with non-portable stuff, implementing a light C API to handle saving/loading the model and its state in/from/to memory would allow using whatever platform- or library-dependent solution on top of it, instead of integrating deeply into the main program.
Originally posted by @anzz1 in #278 (comment)

@tarruda's counterpoint was that the current code is not thread-safe, but why exactly is spawning more processes in a non-portable manner ( fork() ) better than just implementing thread-safety? The performance penalty that comes with it could simply be put behind #ifdef THREAD_SAFE , instead of ending up with a main program full of #ifdefs with whatever platform-specific implementations for the other options to do it.

If you want only the model to be preloaded in memory and not the context, and want the ability to serve multiple input/output streams concurrently, you could simply spawn new threads with their separate states and only have them share the pointer to the preloaded model. If the preloaded model is read-only , you don't even need to implement any thread-safety at all. Thread-safety would only be needed for sharing the context. I don't understand why fork() is needed to accomplish this?

Something like this? In the case the context doesn't need to be shared, only the preloaded model. No need for fork() or anything other than just threads (which work on every platform). No need for thread-safety, since all the concurrent accesses are just read operations. The context isn't shared between threads, only the model is.

Now if the context was to be shared, you would need to either copy the context or implement thread-safe access to it, both of which come with their caveats: thread-safety sacrifices speed, while copying increases memory usage and the context would only be shared up to the point it was copied. But both the fork() implementation and mmap() share exactly the same caveats anyway. Why is the overhead of spawning a new process with fork() or the increased complexity of mmap() needed here at all? Please enlighten me if there is something I'm completely missing here, as it seems we're trying to use a chainsaw to cut a matchstick.

With this sort of generic implementation, any non-portable implementation could just use the preload_model, create_context, free_context, serve_instance, evaluate functions however they saw fit, and it could be done outside the main implementation, keeping the main program lean, clean and fast. The stdin and stdout could be piped to wherever the new implementation requires, be it an HTTP server, some node.js implementation, a file on the file system, or even a hardware device like a serial port.

Since all the operations are atomic in regards to each other, you could even load multiple models to memory if you wanted. Create threads or don't create threads, whatever. Load multiple models to memory at the same time and run a single evaluation for each one of them and compare the results. All without introducing any increased complexity or lessened portability to the code.

With the generic implementation using multiple processes could also be done if that is needed. Just have a module which shares the context and model structs using whatever means you want to, save to disk, mmap to memory, whatever. The whole point is that you could do anything you want with it, without any of it having to go inside the main program but could live as its own standalone module instead. |
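The thread-per-session idea described above can be sketched roughly as follows. This is not the code from the original comment: the `Model` and `Context` structs and all function bodies are toy stand-ins, and only the function names `preload_model`, `create_context` and `evaluate` echo the ones mentioned in the thread (`serve_instances` is a hypothetical driver). The point it shows is that threads may read one shared, immutable model without locks as long as each thread writes only to its own context.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the loaded model: never modified after loading.
struct Model {
    std::vector<float> weights;
};

// Hypothetical per-session state: owned by exactly one thread.
struct Context {
    const Model *model;       // shared, read-only
    std::vector<int> tokens;  // private to this context
};

Model preload_model(std::size_t n_weights) {
    return Model{std::vector<float>(n_weights, 0.5f)};
}

Context create_context(const Model &model) {
    return Context{&model, {}};
}

// Toy "evaluation": reads the shared weights, writes only to its own context.
float evaluate(Context &ctx, int token) {
    ctx.tokens.push_back(token);
    float sum = 0.0f;
    for (float w : ctx.model->weights) {
        sum += w;
    }
    return sum * static_cast<float>(ctx.tokens.size());
}

// Run n_threads independent sessions against one preloaded model.
std::vector<float> serve_instances(const Model &model, int n_threads) {
    std::vector<float> results(n_threads, 0.0f);
    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i) {
        workers.emplace_back([&model, &results, i] {
            Context ctx = create_context(model);  // per-thread context
            results[i] = evaluate(ctx, i);        // each thread writes its own slot
        });
    }
    for (std::thread &t : workers) {
        t.join();
    }
    return results;
}
```

Sharing a context between threads would need either a copy per thread or synchronization around it, which is the trade-off the comment describes.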
Since a few years ago Microsoft has embraced Linux, and installing Ubuntu on Windows has never been easier with WSL2. You can literally go to the app store and click a button to get a Linux shell that is fully integrated into Windows. That means you can easily run the Linux version inside windows (the tcp_server branch) and consume from a socket in a native win32 app.
I can't give an opinion there since I'm a machine learning newbie and still don't understand how inference is done in language models. But if In any case I can't be of much help to write a cross-platform solution, the only times I wrote C threading/network code that works across Unix/Win32 was using libuv, which is not an option since dependencies are not wanted in this project. Maybe C++ supports cross-platform threads/networking as part of its standard, but that is also outside of my comfort zone. Meanwhile I will just keep rebasing the |
The problem with WSL2 is not about simplicity, it's about performance. While great improvements in speed have been made recently, a virtualization layer simply can never be as efficient as native code is. "Just emulate Linux" is not an answer when the goal is to have native cross-platform compatibility. You can see from the ggml code that it uses the atomic_* functions for thread-safety between threads using a single context. The work has already been done by @ggerganov . It would be a shame to undo this cross-platform compatibility, don't you think? Just by using separate contexts for the additional "main" threads which can serve I/O concurrently should work. The beauty of using C structs in favor of STL also means that anything that is read-only is also inherently thread-safe since all the read operations are atomic at the bare metal level. Currently though the fork() could be replaced with this:
This is essentially what fork() does anyway, but instead of duplicating the whole process (and incurring a performance penalty), only the model would be duplicated instead. I looked through your tcp_server.cpp and it's simple and succinct, I could easily port it for winsock for you and you wouldn't need to worry about it. No external libraries required. So the problem isn't your network/socket code at all, it really doesn't even need much changes at all to work with winsock. The problems are these: edit: |
Fork is very lightweight and almost as efficient as spawning new threads in modern Unixes, I suggest
I agree with this, check this discussion from a couple of days ago: 5c19c70#r105107843 I added the PosixStream class because there's no standard API for wrapping a file descriptor into a C++ stream. Tried to use some non-standard solutions but resulted in failed Mac compilation. |
I checked out the discussion and I agree that the replacement of STL functions with their standard C runtime counterparts is a good idea, especially the part of using standard Since nowadays you can achieve +5GB/s sequential read speeds with NVMe SSD's, the read functions are now the bottleneck and not the disk speeds unlike just a few short years ago. The less abstraction overhead there is between the raw kernel I/O syscalls and the consumer program, the better. There is the effect though that by making the read functions fast enough the less benefit overall will be achieved with preloading the model to memory. Like said, modern SSD's are already very fast and they are only getting faster. About The problem with it was the non-portability of it, as the copy-on-write functionality it uses under the hood to clone the process could be used for just cloning the state and not the whole process, making it portable. However the points I made in earlier are pretty much outdated now since the C API was introduced. I'm not calling any shots here, but I wouldn't oppose having less-portable solutions under the examples, like having a "examples/linux", "examples/mac", "examples/windows" , "examples/android" style folders since there can be valid cases where portability isn't an option. Or have non-portable examples in their own repos, importing the main llama.cpp repo as a git submodule. I think that would be the cleanest solution overall, especially if the solutions need to span over multiple files and thus would clutter up the main repo. The whole point I tried to make was about keeping the main functionality sleek and portable, which it is, and there is now a simple way of interfacing with llama through the C API which anything can be built upon. 
You're obviously free to do whatever you wish, but I have a suggestion: since the ability to do exactly what is needed here, sharing the model and its state, is on the current short-term roadmap , what do you think of the idea of putting the tcp_server functionality on ice for now until that feature is finished? The thing is that I'm also very interested in the tcp_server functionality. I think it has great promise for developing integrations in whatever language anyone chooses, because binding a C/C++ module might not be easy in every programming language / development environment, but the ability to connect to a TCP socket is implemented in pretty much everything out there. Using the shared state after it's completed and threads instead of fork() , it could be made easily portable. It could be done in a single file. I would also be interested in working with you on that. You said that you work pretty much exclusively on Linux and aren't too familiar with winsock. I am the opposite side of that coin, working mostly with Windows and very familiar with winsock. So joining our forces, I'm certain we could make it work very well. You would take care of the Linux socket side, while I could implement the same in winsock. Put together, that results in a portable solution. Food for thought? |
This indicates a real misunderstanding of how virtualization works. Modern Windows is virtualized. WSL2 and Windows itself are both virtualized. The reason why people want WIN32 is because they want to use the MSVC debugger. |
@anzz1 I'm also interested in the tcp_server functionality, which is why I'm rebasing and using it on my private fork (Not doing any updates though, so you can consider it "frozen" for now). Not sure if you followed the discussion in #278, but I'm no longer trying to get it merged since there's no interest in having the stdio abstraction which is required to avoid duplicating IO loop logic. You're free to take the code from my branch and adapt it to use threads and the upcoming model sharing feature. There's nothing special about that code, you'd simply replace If you are serious about implementing a cross-platform tcp server functionality, I highly recommend using an existing win32/unix abstraction like libuv. It is a waste of effort redoing functionality that already exists in lightweight C libraries with very permissive licenses. Just add a libuv dependency and use their API, and you will have an efficient network server that works seamless across win32 and unix. |
Or just use POSIX and Cygwin / Mingw / Cosmo / etc. all have you covered. |
Yeah I followed it when it was current but that discussion is pretty much outdated now. Funny how it feels like ancient history even though it's just been a few days. Nature of everything AI related, haha.
I'll take a look at libuv, though for now I don't think it's necessary to use any library for a simple tcp server. There isn't really much to implement the way I'm currently imagining it. Maybe I'll run into an issue which makes me change my mind. Generally I dislike using libraries, but certainly not all of them are bad. I am especially a fan of single-header C/C++ libraries like the awesome stb libraries by nothings. In any case I'll start working on it once the model state sharing change is implemented, as the environment to work with becomes more clear.
I'm not going to get into an argument about the semantics of what is and isn't virtualization, as I specifically said virtualization and not emulation. If you are talking about how the Windows kernel is a hypervisor translating syscalls to low-level I/O and the OS itself runs on top of it, yeah I guess you can call Windows virtualized. And sure, it would be theoretically possible to optimize the code in such a way that a I am not saying WSL2 is bad performance wise, nowadays it's actually quite amazing that it comes within spitting distance of running Linux natively. For example, in this benchmark you can see WSL2 achieve 86% performance in Linpack with 5800X. It still has some way to go though, and it isn't fully POSIX compliant yet. But the way things are going, it's probably only going to improve. Unlike everything else Windows, it's one thing which seems to consistently move in the right direction while the native OS itself is going down the drain, lul. In any case, I don't want 86% performance, I want 100%. You cannot tell me to settle for lower performance, simple as that. Might be fine for many, and maybe I value performance and optimization too much in someone else's opinion, but you can do you and I can do me , right? Cygwin / MSYS2 aren't even considered in this conversation. Their performance is god awful. They can be useful for many applications, but for anything where performance is a requirement they are completely out of the question. |
The discussion here seems to have veered waaaay off topic. It would also be great to be able to reset the prompt without having to reload the model. |
Second the above comment. But really I'd like the option to perform lots of potential new generations with the model in-memory. E.g., tweak temperature, parameters, etc. without re-loading the model each time. |
Any progress on this? Ability to reset the prompt / context without reloading the whole process/model? |
I made some progress using the The core idea is to add an if statement after the user input part llama.cpp/examples/main/main.cpp Line 804 in 381ee19
like Inside the if statement, remember to
You can also update other parameters in the first step. It should work, but I haven't tested it. |
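One way to picture the "if statement after the user input" idea is the sketch below. It is not the commenter's patch: the `Session` struct, the `/reset` and `/temp` commands and the helper name are all hypothetical, and the field names `embd` and `n_past` are only borrowed from the thread's vocabulary. It shows a check that clears the context or changes a sampling parameter while the loaded model stays in memory.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical session state; not llama.cpp's actual data structures.
struct Session {
    std::vector<int> embd;     // tokens currently held in the context
    int n_past = 0;            // number of tokens already evaluated
    float temperature = 0.8f;  // example sampling parameter
};

// Returns true when the line was a control command rather than prompt text.
// The model itself is untouched, so nothing needs to be reloaded.
bool handle_control_line(Session &s, const std::string &line) {
    if (line == "/reset") {
        s.embd.clear();  // forget the conversation so far
        s.n_past = 0;    // next evaluation starts from an empty context
        return true;
    }
    const std::string prefix = "/temp ";
    if (line.compare(0, prefix.size(), prefix) == 0) {
        try {
            s.temperature = std::stof(line.substr(prefix.size()));
        } catch (const std::exception &) {
            // malformed number: keep the previous temperature
        }
        return true;
    }
    return false;  // ordinary input: caller tokenizes and evaluates it
}
```

The interactive loop would call this on each line of user input and only tokenize the line when it returns false.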
This issue was closed because it has been inactive for 14 days since being marked as stale. |
an issue isn't magically solved just because people stop posting on it |
And do you have a better solution to offer ? |
Hi, are there any plans to implement prompt/context resetting? I have a text summarization task involving 100,000's of distinct, unrelated prompts. Currently, the model has to be reloaded for each prompt, adding significant overhead. It would be nice to have a CLI argument such as |
The ./main program currently outputs text and then quits. How hard would it be to add a mode where it could stay running and be ready to accept more text piped to standard input?
This could help avoid the overhead of loading the model again every time the script runs.
Maybe it could output the generated text followed by a marker of some sort when it's done, so a wrapping process could see when it's finished and available to send a new prompt for evaluation.
I'm interested in wrapping it in a tiny Python web server to give myself a UI for interacting with the model.