Proposal: Horizontally-Scalable LLM Requests #232
Comments
Sounds like an interesting concept. I've thought about this a bit as well. How would you suggest turning one user request into multiple LLM requests? I'm not sure there's a general way to do that.
One way to do it is to teach the LLM that it is allowed to write TODO comments, or for some languages like Kotlin, you can call a TODO() function. This would allow the LLM to defer work if it's not sure it can do it all at once. Then if you detect it emitted such a function call, you go back and ask it to fix that specific TODO, but you can potentially do them in parallel if the expression is being used inside typed functions.
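For illustration, a rough Python sketch of the detection side of this idea, assuming a hypothetical async complete() helper for the follow-up requests (none of these names come from aider itself):

import asyncio
import re

# Matches "# TODO: ..." / "// TODO: ..." comments and Kotlin-style TODO("...") calls.
TODO_PATTERN = re.compile(r'(?://|#)\s*TODO:?\s*(.+)|TODO\("([^"]+)"\)')

async def complete(prompt: str) -> str:
    """Hypothetical async LLM call; swap in the real client here."""
    raise NotImplementedError

def find_deferred_work(llm_output: str) -> list[str]:
    """Collect the description of every piece of work the model chose to defer."""
    return [comment or call for comment, call in TODO_PATTERN.findall(llm_output)]

async def resolve_deferred_work(llm_output: str) -> list[str]:
    """Ask the model to fill in each deferred TODO, potentially in parallel."""
    follow_ups = [
        complete(f"Implement the code described by this TODO: {todo}")
        for todo in find_deferred_work(llm_output)
    ]
    return await asyncio.gather(*follow_ups)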
Hi @achristianson @paul-gauthier, LiteLLM now allows you to queue requests. It's built to solve the problem in this issue (would love your feedback if not). Here's a quick start on using it. Compatible with GPT-4, llama (mentioned in this thread).
Quick Start
REDIS_HOST="my-redis-endpoint"
REDIS_PORT="my-redis-port"
REDIS_PASSWORD="my-redis-password" # [OPTIONAL] if self-hosted
REDIS_USERNAME="default" # [OPTIONAL] if self-hosted
$ litellm --config /path/to/config.yaml --use_queue
Here's an example config for config.yaml (this will load balance between OpenAI + Azure endpoints):
model_list:
- model_name: gpt-3.5-turbo
litellm_params:
model: gpt-3.5-turbo
api_key:
- model_name: gpt-3.5-turbo
litellm_params:
model: azure/chatgpt-v-2 # actual model name
api_key:
api_version: 2023-07-01-preview
api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
$ litellm --test_async --num_requests 100
Available Endpoints
I'm going to close this issue for now, but feel free to add a comment here and I will re-open or file a new issue any time.
The Problem
Currently, aider appears to have a single-threaded approach to handling LLM requests and responses. This has been enough to make aider operate well at a basic level, but it limits what we can do if we're willing to spend more compute to solve our problems. When we ask aider to do a programming task, we can potentially trigger many LLM requests, some of which could execute concurrently.
Proposal: Asynchronous LLM Requests
1. Put LLM requests into async functions
The first step is to make our LLM completion queries async.
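As a minimal illustration (not aider's actual code), an async completion call against an OpenAI-compatible endpoint using httpx; the base URL, model name, and function name are placeholders:

import httpx

API_BASE = "http://localhost:8000/v1"  # placeholder OpenAI-compatible endpoint
MODEL = "gpt-3.5-turbo"

async def acomplete(messages: list[dict], client: httpx.AsyncClient) -> str:
    """Send one chat completion request without blocking the event loop."""
    resp = await client.post(
        f"{API_BASE}/chat/completions",
        json={"model": MODEL, "messages": messages},
        timeout=120.0,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]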
2. Implement limits and checks to manage concurrency
LLMs are expensive to run. Some users will have higher or lower compute budgets.
We can keep this concurrency under control by using queues and workers. Because our workers are simply making HTTP requests, we do not need threads; asyncio is enough. I suggest that we have a queue of LLM requests, and a fixed number of workers.
For instance, if we have an API with a strict limit of 1 request per 10 seconds, we could spawn just one worker and have it sleep for 10 seconds, grab a work item from the queue, and repeat.
If, on the other hand, we have our own GPU cluster running llama with a private openai-compatible API, then we might want to run 8 workers each of which could make 5 requests per second.
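To make the shape of this concrete, here is a minimal asyncio sketch of such a queue with a fixed worker pool; the function and parameter names are illustrative, not an existing aider API. The strict-limit case above corresponds to num_workers=1, min_interval=10.0, and the GPU-cluster case to something like num_workers=8, min_interval=0.2.

import asyncio
from typing import Awaitable, Callable

async def worker(
    queue: asyncio.Queue,
    results: list[str],
    complete_fn: Callable[[str], Awaitable[str]],
    min_interval: float,
) -> None:
    """Pull prompts off the shared queue, sleeping between requests to respect a rate limit."""
    while True:
        prompt = await queue.get()
        try:
            results.append(await complete_fn(prompt))
        finally:
            queue.task_done()
            await asyncio.sleep(min_interval)

async def run_queue(
    prompts: list[str],
    complete_fn: Callable[[str], Awaitable[str]],
    num_workers: int = 1,
    min_interval: float = 10.0,
) -> list[str]:
    """Process every prompt with a fixed pool of asyncio workers."""
    queue: asyncio.Queue = asyncio.Queue()
    for p in prompts:
        queue.put_nowait(p)
    results: list[str] = []
    workers = [
        asyncio.create_task(worker(queue, results, complete_fn, min_interval))
        for _ in range(num_workers)
    ]
    await queue.join()  # returns once every queued request has been processed
    for w in workers:
        w.cancel()
    return results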
2.a. Budgeting
Although we have managed the API-side requirements such as rate limiting, we may still incur higher costs than the user intends to budget for a given task.
To solve this, we could use a feature where we ask for a budget when aider receives a new high-level instruction from the user. Each time a worker processes an LLM task from the queue, the associated cost would be accounted for. In conjunction with this, we would want some kind of cost estimation function and a way to scale the computational intensity that the aider agent applies to the task. Costs could be "scaled down" by using fewer workers, doing fewer review passes, and disabling non-critical computationally-expensive processing steps for that query.
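One possible shape for that per-task accounting, sketched in Python with made-up price parameters (not a real aider feature):

from dataclasses import dataclass

@dataclass
class TaskBudget:
    """Tracks spend for a single high-level user instruction."""
    limit_usd: float
    spent_usd: float = 0.0

    def record(self, prompt_tokens: int, completion_tokens: int,
               usd_per_1k_prompt: float, usd_per_1k_completion: float) -> None:
        """Called by a worker after each LLM request it completes."""
        self.spent_usd += (prompt_tokens / 1000.0) * usd_per_1k_prompt
        self.spent_usd += (completion_tokens / 1000.0) * usd_per_1k_completion

    def remaining_usd(self) -> float:
        return max(self.limit_usd - self.spent_usd, 0.0)

    def intensity(self) -> float:
        """A 0.0-1.0 knob: as it drops, use fewer workers, fewer review passes, etc."""
        return self.remaining_usd() / self.limit_usd if self.limit_usd else 0.0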
The goal would be to have aider output quality scale based on the budget of the user.
Caveats
Rate Limiting
Some providers have strict rate limits on API requests. We don't want to make more requests than our allowed quota.
Handled by #2 above.
Costs and Budgeting
Costs could spiral out of control.
Handled by #2 above.