
Proposal: Horizontally-Scalable LLM Requests #232

Closed
ai-christianson opened this issue Aug 31, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@ai-christianson

The Problem

Currently, aider appears to take a single-threaded approach to handling LLM requests and responses. This has been enough to make aider work well at a basic level, but it limits what we can do when we're willing to spend more compute on a problem.

When we ask aider to do a programming task, we can potentially trigger many LLM requests, some of which could execute concurrently:

  • Create a high-level plan based on the user's query
  • Review the plan and make improvements
  • Search the codebase for relevant files
  • Come up with file management operations such as adding new files, removing files, or moving files
  • Execute unit tests or build steps in order to validate code changes
    • Feed the validation results back into the LLM for feedback
    • Repeat until valid

Proposal: Asynchronous LLM Requests

1. Put LLM requests into async functions

The first step is to make our LLM completion queries async.
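A minimal sketch of what this first step could look like, assuming the openai Python package (v1+) and its async client; the llm_complete name and the prompts are illustrative, not aider's actual code:

import asyncio
import openai

# Minimal async wrapper around a chat completion call (assumes openai>=1.0
# and OPENAI_API_KEY set in the environment).
async def llm_complete(messages, model="gpt-4"):
    client = openai.AsyncOpenAI()
    response = await client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

async def main():
    # Once the calls are async, independent requests can run concurrently.
    plan, files = await asyncio.gather(
        llm_complete([{"role": "user", "content": "Create a high-level plan for the task"}]),
        llm_complete([{"role": "user", "content": "List the repo files relevant to the task"}]),
    )

asyncio.run(main())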

2. Implement limits and checks to manage concurrency

LLMs are expensive to run, and different users will have higher or lower compute budgets.

We can keep this concurrency under control by using queues and workers. Because our workers are simply making HTTP requests, we do not need threads; asyncio is enough. I suggest that we have a queue of LLM requests and a fixed number of workers, as sketched below.

For instance, if we have an API with a strict limit of 1 request per 10 seconds, we could spawn just one worker and have it sleep for 10 seconds, grab a work item from the queue, and repeat.

If, on the other hand, we have our own GPU cluster running llama with a private openai-compatible API, then we might want to run 8 workers each of which could make 5 requests per second.
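A rough sketch of that queue-and-worker shape, reusing llm_complete from the sketch above; the worker count and per-request delay stand in for whatever rate limit the provider actually imposes:

import asyncio

async def worker(queue: asyncio.Queue, delay: float):
    # Each worker pulls one request at a time and sleeps between requests
    # so the overall request rate stays under the provider's quota.
    while True:
        messages, future = await queue.get()
        try:
            future.set_result(await llm_complete(messages))
        except Exception as exc:
            future.set_exception(exc)
        finally:
            queue.task_done()
        await asyncio.sleep(delay)

async def submit(queue: asyncio.Queue, messages):
    # Callers enqueue work and await a future for the response.
    future = asyncio.get_running_loop().create_future()
    await queue.put((messages, future))
    return await future

async def main():
    queue = asyncio.Queue()
    # Strict-API example: one worker, one request every 10 seconds.
    # A private GPU cluster could instead run 8 workers with a much smaller delay.
    workers = [asyncio.create_task(worker(queue, delay=10.0)) for _ in range(1)]
    answer = await submit(queue, [{"role": "user", "content": "Plan the change"}])

asyncio.run(main())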

2.a. Budgeting

Even with the API-side requirements such as rate limiting handled, we may still incur higher costs than the user intends to budget for a given task.

To solve this, aider could ask the user for a budget whenever it receives a new high-level instruction. Each time a worker processes an LLM task from the queue, the associated cost would be counted against that budget. Alongside this, we would want some kind of cost-estimation function and a way to scale the computational intensity that the aider agent applies to the task.

Costs could be "scaled down" by using fewer workers, doing fewer review passes, and disabling non-critical computationally-expensive processing steps for that query.

The goal would be to have aider output quality scale based on the budget of the user.
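One way the accounting could look; the Budget class and the dollar figures are made up for illustration, and a real implementation would need per-model cost estimates:

import asyncio

class Budget:
    # Tracks estimated spend for one high-level instruction and refuses
    # work that would push spend past the user's cap.
    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent_usd = 0.0
        self._lock = asyncio.Lock()

    async def charge(self, estimated_usd: float) -> bool:
        async with self._lock:
            if self.spent_usd + estimated_usd > self.max_usd:
                return False
            self.spent_usd += estimated_usd
            return True

budget = Budget(max_usd=0.50)  # asked from the user when the instruction arrives

A worker would call charge() before running each queued LLM task; when it returns False, the task is skipped or downgraded, which is the "scaling down" described above.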

Caveats

Rate Limiting

Some providers have strict rate limits on API requests. We don't want to make more requests than our allowed quota.

Handled by item 2 of the proposal above.

Costs and Budgeting

Costs could spiral out of control.

Handled by item 2 of the proposal above.

@paul-gauthier
Collaborator

Sounds like an interesting concept. I've thought about this a bit as well.

How would you suggest turning one user request into multiple LLM requests? I'm not sure there's a general way to do that.

@paul-gauthier paul-gauthier added the enhancement New feature or request label Sep 5, 2023
@mikehearn

One way to do it is to teach the LLM that it is allowed to write TODO comments, or, for some languages like Kotlin, to call a TODO("implement me") function (it throws at runtime).

This would allow the LLM to defer work if it's not sure it can do it all at once. Then, if you detect that it emitted such a call, you go back and ask it to fix that specific TODO; those follow-up requests can potentially run in parallel if the expressions are used inside typed functions.
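A rough illustration of that detection-and-fan-out step, reusing llm_complete from the first sketch; the TODO detection here is a naive line scan and the prompt is a placeholder:

import asyncio

async def fill_in_todos(source: str) -> list[str]:
    # Find every TODO the model deferred, then ask it to implement each one;
    # the follow-up requests all run concurrently.
    todos = [line.strip() for line in source.splitlines() if "TODO" in line]
    tasks = [
        llm_complete([{
            "role": "user",
            "content": f"Implement this deferred piece of work: {todo}\n\nContext:\n{source}",
        }])
        for todo in todos
    ]
    return await asyncio.gather(*tasks)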

@ishaan-jaff

ishaan-jaff commented Nov 22, 2023

Hi @achristianson @paul-gauthier, LiteLLM now allows you to queue requests. It's built to solve the problem in this issue (would love your feedback if it doesn't).

Here's a quick start on using it. It's compatible with GPT-4 and llama (both mentioned in this thread).
docs: https://docs.litellm.ai/docs/routing#queuing-beta

Quick Start

  1. Add Redis credentials in a .env file
REDIS_HOST="my-redis-endpoint"
REDIS_PORT="my-redis-port"
REDIS_PASSWORD="my-redis-password" # [OPTIONAL] if self-hosted
REDIS_USERNAME="default" # [OPTIONAL] if self-hosted
  2. Start the litellm server with your model config
$ litellm --config /path/to/config.yaml --use_queue

Here's an example config for gpt-3.5-turbo

config.yaml (This will load balance between OpenAI + Azure endpoints)

model_list: 
  - model_name: gpt-3.5-turbo
    litellm_params: 
      model: gpt-3.5-turbo
      api_key: 
  - model_name: gpt-3.5-turbo
    litellm_params: 
      model: azure/chatgpt-v-2 # actual model name
      api_key: 
      api_version: 2023-07-01-preview
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
  3. Test (in another window): sends 100 simultaneous requests to the queue
$ litellm --test_async --num_requests 100

Available Endpoints

  • /queue/request - Queues a /chat/completions request. Returns a job id.
  • /queue/response/{id} - Returns the status of a job. If completed, returns the response as well. Possible statuses are: queued and finished.
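For reference, polling those endpoints from Python could look roughly like this; the base URL, payload shape, and response field names are assumptions on my part, so check the linked docs for the actual schema:

import time
import requests

BASE_URL = "http://0.0.0.0:8000"  # wherever the litellm proxy is running (assumed)

# Queue a /chat/completions request and get back a job id.
job = requests.post(
    f"{BASE_URL}/queue/request",
    json={
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "hello"}],
    },
).json()

# Poll until the job is finished; "queued" and "finished" are the statuses
# listed above, but the exact field names used here are assumptions.
while True:
    result = requests.get(f"{BASE_URL}/queue/response/{job['id']}").json()
    if result.get("status") == "finished":
        print(result)
        break
    time.sleep(1)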

@paul-gauthier
Collaborator

I'm going to close this issue for now, but feel free to add a comment here and I will re-open or file a new issue any time.
