
Proposal: Horizontally-Scalable LLM Requests #232

Closed
ai-christianson opened this issue Aug 31, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@ai-christianson

The Problem

Currently, aider appears to take a single-threaded approach to handling LLM requests and responses. This has been enough to make aider work well at a basic level, but it limits what we can do when we're willing to spend more compute on a problem.

When we ask aider to do a programming task, we can potentially trigger many LLM requests, some of which could execute concurrently:

  • Create a high-level plan based on the user's query
  • Review the plan and make improvements
  • Search the codebase for relevant files
  • Come up with file management operations such as adding new files, removing files, or moving files
  • Execute unit tests or build steps in order to validate code changes
    • Feed the validation results back into the LLM for feedback
    • Repeat until valid

Proposal: Asynchronous LLM Requests

1. Put LLM requests into async functions

The first step is to make our LLM completion queries async.
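A minimal sketch of what this first step could look like, assuming the openai Python package (v1+) and its async client; the llm_complete name and the prompts are illustrative, not aider's actual code:

import asyncio
import openai

# Minimal async wrapper around a chat completion call (assumes openai>=1.0
# and OPENAI_API_KEY set in the environment).
async def llm_complete(messages, model="gpt-4"):
    client = openai.AsyncOpenAI()
    response = await client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

async def main():
    # Once the calls are async, independent requests can run concurrently.
    plan, files = await asyncio.gather(
        llm_complete([{"role": "user", "content": "Create a high-level plan for the task"}]),
        llm_complete([{"role": "user", "content": "List the repo files relevant to the task"}]),
    )

asyncio.run(main())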

2. Implement limits and checks to manage concurrency

LLMs are expensive to run, and different users will have higher or lower compute budgets.

We can keep this concurrency under control by using queues and workers. Because our workers are simply making HTTP requests, we do not need threads; asyncio is enough. I suggest that we have a queue of LLM requests and a fixed number of workers, as sketched below.

For instance, if we have an API with a strict limit of 1 request per 10 seconds, we could spawn just one worker and have it sleep for 10 seconds, grab a work item from the queue, and repeat.

If, on the other hand, we have our own GPU cluster running llama with a private openai-compatible API, then we might want to run 8 workers each of which could make 5 requests per second.
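A rough sketch of that queue-and-worker shape, reusing llm_complete from the sketch above; the worker count and per-request delay stand in for whatever rate limit the provider actually imposes:

import asyncio

async def worker(queue: asyncio.Queue, delay: float):
    # Each worker pulls one request at a time and sleeps between requests
    # so the overall request rate stays under the provider's quota.
    while True:
        messages, future = await queue.get()
        try:
            future.set_result(await llm_complete(messages))
        except Exception as exc:
            future.set_exception(exc)
        finally:
            queue.task_done()
        await asyncio.sleep(delay)

async def submit(queue: asyncio.Queue, messages):
    # Callers enqueue work and await a future for the response.
    future = asyncio.get_running_loop().create_future()
    await queue.put((messages, future))
    return await future

async def main():
    queue = asyncio.Queue()
    # Strict-API example: one worker, one request every 10 seconds.
    # A private GPU cluster could instead run 8 workers with a much smaller delay.
    workers = [asyncio.create_task(worker(queue, delay=10.0)) for _ in range(1)]
    answer = await submit(queue, [{"role": "user", "content": "Plan the change"}])

asyncio.run(main())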

2.a. Budgeting

Even with the API-side requirements such as rate limiting handled, we may still incur higher costs than the user intends to budget for a given task.

To solve this, aider could ask the user for a budget whenever it receives a new high-level instruction. Each time a worker processes an LLM task from the queue, the associated cost would be counted against that budget. Alongside this, we would want some kind of cost-estimation function and a way to scale the computational intensity that the aider agent applies to the task.

Costs could be "scaled down" by using fewer workers, doing fewer review passes, and disabling non-critical computationally-expensive processing steps for that query.

The goal would be to have aider output quality scale based on the budget of the user.
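One way the accounting could look; the Budget class and the dollar figures are made up for illustration, and a real implementation would need per-model cost estimates:

import asyncio

class Budget:
    # Tracks estimated spend for one high-level instruction and refuses
    # work that would push spend past the user's cap.
    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent_usd = 0.0
        self._lock = asyncio.Lock()

    async def charge(self, estimated_usd: float) -> bool:
        async with self._lock:
            if self.spent_usd + estimated_usd > self.max_usd:
                return False
            self.spent_usd += estimated_usd
            return True

budget = Budget(max_usd=0.50)  # asked from the user when the instruction arrives

A worker would call charge() before running each queued LLM task; when it returns False, the task is skipped or downgraded, which is the "scaling down" described above.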

Caveats

Rate Limiting

Some providers have strict rate limits on API requests. We don't want to make more requests than our allowed quota.

Handled by item 2 of the proposal above.

Costs and Budgeting

Costs could spiral out of control.

Handled by item 2 of the proposal above.

@paul-gauthier
Collaborator

Sounds like an interesting concept. I've thought about this a bit as well.

How would you suggest turning one user request into multiple LLM requests? I'm not sure there's a general way to do that.

@paul-gauthier paul-gauthier added the enhancement New feature or request label Sep 5, 2023
@mikehearn

One way to do it is to teach the LLM that it is allowed to write TODO comments, or, for some languages like Kotlin, to call a TODO("implement me") function (it throws at runtime).

This would allow the LLM to defer work if it's not sure it can do it all at once. Then, if you detect that it emitted such a call, you go back and ask it to fix that specific TODO; those follow-up requests can potentially run in parallel if the expressions are used inside typed functions.
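A rough illustration of that detection-and-fan-out step, reusing llm_complete from the first sketch; the TODO detection here is a naive line scan and the prompt is a placeholder:

import asyncio

async def fill_in_todos(source: str) -> list[str]:
    # Find every TODO the model deferred, then ask it to implement each one;
    # the follow-up requests all run concurrently.
    todos = [line.strip() for line in source.splitlines() if "TODO" in line]
    tasks = [
        llm_complete([{
            "role": "user",
            "content": f"Implement this deferred piece of work: {todo}\n\nContext:\n{source}",
        }])
        for todo in todos
    ]
    return await asyncio.gather(*tasks)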

@ishaan-jaff

ishaan-jaff commented Nov 22, 2023

Hi @achristianson @paul-gauthier, LiteLLM now allows you to queue requests. It's built to solve the problem in this issue (would love your feedback if it doesn't).

Here's a quick start on using it. It's compatible with GPT-4 and llama (both mentioned in this thread).
docs: https://docs.litellm.ai/docs/routing#queuing-beta

Quick Start

  1. Add Redis credentials in a .env file
REDIS_HOST="my-redis-endpoint"
REDIS_PORT="my-redis-port"
REDIS_PASSWORD="my-redis-password" # [OPTIONAL] if self-hosted
REDIS_USERNAME="default" # [OPTIONAL] if self-hosted
  2. Start the litellm server with your model config
$ litellm --config /path/to/config.yaml --use_queue

Here's an example config for gpt-3.5-turbo

config.yaml (This will load balance between OpenAI + Azure endpoints)

model_list: 
  - model_name: gpt-3.5-turbo
    litellm_params: 
      model: gpt-3.5-turbo
      api_key: 
  - model_name: gpt-3.5-turbo
    litellm_params: 
      model: azure/chatgpt-v-2 # actual model name
      api_key: 
      api_version: 2023-07-01-preview
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
  3. Test (in another window): sends 100 simultaneous requests to the queue
$ litellm --test_async --num_requests 100

Available Endpoints

  • /queue/request - Queues a /chat/completions request. Returns a job id.
  • /queue/response/{id} - Returns the status of a job. If completed, returns the response as well. Possible statuses are: queued and finished.
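For reference, polling those endpoints from Python could look roughly like this; the base URL, payload shape, and response field names are assumptions on my part, so check the linked docs for the actual schema:

import time
import requests

BASE_URL = "http://0.0.0.0:8000"  # wherever the litellm proxy is running (assumed)

# Queue a /chat/completions request and get back a job id.
job = requests.post(
    f"{BASE_URL}/queue/request",
    json={
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "hello"}],
    },
).json()

# Poll until the job is finished; "queued" and "finished" are the statuses
# listed above, but the exact field names used here are assumptions.
while True:
    result = requests.get(f"{BASE_URL}/queue/response/{job['id']}").json()
    if result.get("status") == "finished":
        print(result)
        break
    time.sleep(1)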

@paul-gauthier
Collaborator

I'm going to close this issue for now, but feel free to add a comment here and I will re-open or file a new issue any time.
