The Large Language Model Gateway (LLMGW) is an API middleware designed to interface with AI models for chat completions. It provides endpoints to interact with AI models, manage prompts, and retrieve completions, with built-in rate limiting, security headers, and clustering for scalability.
- AI Model Integration: Send prompts to AI models and get responses.
- Scalable Architecture: Utilizes clustering to fork workers across CPU cores for better performance.
- Rate Limiting: Built-in rate limiting to prevent abuse (1000 requests per 15 minutes).
- Security: Uses Helmet for secure HTTP headers.
- CLI Interaction: Support for clearing the console and exiting the server from the command line (see the sketch after this list).
- Verbose Logging: Optional detailed logging of requests and responses.
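The console commands themselves aren't specified in this README; below is a minimal sketch of how the CLI interaction mentioned above might be wired up using Node's built-in `readline` module. The command names `clear` and `exit` are assumptions for illustration.

```js
// Hypothetical sketch of console command handling via Node's readline module;
// the command names "clear" and "exit" are assumptions.
const readline = require('node:readline');

const rl = readline.createInterface({ input: process.stdin });

rl.on('line', (line) => {
  switch (line.trim().toLowerCase()) {
    case 'clear':
      console.clear(); // wipe the console
      break;
    case 'exit':
      console.log('Shutting down...');
      process.exit(0); // stop the server process
      break;
  }
});
```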
| Option | Description | Default |
|---|---|---|
| `--bindip` | IP address to bind the server to | `127.0.0.1` |
| `--bindport` | Port to bind the server to | `42069` |
| `--aihost` | AI model server host | `10.0.0.1` |
| `--aihostport` | AI model server port | `443` |
| `--verbose` | Enable verbose logging | `false` |
To start the server on IP 127.0.0.1, port 5000, connecting to an AI model at localhost:8000 with verbose logging:

```bash
./llmw --bindip 127.0.0.1 --bindport 5000 --aihost localhost --aihostport 8000 --verbose
```
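The README doesn't show how these flags are parsed; below is a minimal sketch using Node's built-in `util.parseArgs` (Node 18+), with option names and defaults taken from the table above. The actual gateway code may use a different parser.

```js
// Hedged sketch: parse the CLI flags with Node's built-in util.parseArgs (Node 18+).
const { parseArgs } = require('node:util');

const { values } = parseArgs({
  options: {
    bindip:     { type: 'string',  default: '127.0.0.1' },
    bindport:   { type: 'string',  default: '42069' },
    aihost:     { type: 'string',  default: '10.0.0.1' },
    aihostport: { type: 'string',  default: '443' },
    verbose:    { type: 'boolean', default: false },
  },
});

console.log(`Binding to ${values.bindip}:${values.bindport}, ` +
            `forwarding to ${values.aihost}:${values.aihostport}, ` +
            `verbose=${values.verbose}`);
```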
- Endpoint: `POST /v1/chat/completions`
- Headers: `Content-Type: application/json`
- Body:

```json
{
  "model": "string",        // (Optional) Model ID, default: "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
  "messages": [
    {
      "role": "string",     // (Required) Role of the message (user/system/assistant)
      "content": "string"   // (Required) Content of the message
    }
  ],
  "max_tokens": 128,        // (Optional) Maximum number of tokens in the response
  "temperature": 0.7        // (Optional) Sampling temperature (0.1 - 1.0)
}
```
- Response (Status 200):

```json
{
  "id": "string",               // Unique ID for the completion
  "object": "chat.completion",  // Response type
  "created": 1636107200,        // Timestamp (Unix epoch)
  "model": "string",            // Model ID used
  "choices": [
    {
      "message": {
        "role": "assistant",    // Role of the responder
        "content": "string"     // AI response content
      },
      "finish_reason": "stop",
      "index": 0
    }
  ],
  "usage": {
    "prompt_tokens": 123,
    "completion_tokens": 45,
    "total_tokens": 168
  }
}
```
- Error (400 Bad Request):

```json
{
  "error": {
    "message": "Temperature must be between 0.1 and 1.0",
    "type": "invalid_request_error"
  }
}
```
- Testing the API Locally: You can use tools like Postman or curl to test the API. Example request using curl:

```bash
curl -X POST http://localhost:42069/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    "messages": [{"role": "user", "content": "Hello AI"}],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
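The same request can be made from Node itself, which ships a global `fetch` as of Node 18 (save as an `.mjs` file so top-level `await` works):

```js
// Equivalent request using Node 18+'s global fetch; run with: node request.mjs
const res = await fetch('http://localhost:42069/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'TheBloke/Mistral-7B-Instruct-v0.2-AWQ',
    messages: [{ role: 'user', content: 'Hello AI' }],
    max_tokens: 100,
    temperature: 0.7,
  }),
});
console.log(JSON.stringify(await res.json(), null, 2));
```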
- Logging: If the `--verbose` flag is enabled, detailed logs appear in the console, showing received inputs and AI responses with ANSI color coding.
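The exact log format isn't documented here; a tiny sketch of what ANSI-coded verbose logging might look like (the color choices and label text are assumptions):

```js
// Hypothetical verbose logger using raw ANSI escape codes; colors are assumptions.
const CYAN = '\x1b[36m';
const GREEN = '\x1b[32m';
const RESET = '\x1b[0m';

function logVerbose(prompt, reply) {
  console.log(`${CYAN}>> prompt:${RESET} ${prompt}`);
  console.log(`${GREEN}<< reply:${RESET} ${reply}`);
}

logVerbose('Hello AI', 'Hello! How can I help?');
```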
The server uses clustering to distribute requests across multiple CPU cores. By default, it forks workers equal to the number of available CPU cores.
If you need to change this behavior, you can modify the clustering logic in the code.
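The clustering code itself isn't reproduced in this README; below is a minimal sketch of the standard Node `cluster` pattern the paragraph describes. The HTTP handler in the worker branch is a placeholder, not the gateway's actual server.

```js
// Standard Node clustering pattern: the primary forks one worker per CPU core.
const cluster = require('node:cluster');
const os = require('node:os');

if (cluster.isPrimary) {
  const workers = os.cpus().length; // change this number to alter the fork count
  for (let i = 0; i < workers; i++) cluster.fork();

  cluster.on('exit', (worker) => {
    console.log(`Worker ${worker.process.pid} exited; forking a replacement`);
    cluster.fork();
  });
} else {
  // Placeholder: each worker would start its own copy of the gateway's HTTP server.
  require('node:http')
    .createServer((req, res) => res.end('ok'))
    .listen(42069, '127.0.0.1');
}
```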
This API uses Helmet to set HTTP headers that help mitigate common attacks such as cross-site scripting (XSS) and clickjacking. It also implements rate limiting to prevent abuse.
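A sketch of how Helmet and the rate limiter are typically wired into an Express app follows. The 1000-requests-per-15-minutes window comes from the feature list above; the `helmet` and `express-rate-limit` packages are the usual choices and are assumed here.

```js
// Hedged sketch: Helmet for secure headers plus express-rate-limit (assumed packages).
const express = require('express');
const helmet = require('helmet');
const rateLimit = require('express-rate-limit');

const app = express();
app.use(helmet()); // secure HTTP headers (mitigates XSS, clickjacking, etc.)
app.use(rateLimit({
  windowMs: 15 * 60 * 1000, // 15-minute window
  max: 1000,                // 1000 requests per IP per window (from the feature list)
}));
```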
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a pull request or create an issue if you encounter any problems.
This README covers installation, configuration, usage, and development guidelines for users and contributors deploying and interacting with the API.