Examples for LLM Use Cases with Langchain and Llamaindex #385
Comments
I'm getting a 404 Not Found error using the code below:

```python
from langchain_openai import OpenAI

llm = OpenAI(
    base_url=f"http://{INGRESS_HOST}:{INGRESS_PORT}/v1/",
    api_key=api_key,
    default_headers={
        "Host": SERVICE_HOSTNAME,
    },
    model="MODEL_NAME",
    temperature=0.8,
    top_p=1,
)

messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]
ai_msg = llm.invoke(messages)
ai_msg
```

KServe version: 0.13.1
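A likely cause of the 404, judging by the path that works later in this thread, is that the OpenAI-compatible endpoints are served under an `/openai/v1` prefix rather than `/v1/`. The sketch below is a hedged rewrite under that assumption; it also switches to `ChatOpenAI`, since the `(role, content)` tuples above are LangChain's chat interface. `INGRESS_HOST`, `INGRESS_PORT`, `SERVICE_HOSTNAME`, `api_key`, and the model name are placeholders carried over from the snippet above.

```python
# Hedged sketch: assumes the KServe OpenAI-compatible route lives under
# /openai/v1 and that the model was registered as "MODEL_NAME".
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url=f"http://{INGRESS_HOST}:{INGRESS_PORT}/openai/v1",
    api_key=api_key,
    default_headers={"Host": SERVICE_HOSTNAME},  # needed when routing through the ingress gateway
    model="MODEL_NAME",
    temperature=0.8,
)

messages = [
    ("system", "You are a helpful assistant that translates English to French."),
    ("human", "I love programming."),
]
print(llm.invoke(messages).content)
```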
I saw in the README that you are using the Hugging Face runtime. The Open Inference API is working well, but having an OpenAI inference endpoint would be better for developers.
Hi @allilou, did you try to remove the
I'm using this code to make the call:

```python
from openai import OpenAI

client = OpenAI(
    base_url=f"{SERVER_URL}/openai/v1",
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model="my-model-name",
)
```

And I'm getting this error.
My custom model is the following (KServe 0.13):

```python
import logging
from typing import Any, Dict

import kserve
from colorama import Fore
from kserve import InferOutput, InferRequest, InferResponse, ModelServer
from kserve.errors import ModelMissingError
from kserve.utils.utils import generate_uuid
from vllm import EngineArgs, LLMEngine, SamplingParams

_logger = logging.getLogger(__name__)

# LLM_ID and LLM_PATH are configuration constants defined elsewhere.


class KserveLLM(kserve.Model):
    llm: LLMEngine

    def __init__(self, name: str):
        super().__init__(name)
        self.name = name
        self.ready = False
        self.logger = _logger

    def load(self):
        engine_args = EngineArgs(
            model=LLM_ID, download_dir=LLM_PATH, dtype="half", enforce_eager=True
        )
        self.llm = LLMEngine.from_engine_args(engine_args)

    def predict(self, request: InferRequest, headers: Dict = None) -> InferResponse:
        """Run generation with the vLLM engine and wrap the result in an InferResponse."""
        input_query = ""
        max_new_tokens = 1000
        temperature = 0.01
        top_k = 2
        top_p = 0.01
        for input in request.inputs:
            if input.name == "input_text":
                input_query = input.data[0]
            elif input.name == "temperature":
                temperature = input.data[0]
            elif input.name == "top_k":
                top_k = input.data[0]
            elif input.name == "top_p":
                top_p = input.data[0]
            elif input.name == "max_new_tokens":
                max_new_tokens = input.data[0]

        if len(input_query) == 0:
            error_message = "Empty query text!"
            self.logger.warning(f"[LLM]: {error_message}")
            return self._error_response(error_message)  # helper defined elsewhere

        self.llm_sampling_params = SamplingParams(
            temperature=temperature,
            max_tokens=max_new_tokens,
            top_k=top_k,
            top_p=top_p,
        )
        self.logger.info(f"[LLM] Query: {input_query}")
        self.llm.add_request("0", input_query, self.llm_sampling_params)

        output_text = ""
        try:
            # Step the engine until all queued requests have finished.
            while True:
                request_outputs = self.llm.step()
                for request_output in request_outputs:
                    if request_output.finished:
                        output_text += " ".join(
                            [o.text for o in request_output.outputs]
                        )
                if not self.llm.has_unfinished_requests():
                    break
        except Exception as e:
            error_message = "ERROR: Failed to generate predictions!"
            self.logger.error(f"[LLM] {error_message} {str(e)}")
            return self._error_response(error_message)

        self.logger.info(f"[LLM-OUT]: {output_text}")
        response_id = generate_uuid()
        infer_output = InferOutput(
            name="predictions", shape=[1, 1], datatype="BYTES", data=[output_text]  # BYTES for string output
        )
        infer_response = InferResponse(
            model_name=self.name, infer_outputs=[infer_output], response_id=response_id
        )
        return infer_response


if __name__ == "__main__":
    models_list: list[Any] = []
    # model servers
    llm: KserveLLM = KserveLLM("my-model-name")
    try:
        llm.load()
        models_list.append(llm)
    except ModelMissingError:
        _logger.error("Failed to load model [LLM]")

    if len(models_list) == 0:
        print("[NO MODEL TO LOAD]")
        exit()

    print(f"[LOADED]: {[type(model).__name__ for model in models_list]}")
    print(f"{Fore.BLUE}[SERVER]: STARTING")

    # Init ModelServer
    model_server = ModelServer(http_port=8080, workers=1, enable_docs_url=True)
    # Start server
    model_server.start(models_list)
```
Hi @allilou, your example is using the Open Inference (V2) predict API.
You should implement the OpenAIModel chat completion API instead. |
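For context, the custom predictor above only answers the Open Inference (V2) predict route, which is why the `openai` client call fails against it. Below is a hedged sketch of calling that predictor through the route it does implement; `SERVER_URL` is a placeholder, and the input names mirror the ones parsed in `predict()` above.

```python
# Hedged sketch: calling the custom KserveLLM predictor via the V2 REST
# predict endpoint it implements. SERVER_URL is a placeholder.
import requests

payload = {
    "inputs": [
        {"name": "input_text", "shape": [1], "datatype": "BYTES",
         "data": ["Translate to French: I love programming."]},
        {"name": "temperature", "shape": [1], "datatype": "FP32", "data": [0.01]},
        {"name": "max_new_tokens", "shape": [1], "datatype": "INT32", "data": [256]},
    ]
}

resp = requests.post(
    f"{SERVER_URL}/v2/models/my-model-name/infer",
    json=payload,
    timeout=120,
)
print(resp.status_code, resp.json())
```

To keep using the OpenAI-style clients from the earlier comments, the model needs to expose KServe's OpenAI chat-completions endpoint instead, as the Hugging Face runtime mentioned above already does.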
Describe the change you'd like to see
With KServe now supporting the OpenAI Schema for LLM runtimes, it would be helpful to have a few examples of use cases that exercise native LangChain and LlamaIndex features (text generation, RAG/QA, chat, etc.) against KServe-hosted models.
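As one illustration of what such an example could look like, here is a hedged sketch of driving a KServe-hosted model from LlamaIndex through its `OpenAILike` wrapper. The base URL, API key, and model name are placeholders, and it assumes the `/openai/v1` prefix discussed earlier in this thread.

```python
# Hedged sketch, not an official KServe example.
# Requires: pip install llama-index-llms-openai-like
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    model="MODEL_NAME",  # placeholder: the name the model is served under
    api_base="http://INGRESS_HOST:INGRESS_PORT/openai/v1",  # placeholder URL
    api_key="not-used",  # placeholder: set to whatever your gateway expects
    is_chat_model=True,
)

response = llm.chat([ChatMessage(role="user", content="Say this is a test")])
print(response.message.content)
```

`OpenAILike` is used rather than LlamaIndex's stock `OpenAI` class so that the model name is not validated against OpenAI's own model list.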
Additional context
Sample call -
Original Issue - kserve/kserve#3419