LLM abstraction #420

Open

pmeier opened this issue May 21, 2024 · 5 comments

pmeier (Member) commented May 21, 2024

Feature description

We currently abstract LLMs through ragna.core.Assistant. While this allows users to implement arbitrary assistants, it makes it unnecessarily hard to use LLMs for other tasks in the RAG pipeline. For example, preprocessing a prompt would require a custom implementation of potentially the same LLM used for answering the prompt.

Thus, I propose we implement an Llm base class and add built-in components for all assistants we currently have, probably under the ragna.llms namespace.

I'm not sure yet whether ragna.core.Assistant should subclass the new Llm base class or whether we should use composition instead. I'm open to both.

I'm also not 100% sure what the interface should look like. I would like to see a small survey of other tools in the ecosystem to compare.
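
For concreteness, here is a minimal sketch of the composition variant. All names are hypothetical (neither Llm nor ragna.llms exists yet), and EchoLlm is just a stand-in so the example runs:

```python
import abc
from typing import AsyncIterator


class Llm(abc.ABC):
    """Hypothetical base class for raw LLM access, decoupled from answering."""

    @abc.abstractmethod
    def generate(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        """Yield chunks of the completion for the given prompt."""
        ...


class EchoLlm(Llm):
    """Toy implementation so the sketch runs without a real model."""

    async def generate(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        for token in prompt.split():
            yield token + " "


class Assistant:
    """Composition variant: the assistant wraps an Llm and can reuse it freely."""

    def __init__(self, llm: Llm) -> None:
        self._llm = llm

    async def answer(self, prompt: str) -> str:
        # Preprocessing reuses the same LLM instead of requiring a duplicate.
        rewritten = "".join(
            [chunk async for chunk in self._llm.generate(f"Rewrite: {prompt}")]
        )
        return "".join([chunk async for chunk in self._llm.generate(rewritten)])
```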

Value and/or benefit

Less duplication for users who want to build a more complex pipeline.

Anything else?

No response

pmeier (Member, Author) commented May 21, 2024

Any solution here should also support #421.

dillonroach self-assigned this Jun 11, 2024

dillonroach commented

WIP: #432
It currently suggests adding chat.generate(), which calls Assistant.generate(), as an equivalent to answer that returns no Message and does no source handling or logging. This way an Assistant.answer() might call a preprocessing routine, which could in turn call generate. In the exl2 example, streaming uses its own generator, while the non-streaming path calls generate() and resolves it with an async list comprehension.

This can all easily be modified to take a list[Message]; I'm just not 100% sure where that rendering needs to live. I have a render_prompt() in the exl2 assistant as a very basic example (see the sketch below), but it may belong elsewhere. Happy to discuss.
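
Roughly, the shape being suggested, heavily simplified (the token loop is a stand-in for the real exl2 model call, and the names are placeholders, not the code in #432):

```python
from typing import AsyncIterator


class Exl2Assistant:
    """Simplified sketch of the pattern being discussed; names are placeholders."""

    def render_prompt(self, messages: list[str]) -> str:
        # Very basic render: concatenate the message history into one prompt.
        return "\n".join(messages)

    async def generate(self, prompt: str) -> AsyncIterator[str]:
        # Raw LLM call: no Message, no sources, no logging.
        for token in prompt.split():  # stand-in for the real model loop
            yield token + " "

    async def answer(self, messages: list[str]) -> str:
        # answer() can run a preprocessing routine that in turn calls generate().
        prompt = self.render_prompt(messages)
        # Non-streaming path: resolve generate() with an async list comprehension.
        return "".join([chunk async for chunk in self.generate(prompt)])
```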

pmeier (Member, Author) commented Jun 19, 2024

Here are my thoughts:

  • We don't need a generate method on the Chat object; that object is exclusively meant for asking questions about documents. If users want to use Ragna just as a wrapper around an LLM, they'll have to instantiate the assistant explicitly.
  • Adding a generate method to the assistant directly is a nice idea, because it avoids a lot of the design choices pitched above, e.g. we don't need to worry about inheritance vs. composition. However, this only works if we can assume that every LLM that we want to use for preprocessing can also be used for answering. Otherwise, it makes little sense to have both of these functionalities on one class.
  • Why do we want it on the base class? Do we want to enforce that every assistant can generate arbitrary strings? In the spirit of the feature request, we could just add the generate method to most of our builtin assistants, excluding, for example, RagnaDemoAssistant. That way users can use the LLMs we have for preprocessing without us enforcing this for every assistant.
  • The generate method currently only takes a prompt as input. I'd wager that, for it to be useful during preprocessing, we'd need to allow **kwargs that we pass on to the LLM as well. If we follow my suggestion from above and only add the generate method to the concrete assistants, we don't need to standardize this across assistants, which would be a nightmare.
  • The generate method currently returns a string. For our assistants to call this from the answer method, it needs to be a generator, or we can't stream.
  • The generate method is currently defined as a sync method (def generate), but all of our builtin assistants that could use it are async. We can just change this on a per-assistant basis, but we need to consider UX. I don't want to leave the user guessing whether the method is sync or async and thus whether the result has to be awaited or not. Maybe we define a generate (sync) and an agenerate (async)? A sketch of that pair follows below.
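
To make the last two points concrete, here is a minimal sketch of the sync/async pair with streaming and **kwargs on a concrete assistant. DemoAssistant and the token loop are stand-ins, not a proposed implementation:

```python
import asyncio
from typing import AsyncIterator, Iterator


class DemoAssistant:
    """Illustrative only: a concrete assistant exposing both variants."""

    def generate(self, prompt: str, **kwargs) -> Iterator[str]:
        # Sync streaming variant; **kwargs (e.g. temperature) would be
        # passed on to the underlying LLM call.
        for token in prompt.split():  # stand-in for a real API call
            yield token + " "

    async def agenerate(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        # Async streaming variant, callable from an async answer() method.
        for token in prompt.split():  # stand-in for a real API call
            await asyncio.sleep(0)  # yield control, as a real client would
            yield token + " "
```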

nenb (Contributor) commented Jun 19, 2024

> this only works if we can assume that every LLM that we want to use for preprocessing can also be used for answering

This seems fine to me. I can't think of any technical argument against this assumption.

I'm generally in favour of extending the Assistant interface with a new method to deal with preprocessing. I don't think that generate is the most intuitive name for such a method, but I'm not strongly against it either.

We would of course need to update the web API to handle this new method: if we were to fold it into the /answer endpoint, for example, we would not be able to track inputs/outputs in the ragna database. But I don't think it should be much work; we could just implement something similar to /answer for /generate (see the sketch below).
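
As a rough illustration of the "something similar to /answer" idea, a standalone FastAPI sketch (hypothetical names throughout; this is not ragna's actual API code):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class GenerateRequest(BaseModel):
    prompt: str


class GenerateResponse(BaseModel):
    output: str


@app.post("/generate")
async def generate(request: GenerateRequest) -> GenerateResponse:
    # Hypothetical wiring: look up the configured assistant, call its
    # generate method, and record the input/output pair in the database,
    # mirroring what /answer does for answer().
    output = " ".join(request.prompt.split())  # stand-in for assistant call
    return GenerateResponse(output=output)
```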

dillonroach commented

Comments make sense to me.

Re streaming: I'm not against the idea, but what's the use case for streaming specifically on the generate() part of the API? Most use cases will be python -> LLM -> python patterns and not really displayed to users. I could maybe see the case if you had a ton of preprocessing that you needed the user to be aware of, but I'd likely just show that as an updating piece of text in the UI displaying the current step (preprocessing/fetching sources/etc.), if anything at all. Most of what is meant to be displayed (and therefore streamed) lives in answer; generate is more for the pieces out of view.

Kwargs are for sure needed; that's a good point.

Re "Why do we want it on the base class?": that's really design philosophy, so I bow to your vision there. If we have it in the base class, we at least try to enforce patterns; if a given assistant doesn't need a generate(), maybe it just gets implemented as a pass-through or some basic type-correct return. Or maybe generate is a core piece within answer, and answer simply returns a different dtype (Message), with the user choosing whether it also fetches sources or just runs generate(). An assistant without need of an answer() could basically just have answer call generate and cast the result into a Message (see the sketch below).
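
A minimal sketch of that last variant, where answer is just a cast of generate's output (Message here is a stand-in dataclass, not ragna.core.Message):

```python
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class Message:  # stand-in for ragna.core.Message
    content: str


class Assistant:
    """Base-class variant: generate() is the core, answer() casts to Message."""

    async def generate(self, prompt: str) -> AsyncIterator[str]:
        # Basic type-correct default for assistants with nothing special to do.
        yield prompt

    async def answer(self, prompt: str) -> Message:
        # Assistants without a bespoke answer() just wrap generate()'s output.
        chunks = [chunk async for chunk in self.generate(prompt)]
        return Message(content="".join(chunks))
```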

"Generate" is a solid 6/10 in my book, so I'm on board with Nick on that one: a placeholder at best, but I don't really love many of the alternatives either. I just asked GPT-4o for ideas and got inference, evaluate, query, invoke, compute, process, and run, among a lot of others I like much less. Inference is certainly the most literal, so there's zero question as to what it does, but I might lean toward evaluate, query, or process over that one.

Status: Todo · No branches or pull requests · 3 participants