
feat(llm): convert function call request for non-funcall OSS model #4711

Merged · 52 commits · Nov 14, 2024

Conversation

@xingyaoww (Collaborator) commented Nov 2, 2024

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

This PR adds a general utility that automatically converts function-calling LLM requests into non-function-calling LLM requests under the hood (see the sketch after this list).

  • This gets rid of the need to maintain two sets of prompts for (1) function-calling and (2) non-function-calling models, which will greatly reduce our maintenance burden.
  • Function calling is now ON by default: going forward, we only need to iterate on "function calling mode", and the "non-function calling" backward compatibility will happen automatically under the hood.
  • We now curate a list of "supported function calling models" in llm.py based on the evaluation results below:
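For intuition, here is a minimal sketch of what such a conversion layer could look like. This is not the code in this PR; the function names (`tools_to_prompt`, `parse_tool_call`), the `<function=...>` text format, and the prompt wording are illustrative assumptions. The idea is that the tool specs get rendered into the system prompt, and the model's plain-text reply is parsed back into a structured tool call:

```python
import json
import re
from typing import Any, Optional


def tools_to_prompt(tools: list[dict[str, Any]]) -> str:
    """Render OpenAI-style tool specs into plain-text instructions (hypothetical sketch)."""
    lines = ["You have access to the following tools:"]
    for tool in tools:
        fn = tool["function"]
        params = json.dumps(fn.get("parameters", {}))
        lines.append(f"- {fn['name']}: {fn.get('description', '')}")
        lines.append(f"  parameters (JSON schema): {params}")
    lines.append(
        "To call a tool, reply with exactly one block of the form:\n"
        '<function=TOOL_NAME>\n{"argument": "value"}\n</function>'
    )
    return "\n".join(lines)


def parse_tool_call(text: str) -> Optional[dict[str, Any]]:
    """Recover a structured tool call from the model's plain-text reply (hypothetical sketch)."""
    match = re.search(r"<function=(\w+)>\s*(\{.*?\})\s*</function>", text, re.DOTALL)
    if match is None:
        return None  # ordinary message, no tool call
    name, raw_args = match.groups()
    return {"name": name, "arguments": json.loads(raw_args)}
```

The point is that agent code keeps speaking the function-calling interface everywhere, while the translation to and from plain text happens transparently inside the LLM layer.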

Evaluation results so far:

  • Claude is still the default go-to model :)
  • Llama 3.1 70B & Gemini-002-pro barely worked in function calling mode, but work much better without function calling
  • For OSS: Llama, Qwen, and Deepseek are all good options (though with a low resolve rate)
[image: evaluation results]

Link of any specific issues this addresses

Should fix #4865


To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:628201b-nikolaik \
  --name openhands-app-628201b \
  docker.all-hands.dev/all-hands-ai/openhands:628201b

@xingyaoww xingyaoww changed the title feat(llm): convert function call request to a format acceptable to non-funcall OSS model feat(llm): convert function call request for non-funcall OSS model Nov 2, 2024
@xingyaoww xingyaoww changed the base branch from xw/fn-calling to main November 2, 2024 19:47
@xingyaoww xingyaoww marked this pull request as ready for review November 13, 2024 16:54
xingyaoww and others added 2 commits November 13, 2024 12:47
Co-authored-by: Calvin Smith <email@cjsmith.io>
@xingyaoww xingyaoww requested a review from enyst November 14, 2024 01:33
@enyst (Collaborator) commented Nov 14, 2024

This is a great idea and a great PR. We should be doing this; it will be better to clear out that FC/non-FC code...

However, I do have some (half-baked) thoughts. For the sake of clarity I'll express them a bit roughly, even though it's hard to be sure of this kind of stuff.

I think those results are interesting because they ... don't match expectations. My intuitions were that

  1. most, not all, but most LLMs that natively support fc would work the same or a bit better
  2. gpt-4o fc results would be below, but visibly in the "same class"
  3. gemini (!) fc results would be below, but visibly in roughly the "same class" (1% ! 👀 🫨)
  4. fc or non-fc, some OSS results would be sort of close

I think a possible explanation of these results, in part, is that we are now doing a fairly extreme version of optimizing prompts for Claude. We're not just "not helping" these other LLMs, we might be prompting in some ways that are "bad" for them.

Just for illustration, a few examples of what I mean:

  • Deepseek with the browsing agent can respond correctly but still fail the task: it responded with a message that included Python variables, when our framework expected just text. Note: it did follow the instructions. We simply never saw Claude respond this way to these instructions, so our code assumed it wouldn't happen.
  • Gemini falls into this stuck scenario. I find this interesting because I've never seen CodeAct hit that before. That code was, IMO, literally dead code for months; it has been there since the monologue agent's time, and I honestly thought about removing it (or keeping it for the amazingly precise reason "in theory it's not reaaaaally impossible" 😅).
    IMHO, what this says is that our CodeAct prompts are enough for GPT-4o not to hit that, but they're not appropriate enough for Gemini.

I feel like we may need to consider prompting better and testing/responding better for something like 3-4 LLMs, of which at least one should be OSS...

But I don't know if/how we may want to square this circle, to both:

  • try to give at least 3-4 LLMs other than Claude a fair chance to show us what they can do,
  • and at the same time ... I do agree that maintaining 2 sets of prompts is not great... let alone 3-4.

Other note:

  • have we looked at the Nous Research Llamas? I would hope we can try one sometime. And/or Llama 3.2?

@xingyaoww (Collaborator, Author) commented Nov 14, 2024

@enyst great questions!

most, not all, but most LLMs that natively support fc would work the same or a bit better

One potential reason I'm seeing is that most OSS LLMs that support function calling are, under the hood, using a JSON format: https://docs.together.ai/docs/llama-3-function-calling -- and you know what happens when you ask an LLM to produce code escaped inside JSON :) https://aider.chat/2024/08/14/code-in-json.html
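To make the escaping point concrete, here is a toy example (not taken from the PR or the benchmark) of what the same snippet looks like once it has to travel inside a JSON string argument:

```python
import json

# A small edit an agent might want to emit.
code = 'def greet(name):\n    print(f"Hello, {name}!")\n'

# Inside a JSON tool-call argument, every newline and quote must be escaped:
print(json.dumps({"new_str": code}))
# {"new_str": "def greet(name):\n    print(f\"Hello, {name}!\")\n"}

# Emitting the same code as plain text in a prompt-based format needs no escaping,
# which is the failure mode the aider post above is about.
```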

gemini (!) fc results would be below, but visibly in roughly the "same class" (1% ! 👀 🫨)

For Gemini, the bad function calling result is more likely a bug / artifact

It is able to call tools correctly early in the interaction:
[image: correct tool calls early in the trajectory]

But it starts to add weird extra fields to the tool call later in the same trajectory 😓
[image: tool call with unexpected extra fields]

I feel like we may need to consider prompting better and test/respond better for like 3-4 LLMs, of which at least one OSS...

We could craft different prompts for different models now, but it feels to me that is (1) time-consuming and (2) hard to keep stable -- new models are coming out all the time, and it is really hard to craft ONE prompt that works well on all of them.

My inclination now is that we are optimizing for max(score for score in model_performances), and when it comes to using an off-the-shelf model, the best option is probably to use Claude's prompt.

But @Jiayi-Pan and I are working on a research project, due to be released in the next month, that would allow us to train OSS models on arbitrary OpenHands prompts and specialize them for OpenHands tasks - this is likely a more fundamental solution for OSS models, IMO.

Successfully merging this pull request may close these issues.

[Bug]: GPT4O not using workspace by default.