
feat(llm): convert function call request for non-funcall OSS model #4711

Merged · 52 commits · Nov 14, 2024

Conversation

@xingyaoww (Collaborator) commented Nov 2, 2024

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

This PR adds a general utility that automatically converts function-calling LLM requests into non-function-calling LLM requests under the hood (see the sketch after this list).

  • This gets rid of the need to maintain two sets of prompts for (1) function-calling and (2) non-function-calling models, which will greatly reduce our maintenance burden.
  • Function calling is now ON by default: going forward, we only need to iterate on "function calling mode", and the "non-function calling" backward compatibility will happen automatically under the hood.
  • We now curate a list of "supported function calling models" in llm.py based on the evaluation results below:
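For intuition, here is a minimal sketch of what such a conversion layer could look like. This is not the code in this PR; the function names (`tools_to_prompt`, `parse_tool_call`), the `<function=...>` text format, and the prompt wording are illustrative assumptions. The idea is that the tool specs get rendered into the system prompt, and the model's plain-text reply is parsed back into a structured tool call:

```python
import json
import re
from typing import Any, Optional


def tools_to_prompt(tools: list[dict[str, Any]]) -> str:
    """Render OpenAI-style tool specs into plain-text instructions (hypothetical sketch)."""
    lines = ["You have access to the following tools:"]
    for tool in tools:
        fn = tool["function"]
        params = json.dumps(fn.get("parameters", {}))
        lines.append(f"- {fn['name']}: {fn.get('description', '')}")
        lines.append(f"  parameters (JSON schema): {params}")
    lines.append(
        "To call a tool, reply with exactly one block of the form:\n"
        '<function=TOOL_NAME>\n{"argument": "value"}\n</function>'
    )
    return "\n".join(lines)


def parse_tool_call(text: str) -> Optional[dict[str, Any]]:
    """Recover a structured tool call from the model's plain-text reply (hypothetical sketch)."""
    match = re.search(r"<function=(\w+)>\s*(\{.*?\})\s*</function>", text, re.DOTALL)
    if match is None:
        return None  # ordinary message, no tool call
    name, raw_args = match.groups()
    return {"name": name, "arguments": json.loads(raw_args)}
```

The point is that agent code keeps speaking the function-calling interface everywhere, while the translation to and from plain text happens transparently inside the LLM layer.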

Evaluation results so far:

  • Claude is still the default go-to model :)
  • Llama 3.1 70B & Gemini-002-pro barely worked in function calling mode, but work much better without function calling
  • For OSS: Llama, Qwen, and Deepseek are all good options (though with a low resolve rate)
[image: evaluation results]

Link of any specific issues this addresses

Should fix #4865


To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:628201b-nikolaik \
  --name openhands-app-628201b \
  docker.all-hands.dev/all-hands-ai/openhands:628201b

@xingyaoww xingyaoww changed the title feat(llm): convert function call request to a format acceptable to non-funcall OSS model feat(llm): convert function call request for non-funcall OSS model Nov 2, 2024
@xingyaoww xingyaoww changed the base branch from xw/fn-calling to main November 2, 2024 19:47
@xingyaoww xingyaoww marked this pull request as ready for review November 13, 2024 16:54
xingyaoww and others added 2 commits November 13, 2024 12:47
Co-authored-by: Calvin Smith <email@cjsmith.io>
@xingyaoww xingyaoww requested a review from enyst November 14, 2024 01:33
@enyst (Collaborator) commented Nov 14, 2024

This is a great idea and a great PR. We should be doing this; it will be better to clear out that FC/non-FC code...

However, I do have some (half-baked) thoughts. For the sake of clarity I'll express them a bit roughly, even though it's hard to be sure of this kind of stuff.

I think those results are interesting because they ... don't match expectations. My intuitions were that

  1. most, not all, but most LLMs that natively support fc would work the same or a bit better
  2. gpt-4o fc results would be below, but visibly in the "same class"
  3. gemini (!) fc results would be below, but visibly in roughly the "same class" (1% ! 👀 🫨)
  4. fc or non-fc, some OSS results would be sort of close

I think a possible explanation of these results, in part, is that we are now doing a fairly extreme version of optimizing prompts for Claude. We're not just "not helping" these other LLMs, we might be prompting in some ways that are "bad" for them.

Just for illustration, a few examples of what I mean:

  • Deepseek with the browsing agent can respond correctly but still fail the task: it responded with a message that included Python variables, when our framework expected just text. Note: it did follow the instructions. We simply never saw Claude respond this way to these instructions, so our code assumed it wouldn't happen.
  • Gemini falls into this stuck scenario. I find this interesting because I've never seen CodeAct hit that before. That code was, IMO, literally dead code for months; it has been there since the monologue agent's time, and I honestly thought about removing it (or keeping it for the amazingly precise reason "in theory it's not reaaaaally impossible" 😅).
    IMHO, what this says is that our CodeAct prompts are enough for GPT-4o not to hit that, but they're not appropriate enough for Gemini.

I feel like we may need to consider prompting better and testing/responding better for something like 3-4 LLMs, of which at least one should be OSS...

But I don't know if/how we may want to square this circle, to both:

  • try to give at least 3-4 LLMs other than Claude a fair chance to show us what they can do,
  • and at the same time ... I do agree that maintaining 2 sets of prompts is not great... let alone 3-4.

Other note:

  • have we looked at the Nous Research Llamas? I would hope we can try one sometime. And/or Llama 3.2?

@xingyaoww (Collaborator, Author) commented Nov 14, 2024

@enyst great questions!

most, not all, but most LLMs that natively support fc would work the same or a bit better

One potential reason I'm seeing is that most OSS LLMs that support function calling are, under the hood, using a JSON format: https://docs.together.ai/docs/llama-3-function-calling -- and you know what happens when you ask an LLM to produce code escaped inside JSON :) https://aider.chat/2024/08/14/code-in-json.html
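To make the escaping point concrete, here is a toy example (not taken from the PR or the benchmark) of what the same snippet looks like once it has to travel inside a JSON string argument:

```python
import json

# A small edit an agent might want to emit.
code = 'def greet(name):\n    print(f"Hello, {name}!")\n'

# Inside a JSON tool-call argument, every newline and quote must be escaped:
print(json.dumps({"new_str": code}))
# {"new_str": "def greet(name):\n    print(f\"Hello, {name}!\")\n"}

# Emitting the same code as plain text in a prompt-based format needs no escaping,
# which is the failure mode the aider post above is about.
```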

gemini (!) fc results would be below, but visibly in roughly the "same class" (1% ! 👀 🫨)

For Gemini, the bad function calling result is more likely a bug / artifact

It is able to call tools correctly early in the interaction:
[image: correct tool calls early in the trajectory]

But it starts to add weird extra fields to the tool call later in the same trajectory 😓
[image: tool call with unexpected extra fields]

I feel like we may need to consider prompting better and test/respond better for like 3-4 LLMs, of which at least one OSS...

We could craft different prompts for different models now, but it feels to me that is (1) time-consuming and (2) hard to keep stable -- new models are coming out all the time, and it is really hard to craft ONE prompt that works well on all of them.

My inclination now is that we are optimizing for max(score for score in model_performances), and when it comes to using an off-the-shelf model, the best option is probably to use Claude's prompt.

But @Jiayi-Pan and I are working on a research project, due to be released in the next month, that would allow us to train OSS models on arbitrary OpenHands prompts and specialize them for OpenHands tasks - this is likely a more fundamental solution for OSS models, IMO.

Successfully merging this pull request may close these issues.

[Bug]: GPT4O not using workspace by default.