add benchmark apibank, gorilla, nexus #1136
Open
+421,979
−0
I've added the APIBank, APIBench (Gorilla), and Nexus benchmarks. For the main logic, see the benchmark tests and the utils folder (benchmark_base.py).
Some problems remain to be solved for the APIBank, APIBench (Gorilla), and Nexus benchmarks. They are listed below.
For Nexus:
Run python nexus_test.py. You'll get an error.
1. OpenAI limits the size of the functions passed to the function call API (function name, function description length, number of functions, etc.). Camel needs logic to check for this: if OpenAI does not accept the function call, fall back to structured output instead.
2. Critical: while True bug in camel.chatagent.step. When the passed-in API is not executed correctly, the while True loop never terminates. The while True logic should be removed. You cannot assume that a function passed by the user will always execute correctly.
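Items 1 and 2 could be handled along the lines of the sketch below. It is not Camel's code: the helper names, the next_tool callback, and the iteration budget are hypothetical, and the numeric limits are assumptions that should be checked against the current OpenAI documentation.

```python
import re
from typing import Any, Callable, Optional

# Assumed limits, for illustration only; verify against the OpenAI docs.
MAX_TOOLS = 128
MAX_NAME_LEN = 64
MAX_DESCRIPTION_LEN = 1024
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")


def fits_function_call_limits(schemas: list[dict[str, Any]]) -> bool:
    """Return True if every function schema stays within the assumed limits."""
    if len(schemas) > MAX_TOOLS:
        return False
    for schema in schemas:
        name = schema.get("name") or ""
        description = schema.get("description") or ""
        if len(name) > MAX_NAME_LEN or not NAME_PATTERN.match(name):
            return False
        if len(description) > MAX_DESCRIPTION_LEN:
            return False
    return True


def call_tools_bounded(
    next_tool: Callable[[list[Any]], Optional[Callable[[], Any]]],
    max_iterations: int = 5,
) -> tuple[list[Any], bool]:
    """Run tool calls until none is requested or the budget is used up.

    next_tool(results) is a hypothetical callback that returns a
    zero-argument callable for the next tool call, or None when done.
    Returns (results, finished).
    """
    results: list[Any] = []
    for _ in range(max_iterations):
        tool = next_tool(results)
        if tool is None:
            return results, True
        try:
            results.append(tool())
        except Exception as exc:
            # A failing user function is recorded instead of retried forever.
            results.append(f"tool error: {exc!r}")
    return results, False
```

When fits_function_call_limits returns False, the caller would switch to structured output; when call_tools_bounded returns finished as False, the caller can report the failure instead of hanging.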
For APIBench:
There are three datasets: 'torchhub', 'tensorhub', and 'huggingface'. 'torchhub' works well, but:
3. 'tensorhub' and 'huggingface' cannot be evaluated correctly by the AST matching program. This is a problem in the original repo. I have already opened an issue: [https://github.com/ShishirPatil/gorilla/issues/729]
It could be a tree_sitter version problem, but if you don't use tree_sitter==0.20.4, you'll hit a different bug.
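Since the evaluation depends on that exact tree_sitter version, a small guard before running the benchmark may save debugging time. This is a minimal sketch; check_pinned is a hypothetical helper, not part of this PR.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

REQUIRED_TREE_SITTER = "0.20.4"  # the version named in the PR description


def check_pinned(
    package: str, required: str, installed: Optional[str] = None
) -> tuple[bool, str]:
    """Compare an installed package version with the required pin.

    `installed` can be passed explicitly (useful for testing); otherwise
    it is looked up from the current environment.
    """
    if installed is None:
        try:
            installed = version(package)
        except PackageNotFoundError:
            return False, f"{package} is not installed; expected {required}"
    if installed != required:
        return False, f"{package}=={installed} found; expected {required}"
    return True, f"{package}=={installed}"


if __name__ == "__main__":
    ok, message = check_pinned("tree_sitter", REQUIRED_TREE_SITTER)
    print(("OK: " if ok else "WARNING: ") + message)
```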
For APIBank:
There are three datasets: 'level1', 'level2', and 'level3'. But:
4. No one knows how to evaluate 'level3'. See the issues in the original repo:
[https://github.com/AlibabaResearch/DAMO-ConvAI/issues/167]
[https://github.com/AlibabaResearch/DAMO-ConvAI/issues/102]
[https://github.com/AlibabaResearch/DAMO-ConvAI/issues/114]
5. APIBank involves multiple "User-Assistant-System" messages as history records. Camel's ChatAgent does not yet support adding multiple rounds of system messages. Temporary solution: use record_message and make_assistant_message instead of system messages.
6. There is a version conflict between the openai package used in Camel and the Https and Google Translate libraries used in the original repo; see
[https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/api-bank#demo]. The Camel, Https, and Google Translate libraries don't work together.
For now, two approaches work:
- Use the original repo without Camel; Google Translate and Https work well.
- Use Camel and remove Google Translate; it works, but without the Google Translate tool.
See:
[https://github.com/microsoft/TaskWeaver/issues/172]
7. Some datasets need to be hosted on GitHub/HuggingFace. The original authors did not do this, but we do not want to include the data in Camel's GitHub repo.
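For item 5, the workaround of recording earlier turns as assistant messages could look roughly like the sketch below. The turn format (dicts with role and content keys), the role labels, and the record callback are all assumptions made for illustration; in Camel the callback would presumably wrap record_message and make_assistant_message, whose exact signatures are not shown here.

```python
from typing import Callable


def replay_history(
    turns: list[dict[str, str]],
    record: Callable[[str, str], None],
) -> None:
    """Feed earlier turns to `record(role, content)`.

    System turns are folded into assistant turns with a "[System]" tag,
    since only one system message can be set on the agent.
    """
    for turn in turns:
        role = turn["role"]
        content = turn["content"]
        if role == "User":
            record("user", content)
        elif role == "System":
            record("assistant", f"[System] {content}")
        else:
            # Anything else is treated as an assistant turn.
            record("assistant", content)


if __name__ == "__main__":
    # Hypothetical history, only to show the call shape.
    history = [
        {"role": "User", "content": "Book a meeting for tomorrow."},
        {"role": "Assistant", "content": "Calling the scheduling API."},
        {"role": "System", "content": "API result: success"},
    ]
    replay_history(history, lambda role, content: print(role, "|", content))
```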