Is your feature request related to a problem? Please describe.
LLMs were trained on a much larger corpus of Python code than of JSON, so the quality of their structured output tends to be better when they produce Python.
Describe the solution you'd like
Here is how it works: instead of asking the LLM to produce a valid JSON response, you ask it to generate Python code that builds the response, then execute that code and extract the results (see the sketch below).
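A minimal sketch of the idea, not a concrete API proposal. The prompt wording and the `call_llm` helper are hypothetical placeholders for however the framework already talks to the model; the point is only that the "structured output" comes back as executable Python rather than JSON:

```python
# Sketch: ask the model for Python code instead of JSON, run the code, and
# read the result from a well-known variable. `call_llm` is a hypothetical
# function that returns the model's raw text completion.

PROMPT = """\
Write Python code that extracts the person's name and age from the text below
and assigns them as a dict to a variable called `result`.
Return only the code, with no explanations.

Text: "Alice is a 34-year-old engineer from Berlin."
"""

def extract_with_python(call_llm) -> dict:
    code = call_llm(PROMPT)       # e.g. "result = {'name': 'Alice', 'age': 34}"
    namespace: dict = {}
    exec(code, namespace)         # in practice this must run in a sandbox, see below
    return namespace["result"]    # structured output, no JSON parsing or repair needed
```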
Describe alternatives you've considered
JSON-Mode
Additional context
This request is inspired by HuggingFace SmolAgents.
For this, you'll need a safe Python execution environment such as https://github.com/e2b-dev/E2B
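To avoid guessing E2B's exact SDK calls, here is only a stand-in for the isolation idea: run the generated code in a separate interpreter with a timeout and capture the JSON it prints. This is not a real security boundary; a production setup would hand the code to a proper sandbox service such as E2B instead.

```python
# Stand-in for sandboxed execution (NOT a real security boundary): run the
# generated code in a separate Python process with a timeout and read its
# stdout. A real deployment would delegate this step to a sandbox like E2B.
import json
import subprocess
import sys

def run_generated_code(code: str, timeout: float = 5.0) -> dict:
    # The generated code is expected to print a single JSON document on stdout.
    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env vars and user site-packages
        capture_output=True, text=True, timeout=timeout,
    )
    proc.check_returncode()
    return json.loads(proc.stdout)

print(run_generated_code(
    "import json; print(json.dumps({'name': 'Alice', 'age': 34}))"
))
```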
JSON vs. Python benchmarking
Subjectively, the JSON-oriented benchmarks fluctuate around 88% accuracy, while the Python code benchmarks sit around 91%.
JSON Benchmarks: Tool Use (BFCL), MMLU (including tool use and function calling), MT-Bench
Python Benchmarks: Code (HumanEval-python), HumanEval+, HumanEval-Pro, LiveCodeBench (Pass@1-COT), Codeforces (Percentile/Rating), SWE Verified (Resolved), Aider-Polyglot (Acc.)
https://blog.langchain.dev/extraction-benchmarking/
https://www.promptlayer.com/blog/llm-benchmarks-a-comprehensive-guide-to-ai-model-evaluation
https://llm-stats.com/
https://paperswithcode.com/sota/code-generation-on-mbpp
https://paperswithcode.com/sota/code-generation-on-humaneval
https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
https://anonymous.4open.science/r/PythonSaga/README.md
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/
https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard
https://github.com/OpenBMB/ToolBench?tab=readme-ov-file#-model-experiments-results
https://gorilla.cs.berkeley.edu/leaderboard.html#leaderboard
https://answers111.github.io/evalpro.github.io/leaderboard.html
https://evalplus.github.io/repoqa.html
https://evalplus.github.io/evalperf.html
https://github.com/svilupp/Julia-LLM-Leaderboard?tab=readme-ov-file#julia-llm-leaderboard
https://huggingface.co/spaces/qiantong-xu/toolbench-leaderboard
https://scale.com/leaderboard/coding
https://huggingface.co/spaces/opencompass/opencompass-llm-leaderboard
https://aider.chat/docs/leaderboards/
https://www.swebench.com/#verified
https://codeforces.com/profile/Leaderboard?graphType=all
https://livecodebench.github.io/leaderboard.html
https://www.llmcodearena.com/top-models