Python Mode instead of JSON mode. #1244

Open
qdrddr opened this issue Jan 27, 2025 · 0 comments

Comments

qdrddr (Contributor) commented Jan 27, 2025

Is your feature request related to a problem? Please describe.
LLMs were trained on a much larger corpus of Python code than of JSON, so output quality tends to be better when the model responds in Python.

Describe the solution you'd like
Here is how it works: instead of asking the LLM to produce a valid JSON response, you ask it to emit Python code that contains the response, then execute that code and extract the results (a minimal sketch follows).
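
A rough sketch of what this could look like; the `call_llm` helper, the prompt wording, and the `result` variable convention are hypothetical placeholders, not an existing API:

```python
# Minimal sketch of "Python mode": ask the model for code instead of JSON,
# execute it, and read the answer out of a well-known variable.
# call_llm(prompt) -> str is a hypothetical helper that returns the raw completion.

PROMPT_TEMPLATE = """Extract the requested fields from the text below.
Respond ONLY with Python code that assigns a dict to a variable named `result`,
for example: result = {{"name": ..., "age": ...}}

Text: {text}
"""

def extract_with_python_mode(text: str, call_llm) -> dict:
    code = call_llm(PROMPT_TEMPLATE.format(text=text))
    # NOTE: in production the generated code must run in a sandbox (see E2B below);
    # a bare exec() of model output is unsafe and is used here only for illustration.
    namespace: dict = {}
    exec(compile(code, "<llm-output>", "exec"), {"__builtins__": {}}, namespace)
    return namespace["result"]

# Example usage with a stubbed model:
fake_llm = lambda prompt: 'result = {"name": "Ada", "age": 36}'
print(extract_with_python_mode("Ada, 36, London", fake_llm))  # {'name': 'Ada', 'age': 36}
```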

Describe alternatives you've considered
JSON mode

Additional context
This request is inspired by Hugging Face SmolAgents.
For this you'll need a safe (sandboxed) Python execution environment such as E2B: https://github.com/e2b-dev/E2B (see the sketch below).
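
As a rough illustration (assuming the e2b-code-interpreter Python SDK's Sandbox/run_code interface; check the E2B docs for the exact, current API), the execution step could be delegated to the sandbox like this:

```python
# Hedged sketch of running the model-generated code in an E2B sandbox instead of
# a local exec(). Class/method names follow my reading of the e2b-code-interpreter
# SDK (pip install e2b-code-interpreter, E2B_API_KEY set) and may differ by version.
from e2b_code_interpreter import Sandbox

def run_in_sandbox(code: str):
    sandbox = Sandbox()                     # isolated cloud interpreter
    try:
        execution = sandbox.run_code(code)  # execute the LLM's Python remotely
        return execution                    # results, stdout/stderr logs, errors
    finally:
        sandbox.kill()                      # always tear the sandbox down
```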

JSON vs. Python benchmarking
Subjectively, JSON mode seems to fluctuate around 88% accuracy, while Python is around 91%.
JSON benchmarks: Tool Use (BFCL), MMLU (including tool use and function calling), MT-Bench
Python benchmarks: Code (HumanEval-python), HumanEval+, HumanEval-Pro, LiveCodeBench (Pass@1-COT), Codeforces (Percentile/Rating), SWE Verified (Resolved), Aider-Polyglot (Acc.)

https://blog.langchain.dev/extraction-benchmarking/

https://www.promptlayer.com/blog/llm-benchmarks-a-comprehensive-guide-to-ai-model-evaluation
https://llm-stats.com/
https://paperswithcode.com/sota/code-generation-on-mbpp
https://paperswithcode.com/sota/code-generation-on-humaneval
https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
https://anonymous.4open.science/r/PythonSaga/README.md
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/
https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard
https://github.com/OpenBMB/ToolBench?tab=readme-ov-file#-model-experiments-results
https://gorilla.cs.berkeley.edu/leaderboard.html#leaderboard
https://answers111.github.io/evalpro.github.io/leaderboard.html
https://evalplus.github.io/repoqa.html
https://evalplus.github.io/evalperf.html
https://github.com/svilupp/Julia-LLM-Leaderboard?tab=readme-ov-file#julia-llm-leaderboard
https://huggingface.co/spaces/qiantong-xu/toolbench-leaderboard
https://scale.com/leaderboard/coding
https://huggingface.co/spaces/opencompass/opencompass-llm-leaderboard
https://aider.chat/docs/leaderboards/
https://www.swebench.com/#verified
https://codeforces.com/profile/Leaderboard?graphType=all
https://livecodebench.github.io/leaderboard.html
https://www.llmcodearena.com/top-models
