xFinder: Robust and Pinpoint Answer Extraction for Large Language Models
Topics: benchmark, regex, reliability, evaluation, dataset, gpt, phi, large-language-models, llm, open-compass, chatglm, qwen, lm-evaluation, llm-as-a-judge, llm-as-evaluator, xfinder, reliable-evaluation, key-answer-extraction, judge-model
Updated Oct 28, 2024 - Python