ThaiLLM-Leaderboard Eval Runner

The Thai-LLM Leaderboard 🇹🇭 standardizes how large language models (LLMs) are evaluated on Thai-language tasks, building on Seacrowd. It is an open community project, and we welcome submissions of new evaluation tasks and models.

Run an Eval

Install

pip install -r requirements.txt

Run Eval

OPENAI_API_KEY=xxx MODEL_NAME=airesearch/LLaMa3-8b-WangchanX-sft-Full sh eval_only.sh

Run Eval (Alternative)

# Run against any OpenAI-compatible API endpoint
python evaluation/main_nlu_prompt_batch.py tha $MODEL_NAME 4 https://api.xx.xx/v1 apikey-xxxxx
python evaluation/main_nlg_prompt_batch.py tha $MODEL_NAME 0 4 https://api.xx.xx/v1 apikey-xxxxx
python evaluation/main_llm_judge_batch.py $MODEL_NAME --data ThaiLLM-Leaderboard/mt-bench-thai --base_url https://api.xx.xx/v1 --api_key apikey-xxxxx
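If you prefer to drive the three commands above from Python, the sketch below assembles them as argument lists. It is only an illustration: the helper name build_eval_commands is hypothetical and not part of this repository, and the positional values "4" and "0 4" are copied verbatim from the commands above without interpreting them.

```python
import shlex


def build_eval_commands(model_name, base_url, api_key):
    """Return the three evaluation commands as argv lists.

    Hypothetical helper, not part of the repository. Script paths and
    positional values are copied verbatim from the commands above.
    """
    return [
        # NLU pipeline
        ["python", "evaluation/main_nlu_prompt_batch.py",
         "tha", model_name, "4", base_url, api_key],
        # NLG pipeline
        ["python", "evaluation/main_nlg_prompt_batch.py",
         "tha", model_name, "0", "4", base_url, api_key],
        # LLM-as-judge pipeline
        ["python", "evaluation/main_llm_judge_batch.py", model_name,
         "--data", "ThaiLLM-Leaderboard/mt-bench-thai",
         "--base_url", base_url, "--api_key", api_key],
    ]


# Print the commands for inspection; the placeholder values match the README.
for argv in build_eval_commands(
    "airesearch/LLaMa3-8b-WangchanX-sft-Full",
    "https://api.xx.xx/v1",
    "apikey-xxxxx",
):
    print(shlex.join(argv))
```

To execute a command rather than print it, pass each list to subprocess.run(argv, check=True) from the repository root.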

Submit Eval Result

RESULTS_REPO=ThaiLLM-Leaderboard/results HF_TOKEN=xxx python scripts/transform_result.py $MODEL_NAME

Develop an Eval

New Dataset Based on the Same Pipeline (NLU, NLG, LLM as Judge)

Edit evaluation/data_utils.py to include your evaluation dataset.
Create a pull request on https://huggingface.co/spaces/ThaiLLM-Leaderboard/leaderboard that maps your new result key to a human-readable name in the mapper in leaderboard/read_evals.py.
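As a rough illustration of what a result-key-to-name mapper can look like, here is a minimal sketch. The actual structure and names in leaderboard/read_evals.py may differ: TASK_DISPLAY_NAMES, display_name, and the display strings are assumptions made for this example, while the result keys are taken from the example output file in this README.

```python
# Hypothetical mapper; the real one in leaderboard/read_evals.py may be
# organized differently. Display names are illustrative only.
TASK_DISPLAY_NAMES = {
    "xcopa_tha_seacrowd_qa": "XCOPA (Thai)",
    "wisesight_thai_sentiment_seacrowd_text": "Wisesight Sentiment",
    "belebele_tha_thai_seacrowd_qa": "Belebele (Thai)",
    "xnli.tha_seacrowd_pairs": "XNLI (Thai)",
}


def display_name(result_key):
    """Return the human-readable name, falling back to the raw key."""
    return TASK_DISPLAY_NAMES.get(result_key, result_key)


print(display_name("xcopa_tha_seacrowd_qa"))
print(display_name("my_new_dataset_seacrowd_qa"))
```

Falling back to the raw key means a result whose key has not been mapped yet is still displayed under its key instead of raising an error.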

New Eval Pipeline

Create a file similar to evaluation/main_**_batch.py to run an evaluation and output its results.
Add a method in scripts/transform_result.py to transform the evaluation result into the same format as the example below.

Example Output File After Transform

{
  "config": {
    "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct"
  },
  "results": {
    "xcopa_tha_seacrowd_qa": {
      "accuracy": 0.522
    },
    "wisesight_thai_sentiment_seacrowd_text": {
      "accuracy": 0.4545114189442156
    },
    "belebele_tha_thai_seacrowd_qa": {
      "accuracy": 0.4177777777777778
    },
    "xnli.tha_seacrowd_pairs": {
      "accuracy": 0.3407185628742515
    }
  }
}
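A transform method for a new pipeline could produce this shape along the following lines. This is a sketch under assumptions, not the code in scripts/transform_result.py: build_result and validate_result are hypothetical names, and the input is assumed to be a dictionary from result key to metric values. Only the "accuracy" metric appears in the example above; other pipelines may report different metric names.

```python
import json
from numbers import Real


def build_result(model_name, task_metrics):
    """Assemble the leaderboard result structure (hypothetical helper).

    task_metrics maps a result key to a dict of metric name -> value,
    e.g. {"xcopa_tha_seacrowd_qa": {"accuracy": 0.522}}.
    """
    return {
        "config": {"model_name": model_name},
        "results": {task: dict(metrics) for task, metrics in task_metrics.items()},
    }


def validate_result(result):
    """Raise ValueError if the structure does not match the example shape."""
    config = result.get("config")
    if not isinstance(config, dict) or not isinstance(config.get("model_name"), str):
        raise ValueError("config.model_name must be a string")
    results = result.get("results")
    if not isinstance(results, dict):
        raise ValueError("results must be an object keyed by result key")
    for task, metrics in results.items():
        if not isinstance(metrics, dict) or not metrics:
            raise ValueError(f"{task}: expected a non-empty metrics object")
        for name, value in metrics.items():
            # bool is a subclass of int, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, Real):
                raise ValueError(f"{task}.{name}: expected a number")


result = build_result(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    {"xcopa_tha_seacrowd_qa": {"accuracy": 0.522}},
)
validate_result(result)
print(json.dumps(result, indent=2))
```

Validating before upload catches a malformed file locally, before it is submitted to the results repository.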

Create a pull request on https://huggingface.co/spaces/ThaiLLM-Leaderboard/leaderboard that maps each new result key to a human-readable name in the mapper in leaderboard/read_evals.py.
