STBench: Assessing the Ability of Large Language Models in Spatio-Temporal Analysis

Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS)

🤗 Hugging Face Dataset • 📃 Paper

[Overview figure: overview.png]

STBench is a benchmark to evaluate the ability of large language models in spatio-temporal analysis. This benchmark consists of 15 distinct tasks and over 70,000 question-answer pairs, covering four dimensions: knowledge comprehension, spatio-temporal reasoning, accurate computation and downstream applications.

All data samples in STBench are in the form of text completion. An instance is as follows:

Question: Below is the coordinate information and related comments of a point of interest: ... Please answer the category of this point of interest.
Options: (1) xxxx, (2) xxxx, (3) xxxx, ...
Please answer one option.
Answer: The answer is option (

The model is expected to complete the text, i.e., it should generate an option number. Therefore, to benchmark a model with STBench, it is necessary to use a text completion API rather than a chat completion API. For chat models that only provide a chat completion API, we suggest instructing the model to complete the text through the system prompt:

[{"role": "system", "content": "you are a helpful text completion assistant. Please continue writing the text entered by the human."}, {"role": "human", "content": "Question: Below is the coordinate information and related comments of a point of interest: ... Please answer the category of this point of interest.\nOptions: (1) xxxx, (2) xxxx, (3) xxxx, ...\nPlease answer one option.\nAnswer: The answer is option ("}]

Quick Start

We have benchmarked 13 distinct large language models, and here we provide a simple guide to reproducing our experiments.

  1. Dependency Installation

    Run the following command to install dependencies:

    pip install -r requirements.txt
  2. Model Downloading

    Our experiments on open-source models are based on ModelScope, and these models can be downloaded with the following commands (a download sketch is given after this list):

    cd code
    python download_llms.py
  3. Basic Prompting

    Run the following command to benchmark all models on all 15 tasks:

    python basic_prompting.py
  4. In-Context Learning

    Run the following command to evaluate the performance of all models with in-context learning:

    python icl_prompting.py
  5. Chain-of-Thought Prompting

    To conduct experiments with chain-of-thought prompting for all models, run the following command:

    python cot_prompting.py
  6. Fine-tuning

    Run the following command to fine-tune the model and evaluate the fine-tuned model:

    python fine_tuning.py
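
As a reference for step 2 above, here is a minimal sketch of how a single open-source model can be fetched through ModelScope; the model ID is only an example, and code/download_llms.py contains the full list of models used in the experiments.

    from modelscope import snapshot_download

    # Download the model files from the ModelScope hub into the local cache
    # and return the local directory (the model ID below is illustrative).
    model_dir = snapshot_download("ZhipuAI/chatglm2-6b")
    print(model_dir)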

Detailed Usage

This repository is organized as follows:

Project
  |—— LICENSE
  |—— overview.png
  |—— README.md
  |—— requirements.txt
  |—— datasets                  # all datasets can be found in this directory
      |—— basic                 # the main datasets of STBench, consisting of over 60,000 QA pairs
      |—— icl                   # two samples for each task to perform two-shot prompting
      |—— cot                   # two samples containing reasoning for each task to perform CoT prompting
      |—— sft                   # training datasets and validation datasets for fine-tuning
  |—— code
      |—— model_inference       # calling the API of each large language model
      |—— model_finetuning      # fine-tuning code
      |—— download_llms.py      # downloading open-source models
      |—— basic_prompting.py    # running experiments with basic prompting
      |—— icl_prompting.py      # running experiments with icl prompting
      |—— cot_prompting.py      # running experiments with cot prompting
      |—— fine_tuning.py        # running experiments with fine-tuning
      |—— result_parser.py      # code for identifying the final answer of the model
      |—— config.py             # configuration declarations, e.g. the dataset file path for each task
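
As an illustration of how the files under datasets/basic are consumed, the sketch below reads one JSONL file and iterates over its samples; the file name and the field names are assumptions for illustration, so check the actual files for the exact schema.

    import json

    # Hypothetical file name; see datasets/basic/ for the real file names.
    path = "datasets/basic/poi_category_recognition.jsonl"

    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            # Assumed fields: a completion-style prompt and its ground-truth answer.
            prompt = sample["question"]
            answer = sample["answer"]
            # `prompt` is sent to a text completion API; the continuation is then
            # checked against `answer` by the task's result parser.
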
  1. To benchmark a new model, say NEW_MODEL

    a. Write your code for calling the API of this model in code/model_inference/new_model.py, and modify code/model_inference/__init__.py accordingly.

    b. Add the model to the model list in code/basic_prompting.py

  2. To include a new dataset, say new_dataset.jsonl, for a new task NEW_TASK

    a. Put your dataset here: datasets/basic/new_dataset.jsonl

    b. Modify code/result_parser.py and implement a function new_task_parser() to parse the final answer from the output of the LLMs (a minimal parser sketch is given after this list)

    c. Modify code/config.py to specify the mapping from NEW_TASK to the dataset path datasets/basic/new_dataset.jsonl and the mapping from NEW_TASK to the result parser new_task_parser()

    d. Add the task to the task list in code/basic_prompting.py
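
As a reference for step 2b, here is a minimal sketch of what a result parser for a multiple-choice task could look like; the regular expression is illustrative, and code/result_parser.py contains the parsers actually used by STBench.

    import re

    def new_task_parser(model_output: str):
        """Parse the option number chosen by the model. With the completion
        format used by STBench, the output typically begins with the option
        number, e.g. "3) xxxx ...". Returns None if no number is found."""
        match = re.search(r"\d+", model_output)
        return int(match.group()) if match else None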

Experimental Results

  • Accuracy and MAE are shown in the following table. Tasks are grouped by dimension: Knowledge Comprehension (PCR, PI, URFR, ARD), Spatio-temporal Reasoning (PTRD, PRRD, TRRD, TI), Accurate Computation (DD, NAV, TTRA) and Downstream Applications (FP, TAD, TC, TP). The FP and TP columns report MAE (lower is better); all other columns report accuracy.

    Model        PCR     PI      URFR    ARD     PTRD    PRRD    TRRD    TI      DD      NAV     TTRA    FP      TAD     TC      TP
    ChatGPT      0.7926  0.5864  0.3978  0.8358  0.7525  0.9240  0.0258  0.3342  0.1698  0.4384  0.1048  37.33   0.5382  0.4475  -
    GPT-4o       0.9588  0.7268  0.6026  0.9656  -       0.9188  0.1102  0.4416  0.5434  0.7552  0.3404  43.25   0.6016  -       -
    ChatGLM2     0.2938  0.5004  0.2661  0.2176  0.2036  0.5216  0.2790  0.5000  0.1182  0.2924  0.1992  63.72   0.5000  0.3333  231.2
    ChatGLM3     0.4342  0.5272  0.2704  0.2872  0.3058  0.8244  0.1978  0.6842  0.1156  0.2576  0.1828  59.24   0.5000  0.3111  224.5
    Phi-2        -       0.5267  -       0.2988  -       -       -       0.5000  0.1182  0.2912  0.0658  34.82   0.5000  0.3333  206.9
    Llama-2-7B   0.2146  0.4790  0.2105  0.2198  0.2802  0.6606  0.2034  0.5486  0.1256  0.2774  0.2062  53.79   0.5098  0.3333  189.3
    Vicuna-7B    0.3858  0.5836  0.2063  0.2212  0.3470  0.7080  0.1968  0.5000  0.1106  0.2588  0.1728  48.19   0.5000  0.2558  188.1
    Gemma-2B     0.2116  0.5000  0.1989  0.1938  0.4688  0.5744  0.2014  0.5000  0.1972  0.2592  0.2038  41.79   0.5000  0.3333  207.7
    Gemma-7B     0.4462  0.5000  0.2258  0.2652  0.3782  0.9044  0.1992  0.5000  0.1182  0.3886  0.1426  31.85   0.5000  0.3333  139.4
    DeepSeek-7B  0.2160  0.4708  0.2071  0.1938  0.2142  0.6424  0.1173  0.4964  0.1972  0.3058  0.1646  56.89   0.5000  0.3333  220.8
    Falcon-7B    0.1888  0.5112  0.1929  0.1928  0.1918  0.4222  0.2061  0.7072  0.1365  0.2610  0.2124  62.52   0.5000  0.3309  3572.8
    Mistral-7B   0.3526  0.4918  0.2168  0.3014  0.4476  0.7098  0.0702  0.4376  0.1182  0.3006  0.1094  42.59   0.5000  0.3333  156.8
    Qwen-7B      0.2504  0.6795  0.2569  0.2282  0.2272  0.5762  0.1661  0.4787  0.1324  0.3106  0.2424  53.49   0.5049  0.3477  205.2
    Yi-6B        0.3576  0.5052  0.2149  0.1880  0.5536  0.8264  0.1979  0.5722  0.1284  0.3336  0.2214  52.03   0.5000  0.3333  156.2
