Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS)
🤗 Hugging Face Dataset • 📃 Paper
STBench is a benchmark to evaluate the ability of large language models in spatio-temporal analysis. This benchmark consists of 15 distinct tasks and over 70,000 question-answer pairs, covering four dimensions: knowledge comprehension, spatio-temporal reasoning, accurate computation and downstream applications.
All data samples in STBench are in the form of text completion. An instance is as follows:

```text
Question: Below is the coordinate information and related comments of a point of interest: ... Please answer the category of this point of interest.
Options: (1) xxxx, (2) xxxx, (3) xxxx, ...
Please answer one option.
Answer: The answer is option (
```
The model is expected to complete the text, i.e., it should generate an option number. Therefore, benchmarking a model with STBench requires a text completion API rather than a chat completion API. For chat models that only provide a chat completion API, we suggest instructing the model to complete the text through the system prompt:
[{"role": "system", "content": "you are a helpful text completion assistant. Please continue writing the text entered by the human."}, {"role": "human", "content": "Question: Below is the coordinate information and related comments of a point of interest: ... Please answer the category of this point of interest.\nOptions: (1) xxxx, (2) xxxx, (3) xxxx, ...\nPlease answer one option.\nAnswer: The answer is option ("}]
We have benchmarked 13 distinct large language models, and here we provide a simple guide to reproducing our experiments.
- **Dependency Installation**

  Run the following command to install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- **Model Downloading**

  Our experiments on open-source models are based on modelscope, and these open-source models can be downloaded with the following commands (a minimal download sketch is also shown after this list):

  ```bash
  cd code
  python download_llms.py
  ```
- **Basic Prompting**

  Run the following command to benchmark all models on all 15 tasks:

  ```bash
  python basic_prompting.py
  ```
- **In-Context Learning**

  Run the following command to evaluate the performance of all models with in-context learning:

  ```bash
  python icl_prompting.py
  ```
- **Chain-of-Thought Prompting**

  To conduct experiments with chain-of-thought prompting for all models, run the following command:

  ```bash
  python cot_prompting.py
  ```
- **Fine-tuning**

  Run the following command to fine-tune the model and evaluate the fine-tuned model:

  ```bash
  python fine_tuning.py
  ```
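As mentioned in the Model Downloading step above, a single open-source model can also be fetched directly with modelscope's `snapshot_download`. The sketch below uses a placeholder model id and cache directory; `code/download_llms.py` remains the script used for our experiments.

```python
# Minimal sketch: downloading one open-source model from ModelScope.
# The model id and cache_dir are placeholders; code/download_llms.py is the
# script actually used to download all models for our experiments.
from modelscope.hub.snapshot_download import snapshot_download

model_dir = snapshot_download(
    "ZhipuAI/chatglm3-6b",   # placeholder model id
    cache_dir="./models",    # where the weights are stored locally
)
print(f"Model downloaded to {model_dir}")
```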
This repository is organized as follows:

```text
Project
|—— LICENSE
|—— overview.png
|—— README.md
|—— requirements.txt
|—— datasets                  # all datasets can be found in this directory
|   |—— basic                 # the main datasets of STBench, consisting of over 60,000 QA pairs
|   |—— icl                   # two samples for each task to perform two-shot prompting
|   |—— cot                   # two samples containing reasoning for each task to perform CoT prompting
|   |—— sft                   # training and validation datasets for fine-tuning
|—— code
    |—— model_inference       # calling the API of each large language model
    |—— model_finetuning      # fine-tuning code
    |—— download_llms.py      # downloading open-source models
    |—— basic_prompting.py    # running experiments with basic prompting
    |—— icl_prompting.py      # running experiments with ICL prompting
    |—— cot_prompting.py      # running experiments with CoT prompting
    |—— fine_tuning.py        # running experiments with fine-tuning
    |—— result_parser.py      # identifying the final answer from the output of the model
    |—— config.py             # configuration such as the file path for each task
```
- To benchmark a new model, namely NEW_MODEL (see the first sketch after this list):

  a. Write your code for calling the API of this model in `code/model_inference/new_model.py`, and modify `code/model_inference/__init__.py` accordingly.

  b. Add the model to the model list in `code/basic_prompting.py`.
- To include a new dataset, namely `new_dataset.jsonl`, for a task NEW_TASK (see the second sketch after this list):

  a. Put your dataset here: `dataset/basic/new_dataset.jsonl`.

  b. Modify `code/result_parser.py` and implement your function `new_task_parser()` to parse the results from the output of the LLMs.

  c. Modify `code/config.py` to specify the mapping from NEW_TASK to the dataset path `dataset/basic/new_dataset.jsonl` and the mapping from NEW_TASK to the result parser `new_task_parser()`.

  d. Add the task to the task list in `code/basic_prompting.py`.
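For step (a) of adding a new model, a wrapper typically exposes a single function that takes the completion prompt and returns the generated text. The sketch below is purely illustrative: the function name, endpoint, and response format are placeholders rather than the repository's actual interface; follow the existing wrappers in `code/model_inference/` for the real conventions.

```python
# code/model_inference/new_model.py -- illustrative sketch only.
# The endpoint, credential, and response schema are placeholders.
import requests

API_URL = "https://example.com/v1/completions"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                        # placeholder credential

def new_model_inference(prompt: str, max_tokens: int = 32) -> str:
    """Send a text-completion request to NEW_MODEL and return the raw completion."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "temperature": 0}
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.post(API_URL, json=payload, headers=headers, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["text"]
```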
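For step (b) of adding a new dataset, a parser for a multiple-choice task usually just extracts the chosen option number from the model output. The following is a hypothetical sketch of `new_task_parser()`; the real parsers in `code/result_parser.py` may use different conventions.

```python
# Hypothetical sketch of a result parser for a multiple-choice NEW_TASK.
import re

def new_task_parser(model_output: str):
    """Return the first option number found in the model output, or None."""
    # The prompt ends with "The answer is option (", so the completion usually
    # starts with something like "3) ..."; fall back to any standalone digit.
    match = re.search(r"\(?\s*(\d+)\s*\)?", model_output)
    return int(match.group(1)) if match else None
```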
The accuracy and MAE of each model on each task are shown in the following table:
The tasks are grouped into four dimensions: Knowledge Comprehension (PCR, PI, URFR, ARD), Spatio-temporal Reasoning (PTRD, PRRD, TRRD, TI), Accurate Computation (DD, NAV, TTRA), and Downstream Applications (FP, TAD, TC, TP).
| Model | PCR | PI | URFR | ARD | PTRD | PRRD | TRRD | TI | DD | NAV | TTRA | FP | TAD | TC | TP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ChatGPT | 0.7926 | 0.5864 | 0.3978 | 0.8358 | 0.7525 | 0.9240 | 0.0258 | 0.3342 | 0.1698 | 0.4384 | 0.1048 | 37.33 | 0.5382 | 0.4475 | - |
| GPT-4o | 0.9588 | 0.7268 | 0.6026 | 0.9656 | - | 0.9188 | 0.1102 | 0.4416 | 0.5434 | 0.7552 | 0.3404 | 43.25 | 0.6016 | - | - |
| ChatGLM2 | 0.2938 | 0.5004 | 0.2661 | 0.2176 | 0.2036 | 0.5216 | 0.2790 | 0.5000 | 0.1182 | 0.2924 | 0.1992 | 63.72 | 0.5000 | 0.3333 | 231.2 |
| ChatGLM3 | 0.4342 | 0.5272 | 0.2704 | 0.2872 | 0.3058 | 0.8244 | 0.1978 | 0.6842 | 0.1156 | 0.2576 | 0.1828 | 59.24 | 0.5000 | 0.3111 | 224.5 |
| Phi-2 | - | 0.5267 | - | 0.2988 | - | - | - | 0.5000 | 0.1182 | 0.2912 | 0.0658 | 34.82 | 0.5000 | 0.3333 | 206.9 |
| Llama-2-7B | 0.2146 | 0.4790 | 0.2105 | 0.2198 | 0.2802 | 0.6606 | 0.2034 | 0.5486 | 0.1256 | 0.2774 | 0.2062 | 53.79 | 0.5098 | 0.3333 | 189.3 |
| Vicuna-7B | 0.3858 | 0.5836 | 0.2063 | 0.2212 | 0.3470 | 0.7080 | 0.1968 | 0.5000 | 0.1106 | 0.2588 | 0.1728 | 48.19 | 0.5000 | 0.2558 | 188.1 |
| Gemma-2B | 0.2116 | 0.5000 | 0.1989 | 0.1938 | 0.4688 | 0.5744 | 0.2014 | 0.5000 | 0.1972 | 0.2592 | 0.2038 | 41.79 | 0.5000 | 0.3333 | 207.7 |
| Gemma-7B | 0.4462 | 0.5000 | 0.2258 | 0.2652 | 0.3782 | 0.9044 | 0.1992 | 0.5000 | 0.1182 | 0.3886 | 0.1426 | 31.85 | 0.5000 | 0.3333 | 139.4 |
| DeepSeek-7B | 0.2160 | 0.4708 | 0.2071 | 0.1938 | 0.2142 | 0.6424 | 0.1173 | 0.4964 | 0.1972 | 0.3058 | 0.1646 | 56.89 | 0.5000 | 0.3333 | 220.8 |
| Falcon-7B | 0.1888 | 0.5112 | 0.1929 | 0.1928 | 0.1918 | 0.4222 | 0.2061 | 0.7072 | 0.1365 | 0.2610 | 0.2124 | 62.52 | 0.5000 | 0.3309 | 3572.8 |
| Mistral-7B | 0.3526 | 0.4918 | 0.2168 | 0.3014 | 0.4476 | 0.7098 | 0.0702 | 0.4376 | 0.1182 | 0.3006 | 0.1094 | 42.59 | 0.5000 | 0.3333 | 156.8 |
| Qwen-7B | 0.2504 | 0.6795 | 0.2569 | 0.2282 | 0.2272 | 0.5762 | 0.1661 | 0.4787 | 0.1324 | 0.3106 | 0.2424 | 53.49 | 0.5049 | 0.3477 | 205.2 |
| Yi-6B | 0.3576 | 0.5052 | 0.2149 | 0.1880 | 0.5536 | 0.8264 | 0.1979 | 0.5722 | 0.1284 | 0.3336 | 0.2214 | 52.03 | 0.5000 | 0.3333 | 156.2 |