Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS)
🤗 Hugging Face Dataset • 📃 Paper
STBench is a benchmark to evaluate the ability of large language models in spatio-temporal analysis. This benchmark consists of 15 distinct tasks and over 70,000 question-answer pairs, covering four dimensions: knowledge comprehension, spatio-temporal reasoning, accurate computation and downstream applications.
All data samples in STBench are in the form of text completion. An instance is as follows:

```text
Question: Below is the coordinate information and related comments of a point of interest: ... Please answer the category of this point of interest.
Options: (1) xxxx, (2) xxxx, (3) xxxx, ...
Please answer one option.
Answer: The answer is option (
```
The model is expected to complete the text, i.e., it should generate an option number. Therefore, benchmarking a model with STBench requires a text completion API rather than a chat completion API. For chat models that only provide a chat completion API, we suggest instructing the model to complete the text through the system prompt:
[{"role": "system", "content": "you are a helpful text completion assistant. Please continue writing the text entered by the human."}, {"role": "human", "content": "Question: Below is the coordinate information and related comments of a point of interest: ... Please answer the category of this point of interest.\nOptions: (1) xxxx, (2) xxxx, (3) xxxx, ...\nPlease answer one option.\nAnswer: The answer is option ("}]
We have benchmarked 13 distinct large language models, and here we provide a simple guide to reproducing our experiments.
- **Dependency Installation**

  Run the following command to install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- **Model Downloading**

  Our experiments on open-source models are based on modelscope, and these open-source models can be downloaded with the following commands (a minimal download sketch is also shown after this list):

  ```bash
  cd code
  python download_llms.py
  ```
- **Basic Prompting**

  Run the following command to benchmark all models on all 15 tasks:

  ```bash
  python basic_prompting.py
  ```
- **In-Context Learning**

  Run the following command to evaluate the performance of all models with in-context learning:

  ```bash
  python icl_prompting.py
  ```
- **Chain-of-Thought Prompting**

  To conduct experiments with chain-of-thought prompting for all models, run the following command:

  ```bash
  python cot_prompting.py
  ```
- **Fine-tuning**

  Run the following command to fine-tune the model and evaluate the fine-tuned model:

  ```bash
  python fine_tuning.py
  ```
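As mentioned in the Model Downloading step above, a single open-source model can also be fetched directly with modelscope's `snapshot_download`. The sketch below uses a placeholder model id and cache directory; `code/download_llms.py` remains the script used for our experiments.

```python
# Minimal sketch: downloading one open-source model from ModelScope.
# The model id and cache_dir are placeholders; code/download_llms.py is the
# script actually used to download all models for our experiments.
from modelscope.hub.snapshot_download import snapshot_download

model_dir = snapshot_download(
    "ZhipuAI/chatglm3-6b",   # placeholder model id
    cache_dir="./models",    # where the weights are stored locally
)
print(f"Model downloaded to {model_dir}")
```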
This repository is organized as follows:

```text
Project
|—— LICENSE
|—— overview.png
|—— README.md
|—— requirements.txt
|—— datasets                  # all datasets can be found in this directory
|   |—— basic                 # the main datasets of STBench, consisting of over 60,000 QA pairs
|   |—— icl                   # two samples for each task to perform two-shot prompting
|   |—— cot                   # two samples containing reasoning for each task to perform CoT prompting
|   |—— sft                   # training and validation datasets for fine-tuning
|—— code
    |—— model_inference       # calling the API of each large language model
    |—— model_finetuning      # fine-tuning code
    |—— download_llms.py      # downloading open-source models
    |—— basic_prompting.py    # running experiments with basic prompting
    |—— icl_prompting.py      # running experiments with ICL prompting
    |—— cot_prompting.py      # running experiments with CoT prompting
    |—— fine_tuning.py        # running experiments with fine-tuning
    |—— result_parser.py      # identifying the final answer from the output of the model
    |—— config.py             # configuration such as the file path for each task
```
- To benchmark a new model, namely NEW_MODEL (see the first sketch after this list):

  a. Write your code for calling the API of this model in `code/model_inference/new_model.py`, and modify `code/model_inference/__init__.py` accordingly.

  b. Add the model to the model list in `code/basic_prompting.py`.
- To include a new dataset, namely `new_dataset.jsonl`, for a task NEW_TASK (see the second sketch after this list):

  a. Put your dataset here: `dataset/basic/new_dataset.jsonl`.

  b. Modify `code/result_parser.py` and implement your function `new_task_parser()` to parse the results from the output of the LLMs.

  c. Modify `code/config.py` to specify the mapping from NEW_TASK to the dataset path `dataset/basic/new_dataset.jsonl` and the mapping from NEW_TASK to the result parser `new_task_parser()`.

  d. Add the task to the task list in `code/basic_prompting.py`.
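For step (a) of adding a new model, a wrapper typically exposes a single function that takes the completion prompt and returns the generated text. The sketch below is purely illustrative: the function name, endpoint, and response format are placeholders rather than the repository's actual interface; follow the existing wrappers in `code/model_inference/` for the real conventions.

```python
# code/model_inference/new_model.py -- illustrative sketch only.
# The endpoint, credential, and response schema are placeholders.
import requests

API_URL = "https://example.com/v1/completions"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                        # placeholder credential

def new_model_inference(prompt: str, max_tokens: int = 32) -> str:
    """Send a text-completion request to NEW_MODEL and return the raw completion."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "temperature": 0}
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.post(API_URL, json=payload, headers=headers, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["text"]
```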
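For step (b) of adding a new dataset, a parser for a multiple-choice task usually just extracts the chosen option number from the model output. The following is a hypothetical sketch of `new_task_parser()`; the real parsers in `code/result_parser.py` may use different conventions.

```python
# Hypothetical sketch of a result parser for a multiple-choice NEW_TASK.
import re

def new_task_parser(model_output: str):
    """Return the first option number found in the model output, or None."""
    # The prompt ends with "The answer is option (", so the completion usually
    # starts with something like "3) ..."; fall back to any standalone digit.
    match = re.search(r"\(?\s*(\d+)\s*\)?", model_output)
    return int(match.group(1)) if match else None
```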
The accuracy and MAE of each model on each task are shown in the following table:
The tasks are grouped into four dimensions: Knowledge Comprehension (PCR, PI, URFR, ARD), Spatio-temporal Reasoning (PTRD, PRRD, TRRD, TI), Accurate Computation (DD, NAV, TTRA), and Downstream Applications (FP, TAD, TC, TP).
| Model | PCR | PI | URFR | ARD | PTRD | PRRD | TRRD | TI | DD | NAV | TTRA | FP | TAD | TC | TP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ChatGPT | 0.7926 | 0.5864 | 0.3978 | 0.8358 | 0.7525 | 0.9240 | 0.0258 | 0.3342 | 0.1698 | 0.4384 | 0.1048 | 37.33 | 0.5382 | 0.4475 | - |
| GPT-4o | 0.9588 | 0.7268 | 0.6026 | 0.9656 | - | 0.9188 | 0.1102 | 0.4416 | 0.5434 | 0.7552 | 0.3404 | 43.25 | 0.6016 | - | - |
| ChatGLM2 | 0.2938 | 0.5004 | 0.2661 | 0.2176 | 0.2036 | 0.5216 | 0.2790 | 0.5000 | 0.1182 | 0.2924 | 0.1992 | 63.72 | 0.5000 | 0.3333 | 231.2 |
| ChatGLM3 | 0.4342 | 0.5272 | 0.2704 | 0.2872 | 0.3058 | 0.8244 | 0.1978 | 0.6842 | 0.1156 | 0.2576 | 0.1828 | 59.24 | 0.5000 | 0.3111 | 224.5 |
| Phi-2 | - | 0.5267 | - | 0.2988 | - | - | - | 0.5000 | 0.1182 | 0.2912 | 0.0658 | 34.82 | 0.5000 | 0.3333 | 206.9 |
| Llama-2-7B | 0.2146 | 0.4790 | 0.2105 | 0.2198 | 0.2802 | 0.6606 | 0.2034 | 0.5486 | 0.1256 | 0.2774 | 0.2062 | 53.79 | 0.5098 | 0.3333 | 189.3 |
| Vicuna-7B | 0.3858 | 0.5836 | 0.2063 | 0.2212 | 0.3470 | 0.7080 | 0.1968 | 0.5000 | 0.1106 | 0.2588 | 0.1728 | 48.19 | 0.5000 | 0.2558 | 188.1 |
| Gemma-2B | 0.2116 | 0.5000 | 0.1989 | 0.1938 | 0.4688 | 0.5744 | 0.2014 | 0.5000 | 0.1972 | 0.2592 | 0.2038 | 41.79 | 0.5000 | 0.3333 | 207.7 |
| Gemma-7B | 0.4462 | 0.5000 | 0.2258 | 0.2652 | 0.3782 | 0.9044 | 0.1992 | 0.5000 | 0.1182 | 0.3886 | 0.1426 | 31.85 | 0.5000 | 0.3333 | 139.4 |
| DeepSeek-7B | 0.2160 | 0.4708 | 0.2071 | 0.1938 | 0.2142 | 0.6424 | 0.1173 | 0.4964 | 0.1972 | 0.3058 | 0.1646 | 56.89 | 0.5000 | 0.3333 | 220.8 |
| Falcon-7B | 0.1888 | 0.5112 | 0.1929 | 0.1928 | 0.1918 | 0.4222 | 0.2061 | 0.7072 | 0.1365 | 0.2610 | 0.2124 | 62.52 | 0.5000 | 0.3309 | 3572.8 |
| Mistral-7B | 0.3526 | 0.4918 | 0.2168 | 0.3014 | 0.4476 | 0.7098 | 0.0702 | 0.4376 | 0.1182 | 0.3006 | 0.1094 | 42.59 | 0.5000 | 0.3333 | 156.8 |
| Qwen-7B | 0.2504 | 0.6795 | 0.2569 | 0.2282 | 0.2272 | 0.5762 | 0.1661 | 0.4787 | 0.1324 | 0.3106 | 0.2424 | 53.49 | 0.5049 | 0.3477 | 205.2 |
| Yi-6B | 0.3576 | 0.5052 | 0.2149 | 0.1880 | 0.5536 | 0.8264 | 0.1979 | 0.5722 | 0.1284 | 0.3336 | 0.2214 | 52.03 | 0.5000 | 0.3333 | 156.2 |