The model, data and code for the visual GUI Agent SeeClick

njucckevin/SeeClick

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

The model, data, and code for the paper: SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Release Plans:

  • GUI grounding benchmark: ScreenSpot
  • Data for the GUI grounding Pre-training of SeeClick
  • Inference code & model checkpoint
  • Other code and resources
  • Code for pre-training and evaluation on ScreenSpot
  • Code for collecting pre-training data

News: SeeClick is accepted by ACL 2024.


GUI Grounding Benchmark: ScreenSpot

ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1200 instructions from iOS, Android, macOS, Windows and Web environments, along with annotated element types (Text or Icon/Widget). See details and more examples in our paper.

Download the images and annotations of ScreenSpot (or download with Google Drive).

Each test sample contains:

  • img_filename: the interface screenshot file
  • instruction: human instruction
  • bbox: the bounding box of the target element corresponding to instruction
  • data_type: "icon"/"text", indicates the type of the target element
  • data_souce: interface platform, including iOS, Android, macOS, Windows and Web (Gitlab, Shop, Forum and Tool)
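
As a rough illustration of how these fields might be used, the sketch below checks whether a predicted click point lands inside a sample's target box. It assumes `bbox` is stored as `[x, y, width, height]` in pixels, which the field list above does not state; the sample values, the image size, and both helper names are hypothetical.

```python
def xywh_to_ratio_box(bbox, img_width, img_height):
    """Convert an assumed [x, y, width, height] pixel box to (left, top, right, bottom) ratios."""
    x, y, w, h = bbox
    return (x / img_width, y / img_height, (x + w) / img_width, (y + h) / img_height)


def point_hits_box(point, box):
    """Return True if a normalized (x, y) point lies inside a normalized box."""
    px, py = point
    left, top, right, bottom = box
    return left <= px <= right and top <= py <= bottom


# Hypothetical sample using the field names listed above (values are made up).
sample = {
    "img_filename": "example.png",
    "instruction": "add an event",
    "bbox": [100, 40, 200, 60],  # assumed [x, y, width, height] in pixels
    "data_type": "icon",
    "data_souce": "macos",
}

# Hypothetical screenshot size of 1000 x 500 pixels.
box = xywh_to_ratio_box(sample["bbox"], img_width=1000, img_height=500)
hit = point_hits_box((0.17, 0.12), box)
print(box, hit)
```

If the annotations turn out to store corner coordinates instead, only `xywh_to_ratio_box` would need to change.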

Examples of ScreenSpot

Evaluation Results

| LVLMs | Model Size | GUI Specific | Mobile Text | Mobile Icon/Widget | Desktop Text | Desktop Icon/Widget | Web Text | Web Icon/Widget | Average |
|---|---|---|---|---|---|---|---|---|---|
| MiniGPT-v2 | 7B | | 8.4% | 6.6% | 6.2% | 2.9% | 6.5% | 3.4% | 5.7% |
| Qwen-VL | 9.6B | | 9.5% | 4.8% | 5.7% | 5.0% | 3.5% | 2.4% | 5.2% |
| GPT-4V | - | | 22.6% | 24.5% | 20.2% | 11.8% | 9.2% | 8.8% | 16.2% |
| Fuyu | 8B | | 41.0% | 1.3% | 33.0% | 3.6% | 33.9% | 4.4% | 19.5% |
| CogAgent | 18B | | 67.0% | 24.0% | 74.2% | 20.0% | 70.4% | 28.6% | 47.4% |
| SeeClick | 9.6B | | 78.0% | 52.0% | 72.2% | 30.0% | 55.7% | 32.5% | 53.4% |
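
The Average column appears to be the unweighted mean of the six per-split accuracies rather than a per-sample average. This is inferred from the numbers, not stated in the table. A quick check on two rows, with a hypothetical helper name:

```python
# Per-split accuracies (%) copied from the table, in column order:
# Mobile Text, Mobile Icon/Widget, Desktop Text, Desktop Icon/Widget, Web Text, Web Icon/Widget.
rows = {
    "CogAgent": [67.0, 24.0, 74.2, 20.0, 70.4, 28.6],
    "SeeClick": [78.0, 52.0, 72.2, 30.0, 55.7, 32.5],
}
# Values from the Average column of the same rows.
reported = {"CogAgent": 47.4, "SeeClick": 53.4}


def macro_average(values):
    """Unweighted mean over the splits."""
    return sum(values) / len(values)


for name, values in rows.items():
    print(f"{name}: computed {macro_average(values):.2f}%, reported {reported[name]}%")
```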

GUI Grounding Pre-training Data for SeeClick

Check data for the GUI grounding pre-training datasets, including the first open-source, large-scale web GUI grounding corpus collected from Common Crawl.


Inference code & model checkpoint

SeeClick is built on Qwen-VL and is compatible with its Transformers 🤗 inference code.

Only a few lines of code are needed, as in the example below.

Before running, set up the environment and install the required packages.

pip install -r requirements.txt

Note: If you want to fine-tune the model, you should follow the setup and install with requirements_agent.txt.

Then,

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("SeeClick-ckpt-dir", device_map="cuda", trust_remote_code=True, bf16=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

img_path = "assets/test_img.png"
prompt = "In this UI screenshot, what is the position of the element corresponding to the command \"{}\" (with point)?"
# prompt = "In this UI screenshot, what is the position of the element corresponding to the command \"{}\" (with bbox)?"  # Use this prompt for generating bounding box
# Example instructions with their expected responses; keep only one active at a time.
# ref = "add an event"   # response (0.17,0.06)
# ref = "switch to Year"   # response (0.59,0.06)
ref = "search for events"   # response (0.82,0.06)
query = tokenizer.from_list_format([
    {'image': img_path}, # Either a local path or an url
    {'text': prompt.format(ref)},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

The SeeClick checkpoint can be downloaded from Hugging Face. Replace SeeClick-ckpt-dir with the actual checkpoint directory.

The prediction is either a point (x, y) or a bounding box (left, top, right, down). Each value is a decimal in [0, 1] giving the ratio of that position to the image width or height. We recommend predicting points, because SeeClick is mainly trained to predict click points on GUIs.
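
To act on a prediction, the ratios usually need to be mapped back to pixel coordinates. The sketch below assumes the response contains only the coordinates, formatted like the example responses above (for instance `(0.82,0.06)`). The function names and image sizes are hypothetical and not part of the SeeClick code.

```python
import re


def parse_prediction(response):
    """Extract 2 (point) or 4 (bounding box) ratio values from a response string."""
    values = tuple(float(v) for v in re.findall(r"\d*\.?\d+", response))
    if len(values) not in (2, 4):
        raise ValueError(f"expected 2 or 4 numbers, got {len(values)}: {response!r}")
    return values


def to_pixels(values, img_width, img_height):
    """Scale ratios to pixels: even positions are x (width), odd positions are y (height)."""
    return tuple(
        round(v * (img_width if i % 2 == 0 else img_height))
        for i, v in enumerate(values)
    )


point = parse_prediction("(0.82,0.06)")
# Hypothetical screenshot size of 1000 x 500 pixels.
print(to_pixels(point, img_width=1000, img_height=500))
```

The same two helpers cover the bounding-box prompt, since a box is just two (x, y) pairs in sequence.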

Thanks to Qwen-VL for their powerful model and wonderful open-sourced work.


Downstream Agent Task

Check here to get details of training and testing on three downstream agent tasks, which also provides a guideline for fine-tuning SeeClick.

bash finetune/finetune_lora_ds.sh --save-name SeeClick_test --max-length 704 --micro-batch-size 4 --save-interval 500 \
    --train-epochs 10 --nproc-per-node 2 --data-path xxxx/data_sft.json --learning-rate 3e-5 \
    --gradient-accumulation-steps 8 --qwen-ckpt xxxx/Qwen-VL-Chat --pretrain-ckpt xxxx/SeeClick-pretrain \
    --save-path xxxx/checkpoint_qwen
  • data-path: generated sft data, the format can be found here
  • qwen-ckpt: original Qwen-VL checkpoint path, used to load the tokenizer
  • pretrain-ckpt: base model for fine-tuning, e.g. SeeClick-pretrain or Qwen-VL
  • save-path: directory to save training checkpoints

The fine-tuning scripts are similar to Qwen-VL's, except that we use LoRA to fine-tune customized parameters, as in finetune/finetune.py lines 315-327. This script fine-tunes the pre-trained LVLM with LoRA and multi-GPU training; for more options such as full fine-tuning, Q-LoRA and single-GPU training, please refer to Qwen-VL.
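
When adapting the command to a different number of GPUs, it can help to keep the effective batch size in mind. The sketch below assumes the flags carry their usual data-parallel meaning (samples per optimizer step = micro-batch size × accumulation steps × processes). That is an assumption about the script, not something documented here, and the helper name is hypothetical.

```python
def effective_batch_size(micro_batch_size, gradient_accumulation_steps, nproc_per_node, num_nodes=1):
    """Samples contributing to one optimizer step under plain data parallelism."""
    return micro_batch_size * gradient_accumulation_steps * nproc_per_node * num_nodes


# Flag values from the downstream fine-tuning command above.
finetune_batch = effective_batch_size(
    micro_batch_size=4, gradient_accumulation_steps=8, nproc_per_node=2
)
# Flag values from the GUI grounding pre-training command in this README.
pretrain_batch = effective_batch_size(
    micro_batch_size=8, gradient_accumulation_steps=1, nproc_per_node=8
)
print(finetune_batch, pretrain_batch)
```

Under this assumption both commands reach 64 samples per step. With fewer GPUs, raising `--gradient-accumulation-steps` proportionally would keep that number unchanged.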


Pre-training and Evaluation on ScreenSpot

You can easily organize the above data yourself for model training and testing on ScreenSpot. As an alternative, we provide a set of scripts used for data processing, pre-training, and testing on ScreenSpot.

cd pretrain

Data Processing for Pre-Training

python pretrain_process.py --mobile_imgs xxxx/combined --web_imgs xxxx/seeclick_web_imgs \
    --widgetcap_json xxxx/widget_captioning.json --ricosca_json xxxx/ricosca.json \
    --screensum_json xxxx/screen_captioning.json --web_json xxxx/seeclick_web.json \
    --coco_imgs xxxx/coco/train2017 --llava_json xxxx/llava_instruct_150k.jsonl

This generates a dataset of about 1M samples for continual pre-training at ../data/sft_train.json.

GUI Grounding Pre-training

cd ..
bash finetune/finetune_lora_ds.sh --save-name seeclick_sft --max-length 768 --micro-batch-size 8 \
    --save-interval 4000 --train-epochs 3 --nproc-per-node 8 --data-path ./data/sft_train.json \
    --learning-rate 3e-5 --gradient-accumulation-steps 1 --qwen-ckpt xxxx/Qwen-VL-Chat \
    --pretrain-ckpt xxxx/Qwen-VL-Chat --save-path xxxx/checkpoint_qwen

Evaluation on ScreenSpot

cd pretrain
python screenspot_test.py --qwen_path xxxx/Qwen-VL-Chat --lora_path xxxx/checkpoint_qwen/seeclick_sft/checkpoint-20000 --screenspot_imgs xxxx/screenspot_imgs --screenspot_test xxxx/ScreenSpot --task all

Collecting Pre-training Data from Common Crawl

We used Selenium to crawl web pages from Common Crawl. See details in this repo.


Citation

@inproceedings{cheng2024seeclick,
    title = "{S}ee{C}lick: Harnessing {GUI} Grounding for Advanced Visual {GUI} Agents",
    author = "Cheng, Kanzhi  and
      Sun, Qiushi  and
      Chu, Yougang  and
      Xu, Fangzhi  and
      Li, Yantao  and
      Zhang, Jianbing  and
      Wu, Zhiyong",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.505",
    pages = "9313--9332"
}

License

This project incorporates specific datasets and checkpoints governed by their original licenses. Users are required to adhere to all terms of these licenses. No additional restrictions are imposed by this project beyond those specified in the original licenses.