GeoX is a multi-modal large model designed for automatic geometric problem solving, incorporating three progressive training stages to enhance diagram understanding and reasoning. In this paper, we validate that formal vision-language training is a simple yet effective paradigm for complex mathematical diagram learning.
Abstract
Despite their proficiency in general tasks, Multi-modal Large Language Models (MLLMs) struggle with automatic Geometry Problem Solving (GPS), which demands understanding diagrams, interpreting symbols, and performing complex reasoning. This limitation arises from their pre-training on natural images and texts, along with the lack of automated verification in the problem-solving process. Besides, current geometric specialists are limited by their task-specific designs, making them less effective for broader geometric problems. To this end, we present GeoX, a multi-modal large model focusing on geometric understanding and reasoning tasks. Given the significant differences between geometric diagram-symbol and natural image-text, we introduce unimodal pre-training to develop a diagram encoder and symbol decoder, enhancing the understanding of geometric images and corpora. Furthermore, we introduce geometry-language alignment, an effective pre-training paradigm that bridges the modality gap between unimodal geometric experts. We propose a Generator-And-Sampler Transformer (GS-Former) to generate discriminative queries and eliminate uninformative representations from unevenly distributed geometric signals. Finally, GeoX benefits from visual instruction tuning, empowering it to take geometric images and questions as input and generate verifiable solutions. Experiments show that GeoX outperforms both generalists and geometric specialists on publicly recognized benchmarks, such as GeoQA, UniGeo, Geometry3K, and PGPS9k. Our data and code will be released soon to accelerate future research on automatic GPS.
News
- [2024/12/30] The full version of the code and training scripts will be released within the next few days.
- [2024/10/17] Uploaded the paper and initialized the project. Released the data for GeoX. See here.
Environment Setup
Step 1. Build Dependencies. Our code is tested with Python 3.10.14. To run the code, first install the following packages:
conda create -n geox python=3.10
conda activate geox
pip install --upgrade pip
pip install -r requirements.txt
pip install flash-attn==2.5.9.post1 --no-build-isolation
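Optionally, you can run a quick sanity check to confirm that the core dependencies import correctly (this assumes a CUDA-capable GPU is visible, which flash-attn requires):
# Optional sanity check: confirm PyTorch sees a GPU and flash-attn imports cleanly.
python -c "import torch, flash_attn; print('torch', torch.__version__); print('cuda', torch.cuda.is_available()); print('flash-attn', flash_attn.__version__)"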
Data and Weights Preparation
Step 1. Download and Prepare Data.
- Follow the instructions here and download the full dataset for GeoX.
- To train the model, organize the files into the following folder structure (a quick file check is sketched after the listing):
./data/
alignment/
images/
unified_formal_annotations.json
geoqa/
images/
geoqa_train.json
geoqa_test.json
unigeo/
images/
unigeo_train.json
unigeo_test.json
geometry3k/
images/
geometry3k_train.json
geometry3k_test.json
pgps9k/
images/
pgps9k_train.json
pgps9k_test.json
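Before training, it may help to verify that the annotation files are where the training scripts expect them; a minimal check based on the layout above:
# Verify the expected annotation files exist under ./data/
for f in \
  alignment/unified_formal_annotations.json \
  geoqa/geoqa_train.json geoqa/geoqa_test.json \
  unigeo/unigeo_train.json unigeo/unigeo_test.json \
  geometry3k/geometry3k_train.json geometry3k/geometry3k_test.json \
  pgps9k/pgps9k_train.json pgps9k/pgps9k_test.json; do
  [ -f "./data/${f}" ] || echo "Missing: ./data/${f}"
done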
(Optional) Pretraining
# Pretrain Geo-ViT
BASE_DIR="/path/to/your/base/directory" # Modify this path as necessary
OUTPUT_DIR="/path/to/your/output/directory" # Modify this path as necessary
python ${BASE_DIR}/pretrain/pretrain_encoder.py \
--job_dir ${OUTPUT_DIR}/checkpoint/mae \
--nodes 1 \
--ngpus 8 \
--accum_iter 16 \
--batch_size 256 \
--use_volta32 \
--model mae_vit_base_patch16 \
--mask_ratio 0.75 \
--epochs 800 \
--warmup_epochs 40 \
--blr 1.5e-4 \
--weight_decay 0.05 \
--data_path ${BASE_DIR}/data # Ensure the data path is correctly parameterized
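The fine-tuning stage below loads the diagram encoder from ${MODEL_DIR}/geo-vit.pth (see the --vision_tower argument there). If you pretrain your own encoder with the command above, you would presumably place the resulting checkpoint at that location; a sketch, where the checkpoint file name is illustrative and may differ from what pretrain_encoder.py actually writes:
# Illustrative only: copy the pretrained encoder to where fine-tuning expects it.
# The actual checkpoint name under ${OUTPUT_DIR}/checkpoint/mae may differ.
MODEL_DIR="/path/to/your/model/directory"   # same directory used later for fine-tuning
cp "${OUTPUT_DIR}/checkpoint/mae/checkpoint-799.pth" "${MODEL_DIR}/geo-vit.pth"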
# Pretrain Geo-LLM
DATA_FILE="/path/to/your/training/data" # Modify this path as necessary
OUTPUT_DIR="/path/to/your/output/directory" # Modify this path as necessary
MODEL_DIR="/path/to/LLEMMA/directory" # Modify this path as necessary
LOG_FILE="${OUTPUT_DIR}/train.log"
if [ ! -d "${OUTPUT_DIR}" ]; then
mkdir -p "${OUTPUT_DIR}"
fi
GPU_DEVICES="" # Set to the GPU IDs to use, e.g. "0,1,2,3,4,5,6,7"
deepspeed --include=localhost:${GPU_DEVICES} \
main/train_llm.py \
--config_name "${MODEL_DIR}/config.json" \
--tokenizer_name "${MODEL_DIR}" \
--model_name_or_path "${MODEL_DIR}" \
--train_files "${DATA_FILE}" \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 32 \
--do_train \
--output_dir "${OUTPUT_DIR}" \
--evaluation_strategy steps \
--use_fast_tokenizer false \
--max_eval_samples 0 \
--learning_rate 1e-6 \
--gradient_accumulation_steps 4 \
--num_train_epochs 10 \
--warmup_ratio 0.1 \
--logging_dir "${OUTPUT_DIR}/logs" \
--logging_strategy steps \
--logging_steps 50 \
--save_strategy steps \
--preprocessing_num_workers 10 \
--save_steps 20000000 \
--eval_steps 500000000 \
--save_total_limit 2000 \
--seed 42 \
--disable_tqdm false \
--ddp_find_unused_parameters false \
--block_size 1024 \
--overwrite_output_dir \
--report_to tensorboard \
--run_name llm_pretrain \
--bf16 \
--bf16_full_eval \
--gradient_checkpointing \
--deepspeed configs/models/zero3.json \
--ignore_data_skip true \
--ddp_timeout 18000000 \
| tee -a "${LOG_FILE}"
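Progress is written both to ${OUTPUT_DIR}/train.log (via tee) and to TensorBoard event files under ${OUTPUT_DIR}/logs (per the --logging_dir and --report_to flags above). To follow training from another terminal:
# Follow the training log and inspect the TensorBoard curves.
tail -f "${OUTPUT_DIR}/train.log"
tensorboard --logdir "${OUTPUT_DIR}/logs" --port 6006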
Finetune on Geometry Data
MODEL_DIR="/path/to/your/model/directory" # Modify this path as necessary
TEXT_FILE="/path/to/your/training/data" # Modify this path as necessary
IMAGE_FOLDER="/path/to/your/image/folder" # Modify this path as necessary
OUTPUT_DIR="/path/to/your/output/directory" # Modify this path as necessary
LOG_FILE="${OUTPUT_DIR}/train.log"
GPU_DEVICES="" # Set to the GPU IDs to use, e.g. "0,1,2,3,4,5,6,7"
if [ ! -d "${OUTPUT_DIR}" ]; then
mkdir -p "${OUTPUT_DIR}"
fi
export MASTER_PORT=20728
deepspeed --include=localhost:${GPU_DEVICES} --master_port=$MASTER_PORT main/train_geox.py \
--deepspeed ./configs/models/zero2.json \
--model_name_or_path "${MODEL_DIR}" \
--version geo_v1 \
--data_path "${TEXT_FILE}" \
--image_folder "${IMAGE_FOLDER}" \
--vision_tower "${MODEL_DIR}/geo-vit.pth" \
--gsformer_path "${MODEL_DIR}/gsformer.pth" \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir "${OUTPUT_DIR}" \
--num_train_epochs 100 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 100 \
--learning_rate 3e-5 \
--weight_decay 0. \
--warmup_ratio 0.05 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 0 \
--lazy_preprocess True \
| tee -a "${LOG_FILE}"
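Checkpoints are written to ${OUTPUT_DIR} every 1,000 steps (per the --save_strategy and --save_steps flags above). Assuming train_geox.py follows the standard Hugging Face Trainer naming of checkpoint-<step> directories, you can list what has been saved so far:
# List saved checkpoints, lowest step first (assumes Hugging Face-style checkpoint-<step> directories).
ls -d "${OUTPUT_DIR}"/checkpoint-* 2>/dev/null | sort -V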
If you find our work helpful, please consider giving us a star ⭐ and citing:
@misc{xia2024geoxgeometricproblemsolving,
title={GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training},
author={Renqiu Xia and Mingsheng Li and Hancheng Ye and Wenjie Wu and Hongbin Zhou and Jiakang Yuan and Tianshuo Peng and Xinyu Cai and Xiangchao Yan and Bin Wang and Conghui He and Botian Shi and Tao Chen and Junchi Yan and Bo Zhang},
year={2024},
eprint={2412.11863},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.11863},
}
Thanks to LLaVA, LAVIS, MAE, and transformers. We borrow some of their code and checkpoints.
This code is distributed under an Apache-2.0 license. If you encounter any problems with our project, please open an issue.