LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation

The official source code for LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation, accepted at CVPR 2024.

Overview

Addressing two issues inherent in the conventional approach (Parser + Knowledge Base (WordNet)):

  • Semantic Over-simplification (Step 2)
    The standard scene graph parser commonly converts fine-grained predicates into coarse-grained ones, which we refer to as semantic over-simplification. For example, in Figure (c), the informative predicate lying on in the image caption is undesirably converted into the less informative predicate on, because the rule-based scene parser fails to capture the predicate lying on as a whole, and its heuristic rules fall short of accommodating the diverse structures of captions. As a result, as shown in Figure (b), the predicate distribution is long-tailed. To make matters worse, 12 out of the 50 predicate classes never appear, which means these 12 predicates can never be predicted.

  • Low-density Scene Graph (Step 3)
    The triplet alignment based on a knowledge base (i.e., WordNet) leads to low-density scene graphs, i.e., the number of triplets remaining after Step 3 is small. Specifically, a triplet is discarded if any of its three components (i.e., subject, predicate, object), or their synonyms/hypernyms/hyponyms, fails to align with the entity or predicate classes in the target data. For example, in Figure (d), the triplet <elephant, carrying, log> is discarded because neither log nor its synonyms/hypernyms exist in the target data, even though elephant and carrying do. As a result, a large number of triplets is discarded, resulting in poor generalization and performance degradation. This is attributed to the fact that the static structured knowledge of a KB is insufficient to cover the semantic relationships among a wide range of words. The sketch following this list illustrates the alignment step.
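
To make the low-density issue concrete, below is a minimal sketch of the KB-based alignment in Step 3, written against NLTK's WordNet interface. The helper names and the toy class list are illustrative, not the repository's actual code.

# Illustrative sketch of the Step-3 alignment; helper names and the toy
# class list are hypothetical, not this repository's code.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

TARGET_ENTITIES = {"elephant", "tree", "man"}  # toy subset of the target entity classes

def related_lemmas(word):
    """The word itself plus its synonyms, hypernyms, and hyponyms from WordNet."""
    lemmas = {word}
    for synset in wn.synsets(word, pos=wn.NOUN):
        for related in [synset] + synset.hypernyms() + synset.hyponyms():
            lemmas.update(l.replace("_", " ") for l in related.lemma_names())
    return lemmas

def align_entity(word):
    """Return a matching target class, or None if the triplet must be discarded."""
    matches = related_lemmas(word) & TARGET_ENTITIES
    return next(iter(matches), None)

# <elephant, carrying, log>: elephant aligns directly, but log has no
# synonym/hypernym/hyponym among the target classes, so the triplet is lost.
print(align_entity("elephant"))  # 'elephant'
print(align_entity("log"))       # None -> triplet discarded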

Proposed Approach: LLM4SGG

To alleviate the two aforementioned issues, we adopt a pre-trained Large Language Model (LLM). Inspired by the idea of Chain-of-Thought (CoT), which arrives at an answer in a stepwise manner, we separate the triplet formation process into two chains: Chain-1 replaces the rule-based parser in Step 2, and Chain-2 replaces the KB in Step 3.

As the LLM, we employ gpt-3.5-turbo from ChatGPT.
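
As a rough illustration of the two-chain idea, the sketch below runs a caption through two prompts with gpt-3.5-turbo. The prompt wording is illustrative rather than the paper's actual prompts (see triplet_extraction_process/README.md for those), and it assumes the legacy openai<1.0 Python client.

# Minimal two-chain sketch; the prompts are illustrative, not the paper's.
import openai

openai.api_key = "YOUR_API_KEY"

def chat(prompt):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]

caption = "An elephant lying on a log in the forest."

# Chain-1: replaces the rule-based parser (Step 2) - extract triplets.
triplets = chat(
    "Extract (subject, predicate, object) triplets from the caption, "
    "keeping fine-grained predicates such as 'lying on':\n" + caption
)

# Chain-2: replaces the KB (Step 3) - align components with target classes.
aligned = chat(
    "Align each triplet's subject, predicate, and object with the closest "
    "Visual Genome entity and predicate classes:\n" + triplets
)
print(aligned)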

TODO List

  • Release prompts and code for training the model with the Conceptual Caption dataset
  • Release enhanced scene graph datasets for Conceptual Caption
  • Release prompts and code for training the model with the Visual Genome caption dataset
  • Release enhanced scene graph datasets for Visual Genome caption

Installation

conda create -n llm4sgg python=3.9.0 -y
conda activate llm4sgg

pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install openai einops shapely timm yacs tensorboardX ftfy prettytable pymongo tqdm numpy python-magic pandas
pip install transformers==4.35.0

Once the packages have been installed, please run the setup.py file:

python setup.py build develop --user
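
To confirm that the environment matches the pins above, a quick sanity check (expected values follow the install commands):

# Verify the pinned environment before training.
import torch, torchvision, transformers

print(torch.__version__)          # expected: 1.10.0+cu111
print(torchvision.__version__)    # expected: 0.11.0+cu111
print(transformers.__version__)   # expected: 4.35.0
print(torch.cuda.is_available())  # should be True for GPU training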

Dataset

Please refer to dataset/README.md

Triplet Extraction Process

You can find a detailed explanation of the triplet extraction process in triplet_extraction_process/README.md.

Train

The detailed paths of the localized triplets are listed in the maskrcnn_benchmark/config/paths_catalog.py file.
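
For reference, entries in that catalog follow the standard maskrcnn_benchmark pattern sketched below; the dataset key and file names here are hypothetical, so consult the actual file for the exact entries.

# Hypothetical excerpt in the maskrcnn_benchmark catalog style; the real
# keys and file names in this repository may differ.
class DatasetCatalog:
    DATA_DIR = "datasets"
    DATASETS = {
        "coco_caption_localized_triplets": {             # hypothetical key
            "img_dir": "coco/images",                    # image directory
            "ann_file": "coco/localized_triplets.json",  # hypothetical annotation file
        },
    }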

Test set: VG

Models trained on caption datasets (e.g., COCO, CC, and VG Caption) are evaluated on the VG test set.

The required files (i.e., the localized triplets produced by LLM4SGG) and the pre-trained model (i.e., GLIP) will be downloaded automatically to facilitate your setup. Simply change the DATASET name as needed.

※ The files required for the grounded scene graphs may fail to download due to a web error. If you encounter this problem, please visit https://huggingface.co/datasets/kb-kim/LLM4SGG and download the files directly.
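
Alternatively, the files can be fetched programmatically; a minimal sketch using huggingface_hub (the local directory name is your choice):

# Download the LLM4SGG dataset files directly from the Hugging Face Hub.
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="kb-kim/LLM4SGG",
    repo_type="dataset",
    local_dir="LLM4SGG_files",  # destination directory of your choice
)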

Single GPU

# DATASET: coco, cc, or vgcaption
bash scripts/single_gpu/train_{DATASET}4vg.sh

Multi GPU

# DATASET: coco, cc, or vgcaption
bash scripts/multi_gpu/train_{DATASET}4vg.sh

If you want to train the model with the reweighting strategy, please run:

# Training data: COCO
bash scripts/{multi_gpu or single_gpu}/train_coco4vg_rwt.sh

Test set: GQA

bash scripts/{multi_gpu or single_gpu}/train_coco4gqa.sh

Test

# Please change the model checkpoint in the test.sh file
bash scripts/test.sh 

We also provide pre-trained models and other results. The links for the grounded scene graphs point to Google Drive.

  • COCO → VG test
  • VG Caption → VG test
  • CC → VG test
  • COCO → GQA test

Citation

@InProceedings{Kim_2024_CVPR,
    author    = {Kim, Kibum and Yoon, Kanghoon and Jeon, Jaehyeong and In, Yeonjun and Moon, Jinyoung and Kim, Donghyun and Park, Chanyoung},
    title     = {LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {28306-28316}
}

Acknowledgement

The code is developed on top of VS3.
