
# PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

This repository is for the paper "PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM" (under review).

*(Figure: framework overview)*

## Usage Acknowledgement

Note that the proposed dataset is authorized for scientific research only. It must NOT be used for commercial purposes without our authorization.

## 🗓️ Schedule

- [2024.03.26] Release online demo and pre-trained model on Hugging Face 🤗.
- [2024.06.05] Release arXiv paper 📝.
- [2024.07.04] Release QB-Poster dataset 📊. (The raw files contain the original poster images and JSON annotations; inpainting and saliency detection are needed to obtain the background images and saliency maps. Our paper used LaMa for inpainting and BASNet for saliency detection.)
- [2024.07.04] Release User-Constrained dataset 📊. (Includes only the user-constraint annotation files; please refer to the CGL-dataset and PosterLayout dataset for the poster images and bounding-box annotations.)
- [2024.07.04] Release data pre-processing, training, and inference code.
- [Coming Soon] Release evaluation code.

## Environment

Run the following commands to build the environment.

```shell
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```

## Data Processing

Download the dataset files and arrange them as follows (QB-Poster as an example). Run the saliency detection method to obtain `saliency_map`, and the inpainting method to obtain `inpainted_1x` and `inpainted_1d5x` (used for inference and training, respectively). Note that for training we randomly inpaint 0.5x additional area beyond the ground-truth bounding boxes to avoid overfitting.

```
├── data
│   ├── prompt_template.txt
│   └── qbposter <--
│       ├── get_prompt.py
│       └── raw
│           ├── original_poster
│           ├── saliency_map
│           ├── inpainted_1x
│           ├── inpainted_1d5x
│           └── annotation.json
...
└── README.md
```
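The 0.5x over-inpainting used for training can be sketched as a random box expansion. `expand_bbox` below is a hypothetical helper illustrating the `inpainted_1d5x` idea, not the repository's actual mask-generation code.

```python
import random

def expand_bbox(x, y, w, h, img_w, img_h, area_scale=1.5, rng=random):
    """Randomly grow a box so its area is ~area_scale times the original.

    Hypothetical helper: the repository's real mask-generation logic may differ.
    """
    # Split the extra area between width and height at a random ratio,
    # so that sw * sh == area_scale regardless of r.
    r = rng.uniform(0.0, 1.0)
    sw = area_scale ** r       # width scale
    sh = area_scale / sw       # height scale
    nw, nh = w * sw, h * sh
    # Re-center the grown box, then clamp it to the image bounds.
    nx = max(0.0, x - (nw - w) / 2)
    ny = max(0.0, y - (nh - h) / 2)
    nw = min(nw, img_w - nx)
    nh = min(nh, img_h - ny)
    return nx, ny, nw, nh
```

Randomizing how the extra area is split between width and height keeps the model from learning the exact silhouette of the inpainted region.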

Run the data preprocessing script.

```shell
python data/qbposter/get_prompt.py
```

Ultimately, you will get two processed JSON files (each containing instruction-answer pairs), arranged like this:

```
├── data
│   ├── prompt_template.txt
│   └── qbposter
│       ├── get_prompt.py
│       ├── qbposter_train_instruct.json <--
│       └── qbposter_val_instruct.json   <--
...
└── README.md
```
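For orientation, LLaVA-style instruction data is typically a JSON array of records shaped like the sketch below. The concrete field names, prompt wording, and file paths here are assumptions for illustration; the exact format is whatever `get_prompt.py` emits.

```python
import json

# Illustrative record only: the fields, prompt text, and image path are
# assumptions, not necessarily what get_prompt.py produces.
record = {
    "id": "qbposter_0000",
    "image": "qbposter/raw/inpainted_1d5x/0000.png",
    "conversations": [
        # The human turn carries the image token plus the layout instruction.
        {"from": "human", "value": "<image>\nArrange the given elements on the poster ..."},
        # The answer is the ground-truth layout serialized as JSON.
        {"from": "gpt", "value": '[{"category": "title", "box": [0.1, 0.05, 0.8, 0.1]}]'},
    ],
}

# A training file would then be a JSON array of such records.
assert json.loads(json.dumps([record])) == [record]
```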

## Training

Please download the LLaVa-v1.5 pre-trained checkpoint and the CLIP vision encoder first, and put them in the `huggingface` subfolder.

```
├── data
├── huggingface <--
│   ├── llava-v1.5-7b
│   └── clip-vit-large-patch14-336
├── scripts
│   └── qbposter
│       ├── finetune.sh <--
│       └── inference.sh
...
└── README.md
```

Then run the following script.

```shell
bash scripts/qbposter/finetune.sh
```

## Inference

Please download the pre-trained PosterLLaVa_v0 checkpoint, which is initialized from the LLaVa-v1.5 checkpoint and fine-tuned on the following combined datasets:

- 7k banner layouts from the Ad Banner dataset.
- 60k commercial poster layouts with text constraints from the CGL-dataset and PosterLayout.
- 4k social media poster layouts from the QB-Poster dataset.

Put it in the `pretrained_model` subfolder.

```
├── data
├── huggingface
├── pretrained_model <--
│   └── posterllava_v0
├── scripts
│   └── qbposter
│       ├── finetune.sh
│       └── inference.sh <--
...
└── README.md
```

Then run the following script to generate layouts in JSON format.

```shell
bash scripts/qbposter/inference.sh
```
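The generated layout can then be mapped back to pixel coordinates for rendering. A minimal sketch, assuming each element in the output JSON carries a category label and a normalized `[left, top, width, height]` box; the real output schema is defined by `data/prompt_template.txt` and may differ.

```python
import json

def to_pixel_boxes(layout_json, canvas_w, canvas_h):
    """Convert normalized [x, y, w, h] boxes to integer pixel boxes.

    Assumes a hypothetical schema of {"category": str, "box": [x, y, w, h]}.
    """
    boxes = []
    for el in json.loads(layout_json):
        x, y, w, h = el["box"]
        boxes.append((el["category"],
                      round(x * canvas_w), round(y * canvas_h),
                      round(w * canvas_w), round(h * canvas_h)))
    return boxes

# Toy example with a made-up single-element layout on a 640x960 canvas:
demo = '[{"category": "title", "box": [0.1, 0.05, 0.8, 0.1]}]'
print(to_pixel_boxes(demo, 640, 960))  # [('title', 64, 48, 512, 96)]
```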

## Evaluation

Coming Soon...

## Citation

If you find this project or paper useful, please give us a star or a citation.

```bibtex
@misc{yang2024posterllava,
      title={PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM},
      author={Tao Yang and Yingmin Luo and Zhongang Qi and Yang Wu and Ying Shan and Chang Wen Chen},
      year={2024},
      eprint={2406.02884},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.02884},
}
```
