DocLayout-YOLO/mesh-candidate_bestfit at main · opendatalab/DocLayout-YOLO


Pretraining Data Generation via Mesh-candidate Bestfit

Mesh-candidate Bestfit iteratively inserts elements from a small set of public datasets by searching for the best match between sampled candidates and the available grids in the current layout, ultimately achieving document synthesis.
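To make the "best match" idea concrete, here is a minimal sketch of a single matching step. It assumes the match score is the fill rate (candidate area divided by grid area, for candidates that fit inside the grid); that criterion, and the names `Box`, `fill_rate` and `best_match`, are illustrative assumptions and not taken from `bestfit_generator.py`. The iterative part (inserting the winner and re-computing the free grids) is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Size of a free grid cell or of a candidate element (arbitrary units)."""
    w: float
    h: float


def fill_rate(candidate: Box, grid: Box) -> float:
    """Fraction of the grid the candidate would cover; 0.0 if it does not fit."""
    if candidate.w > grid.w or candidate.h > grid.h:
        return 0.0
    return (candidate.w * candidate.h) / (grid.w * grid.h)


def best_match(candidates: List[Box], grids: List[Box]) -> Optional[Tuple[int, int, float]]:
    """Return (candidate_index, grid_index, fill_rate) of the best pair, or None."""
    best = None
    for ci, cand in enumerate(candidates):
        for gi, grid in enumerate(grids):
            rate = fill_rate(cand, grid)
            # Assumed criterion: keep the pair with the highest fill rate.
            if rate > 0.0 and (best is None or rate > best[2]):
                best = (ci, gi, rate)
    return best


candidates = [Box(4, 3), Box(2, 2), Box(9, 9)]
grids = [Box(5, 4), Box(3, 3)]
match = best_match(candidates, grids)
print(match)  # candidate 0 placed in grid 0 covers 12/20 of it
```

In a full generator, the chosen candidate would be placed, the free grids recomputed, and the search repeated until no candidate fits.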

You can generate large-scale, diverse pretraining data with our proposed Mesh-candidate Bestfit method by following the steps below:

1. Environment Setup

Install PyMuPDF via pip; it is needed later for rendering:

```
cd mesh-candidate_bestfit
pip install pymupdf==1.23.7
```

2. Preprocessing

Data Preparation

Two primary things need to be well prepared before starting generation:

1. Original Annotation File of Initial Dataset
- The annotation file follows the COCO format: a JSON file containing image and instance annotations.
- Each instance should have a unique instance_id.
- Place the file in the working directory (./).
2. Element Pool

The Element Pool is built from the annotation file: crop every annotated instance from its image and organize the crops by category. The element pool has the following structure (each folder is named after a category, and each cropped image is named by its unique instance_id):
```
./element_pool
├── advertisement
│   ├── 727.jpg
│   ├── 919.jpg
│   ├── 1423.jpg
│   └── ...
├── algorithm
│   ├── 12653.jpg
│   ├── 17485.jpg
│   ├── 44364.jpg
│   └── ...
└── ...
```
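A pool with this structure could be built along the following lines. This is a hedged sketch, not the authors' script: it assumes standard COCO keys (`images`, `annotations`, `categories`, `bbox` as `[x, y, width, height]`) and that each annotation stores its identifier under an `instance_id` key, which may differ in your file. The function names `plan_crops` and `run_crops` are hypothetical, and the actual cropping step relies on Pillow.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# (source image file, crop box as (left, upper, right, lower), destination path)
CropJob = Tuple[str, Tuple[int, int, int, int], Path]


def plan_crops(coco: Dict, pool_dir: str = "./element_pool",
               id_key: str = "instance_id") -> List[CropJob]:
    """Turn COCO-style annotations into one crop job per instance."""
    images = {img["id"]: img["file_name"] for img in coco["images"]}
    categories = {cat["id"]: cat["name"] for cat in coco["categories"]}
    seen = set()
    jobs = []
    for ann in coco["annotations"]:
        instance_id = ann[id_key]  # assumed key name, adjust to your file
        if instance_id in seen:
            raise ValueError(f"duplicate {id_key}: {instance_id}")
        seen.add(instance_id)
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        box = (int(x), int(y), int(x + w), int(y + h))
        dest = Path(pool_dir) / categories[ann["category_id"]] / f"{instance_id}.jpg"
        jobs.append((images[ann["image_id"]], box, dest))
    return jobs


def run_crops(jobs: List[CropJob], image_root: str) -> None:
    """Execute the crop jobs. Requires Pillow (pip install pillow)."""
    from PIL import Image  # imported lazily so planning works without Pillow

    for file_name, box, dest in jobs:
        dest.parent.mkdir(parents=True, exist_ok=True)
        with Image.open(Path(image_root) / file_name) as img:
            img.convert("RGB").crop(box).save(dest)


example = {
    "images": [{"id": 1, "file_name": "page_0001.jpg"}],
    "categories": [{"id": 7, "name": "advertisement"}],
    "annotations": [
        {"instance_id": 727, "image_id": 1, "category_id": 7, "bbox": [10, 20, 100, 50]},
    ],
}
jobs = plan_crops(example)
print(jobs[0])
```

Splitting planning from cropping makes it possible to check the uniqueness of the identifiers and the destination paths before any image is touched.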
Note: For convenience, we provide the original annotation file and the element pool for the M6Doc-test dataset, which can be downloaded from annotation file and element pool, respectively. Run the command below to decompress the element pool:
```
unzip /path/to/your/element_pool.zip -d ./element_pool/
```
Data Augmentation (Optional)

To apply our augmentation pipeline to your element pool, run:
```
python augmentation.py --min_count 100 --aug_times 10
```
The script runs the augmentation pipeline aug_times times on each element of every category that has fewer than min_count elements. To generate a large amount of data, use a larger aug_times; to shorten the process, use a smaller one. For DocSynth300K generation, we used --aug_times 50.
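The selection rule can be illustrated with a small sketch. It only estimates how many augmented images each category would receive, assuming every pass over an element yields exactly one new image; the function name `augmentation_plan` and the example counts are hypothetical and not part of `augmentation.py`.

```python
from typing import Dict


def augmentation_plan(counts: Dict[str, int], min_count: int, aug_times: int) -> Dict[str, int]:
    """Per category with fewer than `min_count` elements, the number of augmented images."""
    return {
        category: n * aug_times  # assumption: one new image per element per pass
        for category, n in counts.items()
        if n < min_count
    }


counts = {"advertisement": 30, "algorithm": 250, "footnote": 99}
plan = augmentation_plan(counts, min_count=100, aug_times=10)
print(plan)  # {'advertisement': 300, 'footnote': 990}
```

Categories that already reach min_count (here `algorithm`) are left untouched, so the option mainly rebalances rare categories.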
Map Dict

To facilitate the random selection of candidates during the rendering phase, build a mapping from each candidate element to all of its candidate paths (pass --use_aug if augmentation was applied):
```
python map_dict.py --save_path ./map_dict.json --use_aug
```
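The kind of mapping this step produces might look like the sketch below. The real JSON layout written by `map_dict.py` is not documented here, so everything in the example is an assumption: the keys (`category/instance_id`), the hypothetical naming of augmented copies (`727_aug0.jpg` for `727.jpg`), and the function name `build_map_dict`.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def build_map_dict(pool_dir: str, use_aug: bool = False) -> Dict[str, List[str]]:
    """Map "category/instance_id" to every image path that can stand in for it."""
    mapping = defaultdict(list)
    for path in sorted(Path(pool_dir).glob("*/*.jpg")):
        stem = path.stem
        is_aug = "_aug" in stem  # hypothetical naming of augmented copies
        if is_aug and not use_aug:
            continue
        instance_id = stem.split("_aug")[0]
        mapping[f"{path.parent.name}/{instance_id}"].append(path.as_posix())
    return dict(mapping)


# Demo on a throwaway pool with one augmented copy.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("727.jpg", "727_aug0.jpg", "919.jpg"):
        target = Path(tmp) / "advertisement" / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    plain = build_map_dict(tmp)
    with_aug = build_map_dict(tmp, use_aug=True)
    (Path(tmp) / "map_dict.json").write_text(json.dumps(with_aug, indent=2))

print(sorted(with_aug))
```

With such a mapping, a renderer can pick an element and then sample uniformly among its original and augmented images.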

3. Layout Generation

Now you can generate diverse layouts using the Mesh-candidate Bestfit algorithm. To prevent process blocking, the generator saves the result of each layout separately as soon as it is ready; you can then use the combine_layouts.py script to merge them all into a single file:

```
python bestfit_generator.py --generate_num 100 --n_jobs 5 --json_path ./annotation_file.json --output_dir ./generated_layouts/seperate
python combine_layouts.py --seperate_layouts_dir ./generated_layouts/seperate --save_path ./generated_layouts/combined_layouts.json
```

Afterwards, feel free to delete the separate layouts, since they are no longer needed.
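Conceptually, the combining step gathers many small JSON files into one. The sketch below shows that idea under the assumption that each layout is stored as its own `.json` file and that the combined file is a plain JSON list; the actual schema used by `combine_layouts.py` may differ, and `combine_json_files` is a hypothetical name.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, List


def combine_json_files(src_dir: str, save_path: str) -> List[Any]:
    """Load every *.json file in `src_dir` (sorted by name) and save them as one list."""
    layouts = [json.loads(p.read_text()) for p in sorted(Path(src_dir).glob("*.json"))]
    Path(save_path).parent.mkdir(parents=True, exist_ok=True)
    Path(save_path).write_text(json.dumps(layouts))
    return layouts


# Demo with three tiny placeholder layouts in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "seperate"
    src.mkdir()
    for i in range(3):
        (src / f"layout_{i}.json").write_text(json.dumps({"layout_id": i}))
    out = Path(tmp) / "combined_layouts.json"
    combined = combine_json_files(str(src), str(out))
    reloaded = json.loads(out.read_text())

print(len(combined))  # 3
```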

Note: Layout generation uses multiprocessing, so set --n_jobs appropriately to avoid process blocking.

4. Rendering

Finally, render the generated layouts and save the results in YOLO format with the command below:

```
python rendering.py --json_path ./generated_layouts/combined_layouts.json --n_jobs 5 --map_dict_path ./map_dict.json --save_dir ./generated_dataset
```
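For readers unfamiliar with YOLO labels: each object is usually written as one text line, `class cx cy w h`, with the box center and size normalized by the image dimensions. The sketch below converts a COCO-style pixel box into such a line. It illustrates the common convention only; the exact files and directory layout that `rendering.py` writes are not shown here, and `coco_box_to_yolo_line` is a hypothetical helper.

```python
from typing import Sequence


def coco_box_to_yolo_line(class_id: int, bbox: Sequence[float],
                          img_w: float, img_h: float) -> str:
    """Convert a COCO [x, y, width, height] pixel box into a YOLO label line."""
    x, y, w, h = bbox
    cx = (x + w / 2) / img_w  # normalized box center, x
    cy = (y + h / 2) / img_h  # normalized box center, y
    return f"{class_id} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}"


line = coco_box_to_yolo_line(3, [100, 200, 300, 400], img_w=1000, img_h=2000)
print(line)  # 3 0.250000 0.200000 0.300000 0.200000
```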

Visualization

We provide visualize.ipynb to visualize the layouts generated by our proposed method.
