WEBVICOB 🕸 : Web-based Visual Corpus Builder


Official Implementation of Web-based Visual Corpus Builder (WEBVICOB)

Introduction

WEBVICOB 🕸, Web-based Visual Corpus Builder, is a dataset generator that can readily construct a large-scale visual corpus (i.e., images with text annotations) from a raw Wikipedia HTML dump. The constructed visual corpora can be utilized in building Visual Document Understanding (VDU) backbones. Our academic paper, which describes our engine in detail and provides full experimental results and analyses, can be found here:

On Web-based Visual Corpus Construction for Visual Document Understanding.
Donghyun Kim, Teakgyu Hong, Moonbin Yim, Yoonsik Kim and Geewook Kim. In ICDAR 2023.

[Figure: example of the annotations produced by WEBVICOB]

Updates

2023-05-03 Our paper has been accepted at ICDAR 2023. An updated version of the paper has been published on arXiv.
2023-02-11 Added the HTML section chunker and fixed a memory-leak issue.
2022-11-08 Paper published on arXiv.
2022-11-04 First commit; we release the codebase.

How to Use

Environment

python >= 3.8
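
If you prefer an isolated setup, a minimal sketch using the standard venv module (the interpreter name python3.8 and the environment path .venv are assumptions; any Python >= 3.8 works):

$ python3.8 -m venv .venv
$ source .venv/bin/activate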

Init submodule (google font)

We use Google Fonts (font/google) to diversify the visual appearance of rendered pages.
Initialize the submodule if you want to use them.

$ git submodule update --init --recursive
Install dependencies (tested on Ubuntu 18.04)
$ bash install_dependencies.sh
Download chrome driver

You can download various versions of ChromeDriver from here. Note that the ChromeDriver version must match the Chrome version installed on your system.

$ google-chrome --version
Google Chrome 106.0.5249.103 
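
As a hedged example for Linux, drivers for Chrome versions up to 114 can be fetched from the legacy chromedriver.storage.googleapis.com endpoint (the major version 106 and the target directory below are assumptions; adjust them to your setup):

$ CHROME_MAJOR=106   # major version reported by `google-chrome --version`
$ DRIVER_VERSION=$(curl -s "https://chromedriver.storage.googleapis.com/LATEST_RELEASE_${CHROME_MAJOR}")
$ curl -sO "https://chromedriver.storage.googleapis.com/${DRIVER_VERSION}/chromedriver_linux64.zip"
$ unzip chromedriver_linux64.zip -d resources/

This places the binary at resources/chromedriver, which matches the default chrome_path in the options table below.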
Install python packages
$ pip install -U six wheel setuptools
$ pip install -r requirements.txt

Run

JUST DO IT FIRST !! RUN THE FOLLOWING SCRIPT !!
To visualize outputs, use the "debug" option.

$ PYTHONPATH=$PWD python webvicob/wikipedia/wikipedia.py \
    --chrome_path=/path/to/your/chrome/driver \
    --workspace=./resources/workspace_example \
    --target_lang=en \
    --num_train=10 \
    --num_val=1 \
    --num_test=1 \
    --debug=True

Available options

| option | default | description |
| --- | --- | --- |
| workspace (str) | ./ | Directory to load JSON files from and save the LMDB to. |
| chrome_path (str) | resources/chromedriver | Path to your ChromeDriver. |
| target_lang (str) | ja | Target Wikipedia language code (whatever you want). |
| num_train (int) | -1 | Number of train samples. |
| num_val (int) | 0 | Number of val samples. |
| num_test (int) | 0 | Number of test samples. |
| debug (bool) | False | Debug option. |
| num_process (int) | -1 | Number of processes. -1 means the os.cpu_count() value is used. |
| shrink_heuristic (bool) | True | Use heuristic shrinking of character boxes. |
| remove_background (bool) | True | Remove the background image of the HTML. |
| unroll_contents (bool) | False | Unroll HTML contents. |
| change_para_font (bool) | True | Change paragraph fonts using Google Fonts. |
| sleep_time (int) | 1 | Sleep time (seconds) for every render. |
| capture_widths (tuple[int]) | (800, 1200, 1600) | Capture width is randomly selected from this tuple and determines the width of the browser during rendering. This is different from final_width, which resizes the finally rendered image and annotations. |
| capture_height_limit (int) | 16384 | Skip rendering if the rendered page's height exceeds this limit. |
| final_width (int) | None | Width to which the final image is resized (useful when you do not have a lot of storage). |
| chunk_idx (int) | None | Chunk index of json_list. Useful when you have multiple machines. |
| total_chunk (int) | None | Total number of chunks of json_list. |
| html_section_chunker (bool) | True | Chunk the HTML by section. Very useful when a page has a lot of content. The experiments in the paper did not use this option. |
| font_dir_path (str) | font_dir_path | Font directory path. |
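
For example, combining the documented options above, one machine out of four could process its share of the json_list with 8 worker processes and resize saved images to a width of 1280. The paths and values below are placeholders, and chunk_idx=0 assumes 0-based indexing:

$ PYTHONPATH=$PWD python webvicob/wikipedia/wikipedia.py \
    --chrome_path=/path/to/your/chrome/driver \
    --workspace=/path/to/your/workspace \
    --target_lang=en \
    --num_process=8 \
    --final_width=1280 \
    --chunk_idx=0 \
    --total_chunk=4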

Prepare Dataset

We provide sample ndjson files in resources/workspace_example.
Each sample ndjson file contains 100 samples.

If you want to build from the full crawled data, download the ndjson files ([lang]wiki-NS0-[version]-ENTERPRISE-HTML.json.tar.gz) from https://dumps.wikimedia.org/other/enterprise_html/runs and untar them into [your workspace path]/raw, as sketched below.
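
As a rough sketch for an English dump (the [version] placeholder and workspace path are assumptions; check the runs page for the actual dump dates):

$ mkdir -p /path/to/your/workspace/raw
$ wget https://dumps.wikimedia.org/other/enterprise_html/runs/[version]/enwiki-NS0-[version]-ENTERPRISE-HTML.json.tar.gz
$ tar -xzf enwiki-NS0-[version]-ENTERPRISE-HTML.json.tar.gz -C /path/to/your/workspace/raw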

Visualization

[Figures: example annotations visualized at the character, word, line, and paragraph levels, alongside the rendered image]

How to Cite

If you find this work useful to you, please cite:

@InProceedings{kim2023web,
  title     = {On Web-based Visual Corpus Construction for Visual Document Understanding},
  author    = {Kim, Donghyun and Hong, Teakgyu and Yim, Moonbin and Kim, Yoonsik and Kim, Geewook},
  booktitle = {Document Analysis and Recognition - ICDAR 2023},
  year      = {2023},
}

How to Contribute

Please use pre-commit, which runs Black and isort.

$ pip install pre-commit
$ pre-commit install
Step by step
  1. Open a new issue.
  2. Match the code style (black, isort) by running the following in the webvicob directory:
     $ black .
     $ isort --profile black .
  3. Write test code.
  4. Create a branch ([date]_[whatever]).
  5. Delete the branch after Squash & Merge.

Required approvals: 1

License

WEBVICOB is licensed under Apache-2.0, except for resources/workspace_example/raw, which is adapted from https://dumps.wikimedia.org/other/enterprise_html/ under CC BY-SA 3.0. See LICENSE for the full license text.

WEBVICOB
Copyright 2022-present NAVER Corp.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.