- LAION-Glyph 1M
File name: LAION-Glyph-1M.json
[
{
"img_id": sample id made of two parts separated by '\t', e.g., "part-00012 00002014175"
"img_code": the base64-encoded image; use Image.open(BytesIO(base64.b64decode(img_code))) to decode the original image
"caption_origin": the original caption provided by the LAION dataset
"caption_blip": the caption generated by BLIP-2
"ocr_info": the list of detected OCR bounding boxes; the format of each box is: [
[top_left, top_right, lower_right, lower_left],
[text, confidence]
]
e.g.: [
[[[102.0, 36.0], [250.0, 36.0], [250.0, 67.0], [102.0, 67.0]], ['BALTIMORE', 0.9966500401496887]],
[[[31.0, 75.0], [321.0, 75.0], [321.0, 102.0], [31.0, 102.0]], ['BUSINESSJOURNAL', 0.9743010997772217]]
]
},
...
]
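The record format above can be read with a few small helpers. This is only a sketch: the function names (`split_img_id`, `decode_image_bytes`, `open_image`, `iter_ocr_boxes`) and the `min_confidence` parameter are hypothetical and not part of the dataset tooling, and `open_image` assumes Pillow is installed.

```python
import base64
from io import BytesIO


def split_img_id(img_id):
    """Split an img_id of the form '<part>\\t<sample>' into its two parts."""
    part, sample = img_id.split("\t", 1)
    return part, sample


def decode_image_bytes(img_code):
    """Return the raw image bytes stored in the base64 string img_code."""
    return base64.b64decode(img_code)


def open_image(img_code):
    """Decode img_code into a PIL image, as described for the img_code field."""
    # Imported lazily so the other helpers work without Pillow.
    from PIL import Image
    return Image.open(BytesIO(decode_image_bytes(img_code)))


def iter_ocr_boxes(ocr_info, min_confidence=0.0):
    """Yield (corners, text, confidence) for each OCR box.

    corners is [top_left, top_right, lower_right, lower_left];
    boxes below min_confidence (a hypothetical filter) are skipped.
    """
    for corners, (text, confidence) in ocr_info:
        if confidence >= min_confidence:
            yield corners, text, confidence
```

For example, `[text for _, text, _ in iter_ocr_boxes(sample["ocr_info"], 0.99)]` would keep only the high-confidence strings of a loaded record `sample`.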
- LAION-Glyph 10M
There are 10 files in total, each containing 1M samples in the same format as LAION-Glyph 1M.
File name: LAION-Glyph-10M_x.json (x = 0-9)
[Notes]
- Since each JSON file is large (~100GB), it is better to split each one into multiple (e.g., 10 or 100) smaller JSON files.
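One possible way to do the split without loading a whole file into memory is sketched below, using only the Python standard library. The function names, the `_partNNNN` output naming, and the `records_per_file` and `chunk_size` parameters are assumptions made for illustration. The sketch also assumes each file is a single top-level JSON array whose elements are objects, as in the format described above.

```python
import json
import os


def iter_json_array(path, chunk_size=1 << 20):
    """Yield the elements of a top-level JSON array, reading the file in chunks."""
    decoder = json.JSONDecoder()
    with open(path, "r", encoding="utf-8") as f:
        buf = f.read(chunk_size).lstrip()
        if not buf.startswith("["):
            raise ValueError("expected a top-level JSON array")
        buf = buf[1:]
        while True:
            # Skip whitespace and the commas between elements.
            buf = buf.lstrip(" \t\r\n,")
            if buf.startswith("]"):
                return
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                # The buffer holds an incomplete element: read more and retry.
                chunk = f.read(chunk_size)
                if not chunk:
                    raise
                buf += chunk
                continue
            yield obj
            buf = buf[end:]


def split_json_array(src_path, out_dir, records_per_file, chunk_size=1 << 20):
    """Write the records of src_path into smaller JSON array files in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(src_path))[0]
    out_paths, batch = [], []

    def flush():
        out_path = os.path.join(out_dir, f"{stem}_part{len(out_paths):04d}.json")
        with open(out_path, "w", encoding="utf-8") as out:
            json.dump(batch, out)
        out_paths.append(out_path)
        batch.clear()

    for record in iter_json_array(src_path, chunk_size):
        batch.append(record)
        if len(batch) == records_per_file:
            flush()
    if batch:  # write the final, possibly shorter, batch
        flush()
    return out_paths
```

For instance, `split_json_array("LAION-Glyph-1M.json", "LAION-Glyph-1M_parts", 10000)` would produce 100 files for 1M samples. Only one batch of `records_per_file` records is held in memory at a time, so a smaller batch size lowers peak memory use.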