LAION-Glyph Dataset

  • LAION-Glyph 1M

File name: LAION-Glyph-1M.json.

[
    {
        "img_id": sample id, with '\t' separating its two parts, e.g., "part-00012      00002014175"

        "img_code": the base64-encoded image; use Image.open(BytesIO(base64.b64decode(img_code))) to decode the original image

        "caption_origin": the original caption provided by the LAION dataset

        "caption_blip": the caption generated by BLIP-2

        "ocr_info": a list of detected OCR bounding boxes; the format of each box is: [
            [top_left, top_right, lower_right, lower_left],
            [text, confidence]
        ]
        e.g., [
            [[[102.0, 36.0], [250.0, 36.0], [250.0, 67.0], [102.0, 67.0]], ['BALTIMORE', 0.9966500401496887]],
            [[[31.0, 75.0], [321.0, 75.0], [321.0, 102.0], [31.0, 102.0]], ['BUSINESSJOURNAL', 0.9743010997772217]]
            ]
    },
    ...
]
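The fields above can be read with a few small helpers. The following is a minimal sketch, not part of the dataset tooling: it assumes one sample has already been loaded as a Python dict, that the two parts of `img_id` are separated by a real tab character, and that each corner is an `[x, y]` pair as in the example. The helper names (`parse_img_id`, `decode_image`, `ocr_boxes`) and the `min_confidence` parameter are hypothetical.

```python
import base64
from io import BytesIO


def parse_img_id(img_id):
    """Split an img_id such as 'part-00012\t00002014175' into its two parts."""
    first, second = img_id.split("\t")
    return first, second


def decode_image(img_code):
    """Decode the base64 string into a PIL image, following the recipe in the field description."""
    # Pillow is a third-party package; it is imported here so the other
    # helpers stay usable without it.
    from PIL import Image
    return Image.open(BytesIO(base64.b64decode(img_code)))


def ocr_boxes(ocr_info, min_confidence=0.0):
    """Return (text, confidence, (x_min, y_min, x_max, y_max)) for each OCR box.

    Each entry of ocr_info is [[top_left, top_right, lower_right, lower_left],
    [text, confidence]]; the four corners are reduced to an axis-aligned box.
    """
    results = []
    for corners, (text, confidence) in ocr_info:
        if confidence < min_confidence:
            continue  # hypothetical filter, not something the dataset applies
        xs = [x for x, _ in corners]
        ys = [y for _, y in corners]
        results.append((text, confidence, (min(xs), min(ys), max(xs), max(ys))))
    return results
```

For a loaded sample `s`, `decode_image(s["img_code"])` returns the image (this requires Pillow), and `ocr_boxes(s["ocr_info"], min_confidence=0.9)` keeps only the boxes detected with high confidence.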
  • LAION-Glyph 10M

There are 10 files in total. Each contains 1M samples in the same format as LAION-Glyph 1M. File names: LAION-Glyph-10M_x.json (x = 0-9).

[Notes]

  • Since each JSON file is large (~100GB), it is better to split each one into multiple (e.g., 10 or 100) smaller JSON files.
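One possible way to do the split without loading a whole file into memory is sketched below, using only the Python standard library. It assumes each file is a valid JSON array whose elements are objects, as the format description suggests. The function names, the `_partNNNNN` output naming, and the `samples_per_file` and `chunk_size` parameters are hypothetical choices, not something the dataset prescribes.

```python
import json
from pathlib import Path


def iter_json_array(path, chunk_size=1 << 20):
    """Yield the elements of a top-level JSON array one at a time.

    The file is read in chunks; an element is decoded as soon as the buffer
    holds it completely. Assumes the elements are JSON objects (an incomplete
    object never decodes, so chunk boundaries cannot produce a wrong value).
    """
    decoder = json.JSONDecoder()
    with open(path, "r", encoding="utf-8") as f:
        buf = f.read(chunk_size).lstrip()
        if not buf.startswith("["):
            raise ValueError("expected a top-level JSON array")
        buf = buf[1:]
        while True:
            buf = buf.lstrip(" \t\r\n,")  # skip separators between elements
            if buf.startswith("]"):
                return  # end of the array
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                chunk = f.read(chunk_size)
                if not chunk:
                    raise  # truncated or invalid file
                buf += chunk  # element incomplete: read more and retry
                continue
            yield obj
            buf = buf[end:]


def split_json_array(src, out_dir, samples_per_file=10_000):
    """Write the samples of src into smaller JSON array files; return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(src).stem
    paths, out, count = [], None, 0
    for sample in iter_json_array(src):
        if out is None:  # start a new output file
            path = out_dir / f"{stem}_part{len(paths):05d}.json"
            out = open(path, "w", encoding="utf-8")
            out.write("[")
            paths.append(path)
            count = 0
        if count:
            out.write(",")
        json.dump(sample, out)  # samples are streamed out, never batched in memory
        count += 1
        if count >= samples_per_file:
            out.write("]")
            out.close()
            out = None
    if out is not None:  # close the last, partially filled file
        out.write("]")
        out.close()
    return paths
```

With 1M samples per file, `samples_per_file=10_000` gives 100 output files and `samples_per_file=100_000` gives 10. Each output file is itself a JSON array in the same format, so code written for the original files should work on the parts unchanged.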