Dev/multimodal format & basic processing pipeline for multimodal datasets support #64

HYLcool · 2023-11-02T10:04:21Z

- Support the basic Data-Juicer intermediate format for multimodal datasets. The format details will be documented later.
  - Support image loading, new configs, new special tokens, and a new OP abstraction.
- Add a basic image OP, image_aspect_ratio_filter, for basic pipeline testing.
- Add two format conversion tools, llava2dj and dj2llava, which convert LLaVA-like datasets to the Data-Juicer format and back, for basic pipeline testing.
- Add more details to the error message when a dataset_path is invalid.
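
For readers unfamiliar with the idea behind image_aspect_ratio_filter, here is a minimal sketch of an aspect-ratio filter. It is not the code from this PR: the function names, the `min_ratio`/`max_ratio`/`any_or_all` parameters, their defaults, and the choice to keep text-only samples are all assumptions made for illustration. It works on (width, height) pairs so that it does not depend on any image library.

```python
from typing import Iterable, Tuple


def aspect_ratio(width: int, height: int) -> float:
    """Return width / height; reject non-positive dimensions."""
    if width <= 0 or height <= 0:
        raise ValueError(f"invalid image size: {width}x{height}")
    return width / height


def keep_sample(
    image_sizes: Iterable[Tuple[int, int]],
    min_ratio: float = 0.333,   # hypothetical default
    max_ratio: float = 3.0,     # hypothetical default
    any_or_all: str = "any",    # hypothetical parameter name
) -> bool:
    """Decide whether a sample survives, given the sizes of its images.

    "any": keep the sample if at least one image is within range.
    "all": keep the sample only if every image is within range.
    """
    sizes = list(image_sizes)
    if not sizes:
        # Assumption: a text-only sample has nothing to filter on, so keep it.
        return True
    in_range = [min_ratio <= aspect_ratio(w, h) <= max_ratio for w, h in sizes]
    if any_or_all == "any":
        return any(in_range)
    if any_or_all == "all":
        return all(in_range)
    raise ValueError(f"any_or_all must be 'any' or 'all', got {any_or_all!r}")
```

In a real pipeline the sizes would come from the loaded images; here they are passed in directly so the decision rule can be checked in isolation.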

HYLcool · 2023-11-02T10:06:18Z

This PR is currently under internal testing. Please don't merge it until the testing is done.

- ds2dj -> dj-process -> dj2ds pipeline validation
- ds2dj -> dj-process -> dj2ds -> training pipeline validation
  - pretrain
  - finetuning

Update 2023/11/06: all pipeline validations have passed; the processed and converted datasets are accepted by the target model.
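
The ds2dj -> dj2ds part of this validation can be pictured as a round-trip check: convert a LLaVA-like sample to an intermediate form and back, then compare it with the original. The sketch below is only an illustration under stated assumptions. The intermediate layout (`text` plus `images`), the `[[speaker]]:` prefix, and both special tokens are hypothetical placeholders, not the format or tokens defined by this PR.

```python
from typing import Any, Dict

# Hypothetical placeholder tokens, chosen for this sketch only.
IMAGE_TOKEN = "<__image__>"
TURN_END = "<__turn_end__>"


def llava_to_dj(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a LLaVA-like conversation into one text field plus an image list."""
    turns = []
    for turn in sample["conversations"]:
        value = turn["value"].replace("<image>", IMAGE_TOKEN)
        turns.append(f"[[{turn['from']}]]: {value}")
    return {
        "id": sample["id"],
        "text": TURN_END.join(turns),
        # Text-only samples simply get an empty image list.
        "images": [sample["image"]] if "image" in sample else [],
    }


def dj_to_llava(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Invert llava_to_dj; assumes turn values never contain TURN_END."""
    conversations = []
    for chunk in sample["text"].split(TURN_END):
        speaker, value = chunk.split("]]: ", 1)
        conversations.append({
            "from": speaker[2:],  # drop the leading "[["
            "value": value.replace(IMAGE_TOKEN, "<image>"),
        })
    out: Dict[str, Any] = {"id": sample["id"], "conversations": conversations}
    if sample["images"]:
        out["image"] = sample["images"][0]
    return out
```

A round trip such as `dj_to_llava(llava_to_dj(sample)) == sample` is the kind of check this sketch is meant to support; the training-side validation mentioned above is out of its scope.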


zhijianma

LGTM

yxdyc

Left some minor suggestions.

configs/config_all.yaml

data_juicer/config/config.py

data_juicer/ops/base_op.py

HYLcool added 5 commits October 31, 2023 17:33

+ add basic support to multimodal datasets

f4ede25

+ add image_aspect_ratio_filter and its unit test

ef6b5e2

Merge branch 'main' into dev/mm_fmt

5cdbae7

* update docs for image_aspect_ratio_filter

086fc31

+ add llava2dj and dj2llava tools

91f4ac4

* add more details in the error information when a dataset_path is invalid

HYLcool added the enhancement (New feature or request) and dj:multimodal (issues/PRs about multimodal data processing) labels Nov 2, 2023

HYLcool added this to the Multimodal Support milestone Nov 2, 2023

HYLcool self-assigned this Nov 2, 2023

This was linked to issues Nov 2, 2023

[MM] llava2dj & dj2llava tools #58

Closed

[MM] image_aspect_ratio_filter #57

Closed

This was linked to issues Nov 2, 2023

[MM enhancement] image data loading #56

Closed

[MM enhancement] support text-based interleaved multimodal data as the intermediate format #49

Closed

[enhance] refine the log info when input dataset_path is invalid #55

Closed

HYLcool added 3 commits November 2, 2023 20:28

* fix typos

1b97081

+ add new args into config_all.yaml

+ add args to control whether to convert the absolute paths in the processed dataset to the relative paths like the original dataset

5fc1f62

* avoid unaligned columns when converting and processing the dataset due to missing column or unaligned data type

473b088

HYLcool mentioned this pull request Nov 8, 2023

added auto-HPO feature with WandB #65

Merged

HYLcool requested review from chenhesen, yxdyc and zhijianma November 8, 2023 08:56

HYLcool added 2 commits November 8, 2023 20:23

Merge branch 'main' into dev/mm_fmt

cd817ab

# Conflicts:
#   data_juicer/format/formatter.py
#   data_juicer/ops/op_fusion.py

* avoid missing image_key error for text-only datasets

d0e69d7

+ support to export to json format (not jsonl!)

HYLcool requested a review from drcege November 10, 2023 07:29

* modify tool dir names to a detailed version

672839d

HYLcool requested a review from BeachWang November 10, 2023 07:56

yxdyc linked an issue Nov 13, 2023 that may be closed by this pull request

export_path can not be a folder #66

Closed


* unify the formation form for special tokens

23e4e3a

zhijianma approved these changes Nov 13, 2023


yxdyc reviewed Nov 13, 2023


* minor modifications

acaeb10

yxdyc approved these changes Nov 13, 2023


HYLcool merged commit 16d159f into main Nov 13, 2023
2 checks passed

HYLcool deleted the dev/mm_fmt branch November 15, 2023 03:22

HYLcool added the dj:op (issues/PRs about some specific OPs) label Jan 8, 2024
