Dev/multimodal format & basic processing pipeline for multimodal datasets support #64

Merged 13 commits on Nov 13, 2023

@HYLcool HYLcool commented Nov 2, 2023

  • Support the basic Data-Juicer intermediate format for multimodal datasets. The format details will be documented later.
    • Support image loading, new configs, new special tokens, and a new OP abstraction
  • Add a basic image OP: image_aspect_ratio_filter. -- for basic pipeline tests
  • Add two format conversion tools: llava2dj and dj2llava, which convert LLaVA-like datasets to the Data-Juicer format and back. -- for basic pipeline tests
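
As a rough illustration of what an aspect-ratio filter such as image_aspect_ratio_filter checks, here is a minimal sketch. It is not the code from this PR: the function names, the `min_ratio`/`max_ratio` parameters and their defaults, and the choice to keep samples that have no images are all assumptions made for the example.

```python
from typing import Iterable, Tuple


def aspect_ratio(width: int, height: int) -> float:
    """Return width / height for an image measured in pixels."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    return width / height


def keep_sample(image_sizes: Iterable[Tuple[int, int]],
                min_ratio: float = 0.333,
                max_ratio: float = 3.0) -> bool:
    """Keep a sample only if every image has a ratio in [min_ratio, max_ratio].

    Hypothetical behaviour: a sample without images passes, because
    all() over an empty iterable is True.
    """
    return all(min_ratio <= aspect_ratio(w, h) <= max_ratio
               for w, h in image_sizes)
```

With these assumed defaults, a 640x480 image passes and a 4000x100 banner is filtered out.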

@HYLcool HYLcool added enhancement New feature or request dj:multimodal issues/PRs about multimodal data processing labels Nov 2, 2023
@HYLcool HYLcool added this to the Multimodal Support milestone Nov 2, 2023
@HYLcool HYLcool self-assigned this Nov 2, 2023
This was linked to issues Nov 2, 2023

HYLcool commented Nov 2, 2023

This PR is under internal testing now. Don't merge it until the testing is done.

  • ds2dj -> dj-process -> dj2ds pipeline validation
  • ds2dj -> dj-process -> dj2ds -> training pipeline validation
    • pretrain
    • finetuning

Update 2023/11/06: all pipeline validations have passed; the processed and converted datasets are accepted by the target model.
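
The ds2dj -> dj-process -> dj2ds check above can be pictured as a round-trip test: convert a sample to the intermediate format, convert it back, and compare it with the original. The sketch below assumes a LLaVA-like sample with `id`, `image`, and `conversations` fields; the intermediate layout (`text` plus `images`) and the `TURN_SEP` token are hypothetical stand-ins, since the real format is not specified in this PR.

```python
TURN_SEP = "<|turn|>"  # hypothetical separator token, not the real special token


def llava_to_intermediate(sample: dict) -> dict:
    """Flatten a LLaVA-like sample into a hypothetical text + images layout."""
    turns = [f"{t['from']}: {t['value']}" for t in sample["conversations"]]
    return {
        "id": sample["id"],
        "text": TURN_SEP.join(turns),
        "images": [sample["image"]],
    }


def intermediate_to_llava(sample: dict) -> dict:
    """Rebuild the LLaVA-like sample from the hypothetical layout."""
    conversations = []
    for turn in sample["text"].split(TURN_SEP):
        # Split on the first ": " only, so the value may contain colons.
        role, value = turn.split(": ", 1)
        conversations.append({"from": role, "value": value})
    return {
        "id": sample["id"],
        "image": sample["images"][0],
        "conversations": conversations,
    }


def round_trips(sample: dict) -> bool:
    """True if converting there and back reproduces the original sample."""
    return intermediate_to_llava(llava_to_intermediate(sample)) == sample
```

The round trip only holds under the stated assumptions: no turn text contains `TURN_SEP`, and no role name contains ": ".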

+ add new args into config_all.yaml
…ocessed dataset to the relative paths like the original dataset
…due to missing column or unaligned data type
# Conflicts:
#	data_juicer/format/formatter.py
#	data_juicer/ops/op_fusion.py
+ support exporting to json format (not jsonl!)
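
For readers unfamiliar with the json/jsonl distinction in the commit above: a json export is one document holding a list of samples, while a jsonl export holds one JSON object per line. The helper names below are hypothetical and use only the Python standard library; they are not the exporter added in this PR.

```python
import json
from typing import List


def to_json(samples: List[dict]) -> str:
    """Serialize all samples as a single JSON array."""
    return json.dumps(samples, indent=2)


def to_jsonl(samples: List[dict]) -> str:
    """Serialize samples as JSON Lines: one compact object per line."""
    return "".join(json.dumps(s) + "\n" for s in samples)


def from_jsonl(text: str) -> List[dict]:
    """Parse JSON Lines back into a list of samples, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

A json file must be parsed as a whole, whereas a jsonl file can be read line by line.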
@HYLcool HYLcool requested a review from drcege November 10, 2023 07:29
@HYLcool HYLcool requested a review from BeachWang November 10, 2023 07:56
@yxdyc yxdyc linked an issue Nov 13, 2023 that may be closed by this pull request
2 tasks

@zhijianma zhijianma left a comment

LGTM

@yxdyc yxdyc left a comment

Left some minor suggestions.

configs/config_all.yaml (outdated, resolved)
data_juicer/config/config.py (outdated, resolved)
data_juicer/ops/base_op.py (outdated, resolved)
@HYLcool HYLcool merged commit 16d159f into main Nov 13, 2023
2 checks passed
@HYLcool HYLcool deleted the dev/mm_fmt branch November 15, 2023 03:22
@HYLcool HYLcool added the dj:op issues/PRs about some specific OPs label Jan 8, 2024