Dev/multimodal format & basic processing pipeline for multimodal datasets support #64
* add more details in the error information when a dataset_path is invalid
This PR is under internal testing now. Don't merge it until testing is done.
Update 2023/11/06: all pipeline validations have passed; the processed and converted datasets are acceptable to the target model.
+ add new args into config_all.yaml
…ocessed dataset to the relative paths like the original dataset
…due to missing column or unaligned data type
# Conflicts:
#   data_juicer/format/formatter.py
#   data_juicer/ops/op_fusion.py
+ support exporting to JSON format (not JSONL!)
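The distinction the last commit draws between JSON and JSONL can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names `export_json` and `export_jsonl` are hypothetical and are not taken from this PR's code.

```python
import json


def export_jsonl(samples, path):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")


def export_json(samples, path):
    """Write all samples as a single JSON array (plain JSON, not JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(list(samples), f, ensure_ascii=False, indent=2)
```

A JSONL file can be read back line by line, while the JSON file must be parsed as a whole with `json.load`. Tools that expect a single top-level list need the JSON form.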
LGTM
Left some minor suggestions.
llava2jd and dj2llava, which support converting LLaVA-like datasets to Data-Juicer format and back, for basic pipeline tests.
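As a rough illustration of what such a round-trip conversion involves, the sketch below flattens a LLaVA-like sample (an `id`, an optional `image`, and a `conversations` list of `from`/`value` turns) into a text-plus-images record and restores it. The target layout here (`text`, `images`, the `TURN_SEP` marker, the `[[role]]: ` prefix) and the function names are assumptions made for this example only; they are not the actual Data-Juicer schema or the tools added in this PR.

```python
# Hypothetical marker placed between turns; chosen so it is unlikely to
# appear inside a message. Not a real Data-Juicer token.
TURN_SEP = "\n<|turn|>\n"


def llava_to_flat(sample):
    """Flatten one LLaVA-like sample into a {'id', 'text', 'images'} record."""
    turns = [f"[[{t['from']}]]: {t['value']}" for t in sample["conversations"]]
    images = [sample["image"]] if "image" in sample else []
    return {"id": sample["id"], "text": TURN_SEP.join(turns), "images": images}


def flat_to_llava(record):
    """Rebuild the LLaVA-like sample from the flattened record."""
    conversations = []
    for part in record["text"].split(TURN_SEP):
        # Split only at the first "]]: " so the message body stays intact.
        role, value = part.split("]]: ", 1)
        conversations.append({"from": role[2:], "value": value})
    sample = {"id": record["id"], "conversations": conversations}
    if record["images"]:
        sample["image"] = record["images"][0]
    return sample
```

A round trip through both functions should return the original sample unchanged, which is the kind of check a basic pipeline test would rely on.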