release dj v0.2.0 (dj_video) (#227)
* release dj v0.2.0 (dj_video)
* authored by data-juicer team
yxdyc authored Mar 7, 2024
1 parent 475c52b commit 2720113
Showing 172 changed files with 11,515 additions and 1,040 deletions.
152 changes: 89 additions & 63 deletions README.md

Large diffs are not rendered by default.

144 changes: 82 additions & 62 deletions README_ZH.md

Large diffs are not rendered by default.

139 changes: 116 additions & 23 deletions configs/config_all.yaml

Large diffs are not rendered by default.

16 changes: 15 additions & 1 deletion configs/data_juicer_recipes/README.md
@@ -4,7 +4,7 @@ We found that there are still some "bad" samples in existing processed datasets

We use a simple 3-σ rule to set the hyperparameters for the ops in each recipe.
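Concretely, for each stat produced by the analyzer (e.g., perplexity, character-repetition ratio), the bounds mean ± 3σ over the whole dataset become the min/max hyperparameters of the corresponding filter op. A minimal sketch of how such a threshold could be derived, assuming the per-sample stats have been exported to a CSV (the file and column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical per-sample stats exported by the analyzer.
stats = pd.read_csv("analyzer_stats.csv")

def three_sigma_bounds(values: np.ndarray) -> tuple[float, float]:
    """Return (min, max) thresholds keeping samples within mean +/- 3 * std."""
    mean, std = float(values.mean()), float(values.std())
    return mean - 3 * std, mean + 3 * std

lo, hi = three_sigma_bounds(stats["char_rep_ratio"].to_numpy())
print(f"character_repetition_filter: max_ratio ~= {hi:.8f}")
```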

## Before and after refining for Pretraining Dataset
## Before and after refining for Pretraining Text Dataset

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|----------------------|:---------------------------:|:--------------:|:----------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
@@ -35,3 +35,17 @@ We use a simple 3-σ rule to set the hyperparameters for the ops in each recipe.
|------------------|:-------------------------:|:--------------------------------------:|:----------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-en-refined-by-data-juicer) | [39 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info) |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-zh-refined-by-data-juicer) | [28 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info) |

## Before and after refining for Multimodal Dataset

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |

### Evaluation Results
- LLaVA pretrain (LCS-558k): the model **pretrained with the refined dataset** and fine-tuned with the original instruction dataset outperforms the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.

| model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
14 changes: 14 additions & 0 deletions configs/data_juicer_recipes/README_ZH.md
@@ -35,3 +35,17 @@
|-------------------|:------------------------:|:----------------------------------:|:---------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------|
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-en-refined-by-data-juicer) | [39 Subsets of Alpaca-CoT](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-zh-refined-by-data-juicer) | [28 Subsets of Alpaca-CoT](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |

## Before and after refining for Multimodal Dataset

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |

### Evaluation Results
- LLaVA pretrain (LCS-558k): the model pretrained with the **refined pretraining dataset** and fine-tuned with the original instruction dataset outperforms the baseline LLaVA-1.5-13B on 10 out of 12 benchmarks.

| model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|---------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
60 changes: 60 additions & 0 deletions configs/data_juicer_recipes/llava-pretrain-refine.yaml
@@ -0,0 +1,60 @@
project_name: 'llava-1.5-pretrain-dataset-refine-recipe'
dataset_path: 'blip_laion_cc_sbu_558k_dj_fmt_only_caption.jsonl' # the converted LLaVA pretrain dataset in Data-Juicer format, with only_keep_caption set to True. See tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py
export_path: 'blip_laion_cc_sbu_558k_dj_fmt_only_caption_refined.jsonl'

np: 42 # number of subprocesses to process your dataset
text_keys: 'text' # the key name of the field storing the sample texts to be processed, e.g., `text`, `instruction`, `output`, ...

# for multimodal data processing
image_key: 'images' # Key name of field to store the list of sample image paths.
image_special_token: '<image>' # The special token that represents an image in the text. For LLaVA, it's "<image>". Should be aligned with the args when running conversion tools.
eoc_special_token: '<|__dj__eoc|>' # The special token that represents the end of a chunk in the text. By default, it's "<|__dj__eoc|>". You can specify your own special token according to your input dataset. Should be aligned with the args when running conversion tools.

open_tracer: true

# process schedule: a list of several process operators with their arguments
process:
- fix_unicode_mapper: # fix unicode errors in text.
- punctuation_normalization_mapper: # normalize unicode punctuation to English punctuation.

# 558128
# Filter ops
- alphanumeric_filter: #558087 # filter text with alphabet/numeric ratio out of specific range.
tokenization: false # Whether to count the ratio of alphanumeric to the total number of tokens.
min_ratio: 0.60 # the min ratio of filter range
- character_repetition_filter: #546105 # filter text with the character repetition ratio out of specific range
rep_len: 10 # repetition length for char-level n-gram
max_ratio: 0.09373663 # the max ratio of filter range
- flagged_words_filter: #543960 # filter text with the flagged-word ratio larger than a specific max value
lang: en # consider flagged words in what language
tokenization: false # whether to use model to tokenize documents
max_ratio: 0.0 # the max ratio to filter text
- perplexity_filter: #532029 # filter text with perplexity score out of specific range
lang: en # compute perplexity in what language
max_ppl: 14435.5806 # the max perplexity score to filter text
- special_characters_filter: #531968 # filter text with special-char ratio out of specific range
min_ratio: 0.16534802 # the min ratio of filter range
max_ratio: 0.42023757 # the max ratio of filter range
- word_repetition_filter: # 530773 # filter text with the word repetition ratio out of specific range
lang: en # sample in which language
tokenization: false # whether to use model to tokenize documents
rep_len: 10 # repetition length for word-level n-gram
max_ratio: 0.03085751 # the max ratio of filter range

- image_aspect_ratio_filter: #542389 # filter samples according to the aspect ratios of images (a fraction of width by height, r=w/h) in them
min_ratio: 0.333 # the min aspect ratio of filter range
max_ratio: 3.0 # the max aspect ratio of filter range
any_or_all: any # keep this sample when any/all images meet the filter condition
- image_shape_filter: #533966 # filter samples according to the widths and heights of images in them
max_width: 727.8798422276 # the max width of width filter range
max_height: 606.2421072264 # the max height of height filter range
any_or_all: any # keep this sample when any/all images meet the filter condition
- image_size_filter: # 533966 # filter samples according to the size of images (in bytes) within them
max_size: "124KB" # the max size of filter range
any_or_all: any # keep this sample when any/all images meet the filter condition
- image_text_similarity_filter: #544202 # filter samples according to the similarity between text and images.
hf_clip: openai/clip-vit-base-patch32 # name of used Hugging Face clip
min_score: 0.20315419 # the min similarity of filter range
- image_text_matching_filter: # filter samples according to the matching score between image and text.
hf_blip: Salesforce/blip-itm-base-coco # name of used Hugging Face blip
min_score: 0.44930778 # the min matching score of filter range
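For reference, this recipe consumes samples in Data-Juicer's intermediate multimodal format, where the `images` field lists the image paths and the `text` field interleaves the image special token with captions, ending each chunk with the eoc special token. A hypothetical input line (the path and caption are illustrative):

```json
{"images": ["images/sample_0001.jpg"], "text": "<image> a cat sitting on a windowsill. <|__dj__eoc|>"}
```

The refined dataset can then be produced by pointing the standard processing entry point at this config, e.g. `python tools/process_data.py --config configs/data_juicer_recipes/llava-pretrain-refine.yaml`.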
12 changes: 12 additions & 0 deletions data_juicer/config/config.py
@@ -149,6 +149,18 @@ def init_configs(args=None):
        help='The special token that represents an audio in the text. By '
             'default, it\'s "<__dj__audio>". You can specify your own special'
             ' token according to your input dataset.')
    parser.add_argument(
        '--video_key',
        type=str,
        default='videos',
        help='Key name of field to store the list of sample video paths.')
    parser.add_argument(
        '--video_special_token',
        type=str,
        default=SpecialTokens.video,
        help='The special token that represents a video in the text. By '
             'default, it\'s "<__dj__video>". You can specify your own special'
             ' token according to your input dataset.')
    parser.add_argument(
        '--eoc_special_token',
        type=str,
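Following the existing image and audio conventions, a video sample would list its file paths under the new `videos` field and mark each video's position in the text with the video special token. A hypothetical sample in Data-Juicer format (the path and caption are illustrative):

```json
{"videos": ["videos/sample_0001.mp4"], "text": "<__dj__video> a person slicing vegetables in a kitchen. <|__dj__eoc|>"}
```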
