Enhance/mmc4 converting tools #91

Merged: 4 commits, Nov 29, 2023
14 changes: 14 additions & 0 deletions tools/multimodal/README.md
@@ -18,6 +18,7 @@ For now, dataset formats that are supported by Data-Juicer are listed in the fol
| Format | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
|------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
| LLaVA-like | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
| MMC4-like | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
| WavCaps-like | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |

For all tools, you can run the following command to see their usage:
@@ -93,6 +94,19 @@ and converted datasets, so we can regard this sample is aligned with the origina
]
```

#### MMC4-like

The format of MMC4-like datasets is defined [here](https://github.com/allenai/mmc4#documents). Besides `image_info` and `text_list`,
which are used when converting to Data-Juicer format, there is another important field, `similarity_matrix`. It is
a matrix of shape `len(image_info) x len(text_list)`, so it depends directly on the number of images and sentences and on their
order.
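
As an illustration of the shape constraint, the following minimal sketch checks whether a sample's `similarity_matrix` still has one row per image and one column per sentence. The helper name `similarity_matrix_is_aligned` and the toy sample values are assumptions made for this example; they are not part of the MMC4 tools.

```python
def similarity_matrix_is_aligned(sample):
    """Check that similarity_matrix is len(image_info) x len(text_list)."""
    n_images = len(sample["image_info"])
    n_texts = len(sample["text_list"])
    matrix = sample["similarity_matrix"]
    # One row per image, and every row must have one entry per sentence.
    return len(matrix) == n_images and all(len(row) == n_texts for row in matrix)


# Toy sample with made-up values: 2 images and 3 sentences.
sample = {
    "text_list": ["A dog runs.", "The sky is blue.", "Buy now!"],
    "image_info": [{"image_name": "dog.jpg"}, {"image_name": "sky.jpg"}],
    "similarity_matrix": [[0.31, 0.05, 0.02], [0.04, 0.28, 0.01]],
}
print(similarity_matrix_is_aligned(sample))

# Dropping a sentence without updating the matrix breaks the alignment.
filtered = dict(sample, text_list=sample["text_list"][:2])
print(similarity_matrix_is_aligned(filtered))
```

A shape check like this only detects a changed count. It cannot detect reordered or modified items, which also invalidate the matrix.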

However, when such datasets are processed with Data-Juicer, Filters might remove images or sentences from a sample, and
some Mappers might modify them. After processing, the similarity matrix might therefore no longer be aligned with `image_info` or `text_list`.
Keep this in mind if you need the matrix later.
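
If you need the matrix after filtering, one possible workaround is to re-slice it yourself. The sketch below assumes you have recorded which original image and sentence indices survived processing; the helper `subset_similarity_matrix` and that bookkeeping are hypothetical and are not provided by the conversion tools.

```python
def subset_similarity_matrix(matrix, kept_image_idx, kept_text_idx):
    """Keep only the rows (images) and columns (sentences) that survived."""
    return [[matrix[i][j] for j in kept_text_idx] for i in kept_image_idx]


# Made-up 2 x 3 matrix: 2 images (rows) by 3 sentences (columns).
matrix = [
    [0.31, 0.05, 0.02],
    [0.04, 0.28, 0.01],
]

# Suppose image 0 and sentence 2 were removed during processing.
kept_image_idx = [1]
kept_text_idx = [0, 1]

realigned = subset_similarity_matrix(matrix, kept_image_idx, kept_text_idx)
print(realigned)
```

Re-slicing only covers removals. If a Mapper changed the content of a sentence or image, the stored similarities no longer describe it and would have to be recomputed.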

Despite these extra fields, the MMC4 tools can perfectly convert MMC4-like datasets to Data-Juicer-format datasets and convert them back.

#### WavCaps-like

[WavCaps](https://github.com/XinhaoMei/WavCaps#dataset) is composed of four sub-datasets: [FreeSound](https://freesound.org/), [BBC Sound Effects](https://sound-effects.bbcrewind.co.uk/), [SoundBible](https://soundbible.com/) and [AudioSet Strongly-labelled Subset](https://research.google.com/audioset/download_strong.html). Each sub-dataset has different fields. For example, the `description` field is included in SoundBible but does not exist in AudioSet. To ensure that the sub-datasets can be merged properly after conversion, the union of all fields from the sub-datasets is used during the `wavcaps_to_dj` stage, and all fields are fully retained during the `dj_to_wavcaps` stage.
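
The field-union idea can be sketched as follows. This is an illustration only: the helper `unify_fields`, the sample records, and the choice to fill absent fields with `None` are assumptions for this example and may differ from what `wavcaps_to_dj.py` actually does.

```python
def unify_fields(records):
    """Give every record the union of all fields, filling gaps with None."""
    all_fields = []
    for record in records:
        for key in record:
            if key not in all_fields:  # preserve first-seen field order
                all_fields.append(key)
    return [{key: record.get(key) for key in all_fields} for record in records]


# Made-up records: the first has a 'description' field, the second does not.
records = [
    {"id": "sb_1", "caption": "A door creaks.", "description": "Old wooden door"},
    {"id": "as_1", "caption": "A dog barks."},
]

unified = unify_fields(records)
for row in unified:
    print(row)
```

With a shared schema, records from different sub-datasets can be concatenated into one dataset without per-source special cases.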
15 changes: 12 additions & 3 deletions tools/multimodal/README_ZH.md
@@ -12,9 +12,10 @@

Currently, the dataset formats supported by Data-Juicer are listed in the following table.

| Format | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
|--------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
| LLaVA-like | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
| MMC4-like | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
| WavCaps-like | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |

For all tools, you can run the following command to see their detailed usage:
@@ -76,6 +77,14 @@ python tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py --hel
]
```

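#### MMC4-like

The format of MMC4-like datasets is defined [here](https://github.com/allenai/mmc4#documents). Besides `image_info` and `text_list`, which are used when converting to Data-Juicer format, there is another important field, `similarity_matrix`. The similarity matrix has shape `len(image_info) x len(text_list)`, which means it depends heavily on the number of images and text sentences and on their order.

However, when these datasets are processed with Data-Juicer, images or sentences might be removed from a sample by Filter operators, and they might be modified by some Mapper operators. As a result, the similarity matrix might no longer be aligned with `image_info` or `text_list` after processing. Keep this in mind if you need the matrix later.

Apart from these extra fields, the tools for the MMC4-like format can perfectly convert MMC4-like datasets to Data-Juicer-format datasets and convert them back.

#### WavCaps-like
The [WavCaps](https://github.com/XinhaoMei/WavCaps#dataset) dataset is composed of four sub-datasets: [FreeSound](https://freesound.org/), [BBC Sound Effects](https://sound-effects.bbcrewind.co.uk/), [SoundBible](https://soundbible.com/) and [AudioSet Strongly-labelled Subset](https://research.google.com/audioset/download_strong.html). Each sub-dataset has different fields. For example, SoundBible includes the `description` field, which does not exist in AudioSet. To ensure that the sub-datasets can be merged properly after conversion, the union of all sub-dataset fields is used during the `wavcaps_to_dj` stage, and all fields are fully retained during the `dj_to_wavcaps` stage.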
```json
@@ -1,5 +1,5 @@
# This tool is used to convert multimodal dataset in Data-Juicer format to a
-# target dataset in LLaVA format.
+# target dataset in LLaVA-like format.
#
# Corresponding Data-Juicer format:
# - multi-chunk interleaved image-text sequence
@@ -101,7 +101,7 @@ def main(
extra argument original_llava_ds_path is required. When the processed
and converted dataset will be used on another machine, it's better to
set this argument to True. Default: False.
-:param original_llava_ds_path: path to the original unprocessed llava
+:param original_llava_ds_path: path to the original unprocessed LLaVA
dataset, which is used to help to recover the relative image paths for
better migration. Default: None.
"""