Add chunking to ds_tool #97

liPatrick · 2024-08-23T00:33:08Z

Adding chunking to ds_tool so it is more resilient to processing/upload failures for larger datasets.

Dynamically chunks the dataset for processing and upload. A chunk that fails is split into smaller chunks, which recursively go through the same process. This repeats until the chunk_split_threshold is reached, at which point the chunk is uploaded as is.
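
A minimal sketch of the recursive splitting described above, not the code in this PR. The function name `process_and_upload`, the halving strategy, and the `flaky_upload` stand-in are all assumptions for illustration; only `chunk_split_threshold` comes from the description. In this sketch, a chunk at or below the threshold is attempted once with no further splitting, so any error there propagates to the caller.

```python
from typing import Callable, List


def process_and_upload(
    chunk: List[int],
    handle: Callable[[List[int]], None],
    chunk_split_threshold: int,
) -> None:
    """Try `handle` on the whole chunk; on failure, halve it and recurse."""
    if len(chunk) <= chunk_split_threshold:
        # Small enough: stop splitting and let any error propagate.
        handle(chunk)
        return
    try:
        handle(chunk)
    except Exception:
        # Hypothetical strategy: split the failed chunk into two halves.
        mid = len(chunk) // 2
        process_and_upload(chunk[:mid], handle, chunk_split_threshold)
        process_and_upload(chunk[mid:], handle, chunk_split_threshold)


uploaded: List[List[int]] = []


def flaky_upload(chunk: List[int]) -> None:
    """Stand-in for process + upload that fails on chunks larger than 4."""
    if len(chunk) > 4:
        raise RuntimeError("chunk too large")
    uploaded.append(list(chunk))


process_and_upload(list(range(10)), flaky_upload, chunk_split_threshold=2)
print(uploaded)
```

Here the 10-item chunk fails, both 5-item halves fail, and the resulting 2- and 3-item pieces succeed, so only the failing parts are ever retried.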

Note:
push_to_hub automatically shards each chunk into files of roughly 500 MB. So on Hugging Face the files will carry both a chunk and a shard designator, but we can just use a wildcard to set the path for our splits in the README, i.e. subset_name/split_name/**
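
As a rough illustration of the wildcard path in the note above, the helper below builds the kind of `configs` block a dataset README could carry. The helper name `readme_data_files` and the subset/split names are hypothetical, and the YAML layout follows the Hugging Face dataset card convention as I understand it; treat it as a sketch, not the README this PR generates.

```python
from typing import List


def readme_data_files(subset_name: str, split_names: List[str]) -> str:
    """Build a YAML `configs` snippet with one wildcard path per split."""
    lines = ["configs:", f"- config_name: {subset_name}", "  data_files:"]
    for split in split_names:
        lines.append(f"  - split: {split}")
        # The wildcard covers every chunk and shard file under the split.
        lines.append(f"    path: {subset_name}/{split}/**")
    return "\n".join(lines)


snippet = readme_data_files("my_subset", ["train", "validation"])
print(snippet)
```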

ultravox/data/text_proc.py

ultravox/tools/ds_tool/ds_tool.py

farzadab

Loved the DatasetChunkProcessor abstraction. It cleanly separates the concerns without increasing complexity in other areas of the code.

I left too many nit comments; feel free to ignore most, but 3 comments are very important:

1. I feel like _upload would rewrite the whole set, which is not desired.
2. I didn't fully see the "recursive" logic that I thought you mentioned in TGIF. I expected a different approach there.
3. Make sure dataset.py can handle the Error classes gracefully.

ultravox/tools/ds_tool/ds_tool.py

ultravox/data/datasets.py

ultravox/tools/ds_tool/ds_tool.py

farzadab

Thanks Pat.
Left some comments, but approved to unblock you.

ultravox/tools/ds_tool/ds_tool.py

ultravox/tools/ds_tool/chunked_dataset.py

liPatrick added 5 commits August 22, 2024 17:31

Added chunking

81bb206

Dynamic chunking

3ba9ff9

Raise error in jinja template

2ce70f3

Format errors

168275b

Fix test

daa2ce9

liPatrick marked this pull request as ready for review August 23, 2024 18:23

liPatrick requested review from zqhuang211 and farzadab August 23, 2024 18:27

Return sample when format is wrong

31a8a13

zqhuang211 reviewed Aug 23, 2024

ultravox/data/text_proc.py (outdated)

zqhuang211 reviewed Aug 23, 2024

ultravox/tools/ds_tool/ds_tool.py (outdated)

zqhuang211 reviewed Aug 23, 2024

ultravox/tools/ds_tool/ds_tool.py (outdated)

Remove template failures counter, doesn't work with multi-proc

b1f6d67

farzadab reviewed Aug 26, 2024

liPatrick added 2 commits August 26, 2024 16:30

Addressing comments

4163628

Handle text proc asr error in dataset.py

255c46d

liPatrick commented Aug 26, 2024

ultravox/tools/ds_tool/ds_tool.py (outdated)

liPatrick added 4 commits August 26, 2024 16:39

removing extra prints

f1236ce

Make process upload split recursive

eab6b80

Add more comments

4e42d58

More comments

93c25bf

farzadab reviewed Aug 27, 2024

ultravox/tools/ds_tool/ds_tool.py (outdated)

ultravox/tools/ds_tool/ds_tool.py (outdated)

Use None instead of empty quotes. Type issue resolved

1479d91