[Draft] video-text data curation first ver. #826

AndyZhou952 · 2025-01-16T02:09:42Z

Currently supported:

Scene Detection via PySceneDetect
Deduplication via imagededup
Option Filtering/Caption Matching via CLIP score
OCR Filtering via MindOCR
PLLaVA Captioning via MindNLP (note: this will be replaced later; the updated version will NOT depend on MindNLP)
Aesthetic Scoring via LAION Aesthetic Scorer
Filtering Heuristic + Dataset Management tools
YAML config to run the whole pipeline in one go
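
For illustration only, the filtering-heuristic step could be sketched roughly as below. This is not the code in this PR: the `ClipRecord` schema, the field names, and the default thresholds are all hypothetical, chosen just to show how per-clip scores from the stages above might be combined into a keep/drop decision.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class ClipRecord:
    """Per-clip metadata row (hypothetical schema, not the PR's actual format)."""
    path: str
    aesthetic: Optional[float] = None   # aesthetic score, if that stage was run
    clip_score: Optional[float] = None  # CLIP text-video matching score, if run
    ocr_chars: Optional[int] = None     # characters detected by OCR, if run


def keep(
    rec: ClipRecord,
    min_aesthetic: float = 4.5,    # assumed threshold, for illustration
    min_clip_score: float = 0.2,   # assumed threshold, for illustration
    max_ocr_chars: int = 50,       # assumed threshold, for illustration
) -> bool:
    """Return True if the clip passes every threshold whose score is present.

    A missing score (None) is read as "stage not run" and never rejects a clip.
    """
    if rec.aesthetic is not None and rec.aesthetic < min_aesthetic:
        return False
    if rec.clip_score is not None and rec.clip_score < min_clip_score:
        return False
    if rec.ocr_chars is not None and rec.ocr_chars > max_ocr_chars:
        return False
    return True


def filter_records(records: Iterable[ClipRecord], **thresholds) -> List[ClipRecord]:
    """Keep only the records that pass all configured thresholds."""
    return [r for r in records if keep(r, **thresholds)]
```

Treating a missing score as "stage not run" lets the same function work when only some stages are enabled in the config; a stricter variant could reject clips with missing scores instead.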

TODOs:

Precision Check & Fix
Add better video-splitting techniques
Add deduplication enhancement (e.g. ISC model)
Support for more captioners
JSON support
Full data-parallel support (OCR working)
Provide a tutorial and a demo with sample videos

SamitHuang · 2025-01-24T08:34:28Z

tools/t2v_curation/README.md

+semantic consistency. You may refer to the 
+[Further Reading](#further-reading) section for more details.
+
+## Requirement:


SamitHuang · 2025-01-24T08:43:06Z

tools/t2v_curation/README.md

+
+ ![pipeline](./assets/data_pipeline_baseline.png)
+
+## Overview


Add a section showing the overall planned/supported features, e.g.:

video de-duplication

method1: ISC

method2: xxx

aesthetic filtering

motion filtering

NSFW filtering

multi-NPU processing

t2v data curation first ver.

d7f2a07

SamitHuang reviewed Jan 24, 2025


AndyZhou952 added 2 commits February 3, 2025 17:16

rm : in README

0c490d0

aesthetic scorer mindone clip substitution

2a23eee
