[Draft] video-text data curation first ver. #826

Draft: wants to merge 3 commits into base: master


AndyZhou952 (Contributor) commented Jan 16, 2025

Currently supported:

  • Scene Detection via PySceneDetect
  • Deduplication via imagededup
  • Optional Filtering/Caption Matching via CLIP score
  • OCR Filtering via MindOCR
  • PLLaVA Captioning via MindNLP (note: this will be replaced later; the updated version will NOT depend on MindNLP)
  • Aesthetic Scoring via LAION Aesthetic Scorer
  • Filtering Heuristic + Dataset Management tools
  • YAML config to run the whole pipeline in one go
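
To illustrate how the filtering heuristic could combine the per-clip scores listed above, here is a minimal sketch. It is not the code in this PR: the `ClipRecord` fields, the config keys, and the threshold values are all hypothetical, and the plain dict only stands in for whatever the YAML config actually carries.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClipRecord:
    """Per-clip metadata. All field names are hypothetical."""
    path: str
    aesthetic: float       # e.g. output of an aesthetic scorer
    clip_score: float      # e.g. caption/frame similarity
    ocr_text_ratio: float  # e.g. fraction of the frame covered by detected text


# Hypothetical thresholds, standing in for values a YAML config might hold.
DEFAULT_CONFIG: Dict[str, float] = {
    "min_aesthetic": 4.5,
    "min_clip_score": 0.20,
    "max_ocr_text_ratio": 0.10,
}


def keep_clip(rec: ClipRecord, cfg: Dict[str, float]) -> bool:
    """A clip is kept only if it passes every threshold."""
    return (
        rec.aesthetic >= cfg["min_aesthetic"]
        and rec.clip_score >= cfg["min_clip_score"]
        and rec.ocr_text_ratio <= cfg["max_ocr_text_ratio"]
    )


def filter_clips(records: List[ClipRecord],
                 cfg: Dict[str, float] = DEFAULT_CONFIG) -> List[ClipRecord]:
    """Return the clips that pass all filters, preserving input order."""
    return [r for r in records if keep_clip(r, cfg)]
```

Keeping every threshold in one mapping means a single config file can switch the whole filtering stage without touching code, which is presumably the point of the one-go YAML run.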

TODOs:

  • Precision Check & Fix
  • Add better video-splitting techniques
  • Add deduplication enhancement (e.g. ISC model)
  • Support for more captioners
  • JSON support
  • Data parallel full support (OCR working)
  • Provide a tutorial and a sample-video demo
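
As background for the deduplication items, the following is a rough pure-Python sketch of hash-based near-duplicate detection using a difference hash and Hamming distance. It does not use the imagededup API and is not the method in this PR; the function names, the `max_distance` default, and the assumption that frames arrive as equally sized, already downscaled grayscale grids are all illustrative.

```python
from typing import List, Sequence

# A frame is a grid of grayscale values, assumed already downscaled
# and of identical size across all frames being compared.
Frame = Sequence[Sequence[int]]


def dhash(frame: Frame) -> int:
    """Difference hash: one bit per horizontally adjacent pixel pair."""
    bits = 0
    for row in frame:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def deduplicate(frames: List[Frame], max_distance: int = 2) -> List[int]:
    """Return indices of frames to keep.

    A frame is dropped when its hash lies within max_distance bits
    of the hash of any frame kept earlier.
    """
    kept: List[int] = []
    kept_hashes: List[int] = []
    for index, frame in enumerate(frames):
        h = dhash(frame)
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept.append(index)
            kept_hashes.append(h)
    return kept
```

The comparison is quadratic in the number of kept frames, which is fine for a sketch but is one reason a learned descriptor with an index (such as the ISC model mentioned in the TODOs) may scale better.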

semantic consistency. You may refer to the
[Further Reading](#further-reading) section for more details.

## Requirement:
Collaborator

Remove the trailing `:` from this heading.


![pipeline](./assets/data_pipeline_baseline.png)

## Overview
Collaborator

Add a section listing the overall planned/supported features, e.g.:

  • video de-duplication
    • method1: ISC
    • method2: xxx
  • aesthetic filtering
  • motion filtering
  • NSFW filtering
  • multi-NPU processing


2 participants